Lecture Notes in Networks and Systems 507
Kohei Arai Editor
Intelligent Computing Proceedings of the 2022 Computing Conference, Volume 2
Lecture Notes in Networks and Systems Volume 507
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas— UNICAMP, São Paulo, Brazil Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA Institute of Automation, Chinese Academy of Sciences, Beijing, China Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus Imre J. Rudas, Óbuda University, Budapest, Hungary Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science. For proposals from Asia please contact Aninda Bose ([email protected]).
More information about this series at https://link.springer.com/bookseries/15179
Editor Kohei Arai Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-031-10463-3 ISBN 978-3-031-10464-0 (eBook) https://doi.org/10.1007/978-3-031-10464-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
This edition of the proceedings series, “Intelligent Computing: Proceedings of the 2022 Computing Conference”, contains papers presented at the Computing Conference 2022, held virtually on the 14th and 15th of July 2022. We are delighted to announce that the complete conference proceedings were successfully executed through the will and co-operation of all its organizers, hosts, participants and all other contributors. The conference has been held every year since 2013, with the aim of providing an ideal platform for researchers to exchange ideas, discuss research results and present practical and theoretical applications in areas such as technology trends, computing, artificial intelligence, machine vision, security, communication, ambient intelligence and e-learning. The proceedings of the 2022 conference have been divided into two volumes, which cover a wide range of the abovementioned conference topics. This year the Computing Conference received a total of 498 papers from around the globe, out of which only 179 papers were selected to be published in the proceedings for this edition. All the published papers passed a double-blind review process by an international panel of at least three expert referees, and the decisions were taken based on research quality. We are very pleased to report that the quality of the submissions this year turned out to be very high. The conference featured single-track sessions covering research papers, posters and videos, followed by keynote talks by experts to stimulate significant contemplation and discussion. Moreover, all authors presented their research papers very professionally, and the presentations were viewed by a large international audience online. We are confident that all the participants and interested readers will benefit scientifically from this book and that it will have a significant impact on the research community in the longer term. Acknowledgment goes to the keynote speakers for sharing their knowledge and expertise with us.
A big thanks to the session chairs and the members of the technical program committee for their detailed and constructive comments, which were valuable for the authors to continue improving their papers. We are also indebted to the organizing committee for their invaluable assistance in ensuring that the conference was such a great success. We expect that the Computing Conference 2023 will be as stimulating as this most recent one was.
Kohei Arai
Contents
An Adaptive Geometry and Dual Graph Approach to Sign Prediction for Weighted and Signed Networks
Phuong Dong Tan Le, Nha Nam Ngoc Nguyen, and Dong Quan Ngoc Nguyen

A Stochastic Modified Limited Memory BFGS for Training Deep Neural Networks
Mahsa Yousefi and Ángeles Martínez Calomardo

Enhanced Deep Learning Framework for Fine-Grained Segmentation of Fashion and Apparel
Usman Ahmad Usmani, Ari Happonen, and Junzo Watada

Towards Tackling QSAT Problems with Deep Learning and Monte Carlo Tree Search
Ruiyang Xu and Karl Lieberherr

Laplacian Pyramid-like Autoencoder
Sangjun Han, Taeil Hur, and Youngmi Hur

Autonomous Vision-Based UAV Landing with Collision Avoidance Using Deep Learning
Tianpei Liao, Amal Haridevan, Yibo Liu, and Jinjun Shan

The Current State of the Art in Deep Learning for Image Classification: A Review
Adam Byerly, Tatiana Kalganova, and Richard Ott

Deep Convolutional Neural Networks for COVID-19 Detection from Chest X-Ray Images Using ResNetV2
Tomiris Rakhymzhan, Javad Zarrin, Mahdi Maktab-Dar-Oghaz, and Lakshmi Babu Saheer

Deep Neural Networks for Remote Sensing Image Classification
Giorgia Miniello, Marco La Salandra, and Gioacchino Vino

Linear Block and Convolutional MDS Codes to Required Rate, Distance and Type
Ted Hurley

A Review of Unsupervised Machine Learning Frameworks for Anomaly Detection in Industrial Applications
Usman Ahmad Usmani, Ari Happonen, and Junzo Watada

Causal Probabilistic Based Variational Autoencoders Capable of Handling Noisy Inputs Using Fuzzy Logic Rules
Usef Faghihi, Cyrus Kalantarpour, and Amir Saki

Multi-Object On-Line Tracking as an Ill-Posed Problem: Ensemble Deep Learning at the Edge for Spatial Re-identification
Vasanth Iyer and Asif Mehmood

An Ensemble-Based Machine Learning for Predicting Fraud of Credit Card Transactions
Tahani Baabdullah, Danda B. Rawat, Chunmei Liu, and Amani Alzahrani

Unsupervised Machine Learning Methods for City Vitality Index
Jean-Sébastien Dessureault, Jonathan Simard, and Daniel Massicotte

Machine Learning of a Pair of Charged Electrically Particles Inside a Closed Volume: Electrical Oscillations as Memory and Learning of System
Huber Nieto-Chaupis

Marlo’s Networks of Expectations
Marcos Bautista López Aznar, Guillermo Címbora Acosta, and Walter Federico Gadea

Complete Blood Analysis: An Android OCR-Based Interpretation
Malik Almaliki and Elsayed Atlam

Refined Optimal Control Problem and Its Solution Using Symbolic Regression
Askhat Diveev

Influences of Coating and Spandex Compositions of Conductive Textiles Used as Strain Sensors Using an Automated Test System
Stefan Wohlrab, Phillip Petz, Florian Eibensteiner, and Josef Langer

Problem Structuring Combined with Sentiment Analysis to Product-Service System Performance Management
Ingrid Saiala C. S. Feitosa and Luiz Cesar Ribeiro Carpinetti

Texture Transfer Attention for Realistic Image Completion
Yejin Kim, Manri Cheon, and Junwoo Lee

Examining Correlation Between Trust and Transparency with Explainable Artificial Intelligence
Arnav Kartikeya

Utilizing AI in Test Automation to Perform Functional Testing on Web Application
Dalia Alamleh

On the Modelling of Species Distribution: Logistic Regression Versus Density Probability Function
João Bioco, Paula Prata, Fernando Canovas, and Paulo Fazendeiro

Artificial Intelligence Tools for Actuator Fault Diagnosis of an Unmanned Underwater Vehicle
Paolo Castaldi, Saverio Farsoni, Massimiliano Menghini, and Silvio Simani

Applying the Delphi Method to Measure Enterprise Content Management Workflow System Performance
Hisham AbouGrad and Jon Warwick

A Fuzzy Epigenetic Model for Representing Degradation in Engineered Systems
Maria Seale, R. Cody Salter, Natàlia Garcia-Reyero, and Alicia Ruvinsky

A Voting Ensemble Technique for Gas Classification
M. Jaleel, A. Amira, and H. Malekmohamadi

Neural Networks with Superexpressive Activations and Integer Weights
Aleksandr Beknazaryan

Mask Compliance Detection on Facial Images
Lorenzo Garbagna, Holly Burrows, Lakshmi Babu-Saheer, and Javad Zarrin

Urban Tree Detection and Species Classification Using Aerial Imagery
Mahdi Maktab Dar Oghaz, Lakshmi Babu Saheer, and Javad Zarrin

Rectifying Homographies for Stereo Vision: Analytical Solution for Minimal Distortion
Pasquale Lafiosca and Marta Ceccaroni

Statistical Analysis of Electroencephalographic Signals in the Stimulation of Energy Data Visualizations
O. F. Kucukler, A. Amira, and H. Malekmohamadi

GCANet: A Cross-Modal Pedestrian Detection Method Based on Gaussian Cross Attention Network
Peiran Peng, Feng Mu, Peilin Yan, Liqiang Song, Hui Li, Yu Chen, Jianan Li, and Tingfa Xu

Automatic Classification of Felsic, Mafic, and Ultramafic Rocks in Satellite Images from Palmira and La Victoria, Colombia
Saulo Bosquez, Germán H. Alférez, Ana María Martínez Ardila, and Benjamin L. Clausen

SHAQ: Single Headed Attention with Quasi-recurrence
Sangeet Dandona, Warren Kushner, Nashwin Bharwani, and Ben Schreiber

Dynamic Topic Modeling Reveals Variations in Online Hate Narratives
Richard Sear, Nicholas Johnson Restrepo, Yonatan Lupu, and Neil F. Johnson

An Improved Bayesian TRIE Based Model for SMS Text Normalization
Abhinava Sikdar and Niladri Chatterjee

Dialog Act Segmentation and Classification in Vietnamese
Tho Chi Luong and Oanh Thi Tran

ConDef: Automated Context-Aware Lexicography Using Large Online Encyclopedias
Houjun Liu and Zachary Sayyah

On Sensitivity of Deep Learning Based Text Classification Algorithms to Practical Input Perturbations
Aamir Miyajiwala, Arnav Ladkat, Samiksha Jagadale, and Raviraj Joshi

A Survey of Artificial Intelligence Techniques for User Perceptions’ Extraction from Social Media Data
Sarang Shaikh, Sule Yildirim Yayilgan, Erjon Zoto, and Mohamed Abomhara

Social Media Self-expression as Form of Coping During the 2020 Pandemic Lockdown
Macrina P. Lazo and Christine Diane Ramos

Building Wikipedia N-grams with Apache Spark
Armin Esmaeilzadeh, Jorge Ramón Fonseca Cacho, Kazem Taghva, Mina Esmail Zadeh Nojoo Kambar, and Mahdi Hajiali

Selecting NLP Classification Techniques to Better Understand Causes of Mass Killings
Abigail Sticha and Paul Brenner

Sentiment Analysis on Citizenship Amendment Act of India 2019 Using Twitter Data
Shreya Vaghasia and Kalpdrum Passi

Sentiment Analysis on Depression Detection: A Review
Norma Mohamad Nor, Noorihan Abdul Rahman, Mohd Ridzwan Yaakub, and Zuriani Ahmad Zukarnain

Supervised Negative Binomial Classifier for Probabilistic Record Linkage
Harish Kashyap and Kiran Byadarhaly

A Recipe for Low-Resource NMT
Eryk Wdowiak

Natural Language Processing Using Database Context
Zheni Mincheva, Nikola Vasilev, Anatoliy Antonov, and Ventsislav Nikolov

Enriching Contextualized Representations with Biomedical Ontologies: Extending KnowBert to UMLS
Guilhem Piat, Nasredine Semmar, Alexandre Allauzen, Hassane Essafi, and Julien Tourille

One Step Beyond: Keyword Extraction in German Utilising Surprisal from Topic Contexts
J. Nathanael Philipp, Max Kölbl, Yuki Kyogoku, Tariq Yousef, and Michael Richter

Language Use and Susceptibility in Online Conversation
Lu Xiao, Qiyi Wu, Sucheta Soundarajan, and Jinfen Li

How Does the Thread Level of a Comment Affect its Perceived Persuasiveness? A Reddit Study
Lu Xiao and Humphrey Mensah

Ultra-Low-Power Range Error Mitigation for Ultra-Wideband Precise Localization
Simone Angarano, Francesco Salvetti, Vittorio Mazzia, Giovanni Fantin, Dario Gandini, and Marcello Chiaberge

The Pareto-Frontier-Based Stiffness of a Controller: Trade-off Between Trajectory Plan and Controller Design
Zhe Shen and Takeshi Tsuchiya

Remote Manipulation of a Robotic Arm with 6 DOF via IBSV Using a Raspberry Pi and Machine Vision
Sandro Balarezo, Xavier Arias, and Kevin Espín

Dynamic Analysis and Modeling of DNN-Based Visual Servoing Systems
Petar Durdevic and Daniel Ortiz-Arroyo

Implementation of a Balanced and Fluid Movement Six-Legged Spider Robot
Mustafa Ayad, Kwabena Boating, and Waley Zhang

The Applying of Low Order Frequency-Dependent Components in Signal Processing of Autonomous Mobile Robotic Platforms
Ivan Afanasyev, Valery Sytnikov, Oleg Strelsov, and Pavel Stupen

Run-Time Dependency Graph Models for Independently Developed Robotic Software Components
Vineet Nagrath and Christian Schlegel

A Raspberry Pi Computer Vision System for Self-driving Cars
Zach Isherwood and Emanuele Lindo Secco

Correction to: Intelligent Computing
Kohei Arai

Author Index
An Adaptive Geometry and Dual Graph Approach to Sign Prediction for Weighted and Signed Networks

Phuong Dong Tan Le (1, corresponding author), Nha Nam Ngoc Nguyen (2), and Dong Quan Ngoc Nguyen (1, 2)

(1) Department of Applied Mathematics, University of Waterloo, Waterloo, ON, Canada
[email protected]
(2) Sai Gon Joint Stock Commercial Bank, Ho Chi Minh City, Viet Nam
Abstract. In this paper, we propose a new SVM method for predicting the signs of edges in weighted and signed networks. Our method is based on the notions of the dual-graph operation and filtered neighborhoods of nodes in dual graphs, which allow us to introduce a geometric structure on the set of nodes of dual graphs and lead to a modified SVM method for predicting edge signs in weighted and signed networks. We test our method on several real datasets.

Keywords: Dual graph · Embedded SVM method · Filtered neighborhood · Weighted and signed network · Sign prediction
1 Introduction
Many social networks can be viewed as weighted and signed networks (WSNs). One of the main problems in signed social networks is to predict the signs of edges in WSNs (see [1,2]). Many real-world datasets, for example Bitcoin exchanges or Epinions, are explicit WSNs. Sign prediction in social networks is an active research area, and several papers are devoted to this problem. There are two main approaches to sign prediction in social networks. The first is based on social status theory (see [3]) or structural balance theory (see [4,5]), both of which are inspired by the triad relationships among nodes. The second approach (see, for example, [1]) is based on similarities between nodes, which can be measured using several similarity measures in social networks such as common neighbors, Jaccard coefficients, preferential attachment, and so on. The second approach carries a machine learning flavor and focuses a great deal on the underlying combinatorial structures of networks.

In this paper, we introduce a novel method that combines an adaptive graph structure of WSNs and the dual operation to predict edge signs in WSNs. The adaptive graph structure approach used in this paper is inspired by the work of Nguyen, Xing, and Lin [9,10]: we incorporate weight information from the training set to introduce a notion of filtered neighborhoods of nodes in dual graphs, which in turn provides a key tool in constructing an embedded SVM method for predicting the signs of edges in WSNs.

The structure of our paper is as follows. In Sect. 2, we introduce weighted and signed networks and their dual graphs. In Sect. 3, for each WSN and its dual graph, we introduce filtered neighborhoods of nodes in dual graphs, which incorporate weight information of the corresponding edges in the original graphs. In Sect. 4, based on the notion of filtered neighborhoods of nodes in dual graphs, we introduce an embedded SVM method and construct a predictive model for predicting the signs of edges in WSNs. In the last section, we apply our embedded SVM method to two real datasets, Bitcoin and Epinions, and describe our experimental results.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 1–8, 2022. https://doi.org/10.1007/978-3-031-10464-0_1
2 Dual Graphs
In this section, we introduce the notion of weighted and signed networks and their dual graphs. The main purpose of introducing dual graphs is to transform a machine learning problem about edges in a network into a machine learning problem about nodes in a dual network, which can be studied using the combinatorial and geometric structure of nodes in networks.

Definition 1. Let $G = (V, E)$ be a graph, where $V$ denotes the set of nodes of $G$ and $E$ the set of edges of $G$. Let $S : E \to \{-1, 1\}$ be a sign map that associates to each edge $e \in E$ a unique value $S(e) \in \{-1, 1\}$. We also write $G = (V, E; S)$ to denote a graph $G$ equipped with the sign map $S$. Such a graph is called a signed network (SN). The value $S(e)$ is called the sign of the edge $e$. The dual signed graph of $G$ is defined as $G^* = (V^*, E^*; S^*)$, where
(i) the set of nodes $V^*$ of $G^*$ is the set of edges $E$ of $G$, and each node in $V^*$ is represented by an edge $e = (v, w)$ in $E$, where $v, w$ are nodes of $G$;
(ii) the set of edges $E^*$ of $G^*$ is the set of pairs $(e_1, e_2)$, where $e_1, e_2$ are edges of $G$ that share a node in $G$;
(iii) the sign map $S^* : V^* \to \{\pm 1\}$ is defined by $S^*(e) = S(e)$ for each edge $e \in E$.

Definition 2. Let $G = (V, E)$ be a graph, where $V$ denotes the set of nodes of $G$ and $E$ the set of edges of $G$. Let $W : E \to [a, b]$ be a weight map, with $[a, b] \subset \mathbb{R}$, that associates to each edge $e \in E$ a unique real number $W(e) \in [a, b]$. We also write $G = (V, E; W)$ to denote a graph $G$ equipped with the weight map $W$. Such a graph is called a weighted network (WN). The value $W(e)$ is called the weight of the edge $e$. The dual weighted graph of $G$ is defined as $G^* = (V^*, E^*; W^*)$, where
(i) the set of nodes $V^*$ of $G^*$ is the set of edges $E$ of $G$, and each node in $V^*$ is represented by an edge $e = (v, w)$ in $E$, where $v, w$ are nodes of $G$;
(ii) the set of edges $E^*$ of $G^*$ is the set of pairs $(e_1, e_2)$, where $e_1, e_2$ are edges of $G$ that share a node in $G$;
(iii) the weight map $W^* : V^* \to [a, b]$ is defined by $W^*(e) = W(e)$ for each edge $e \in E$.

Remark 1. The dual operation transforms each weighted graph equipped with edge weights into a weighted graph equipped with node weights. This operation allows us to transform a machine learning problem about edge weights into a problem about node weights, which is more manageable.

Remark 2. Let $G = (V, E; W)$ be a weighted graph with weight map $W : E \to [a, b]$, and let $G^* = (V^*, E^*; W^*)$ be its weighted dual graph. In many real-world datasets, one can create a signed graph and its signed dual graph from $G$ and $G^*$ as follows. Let $c$ be a real number such that $a < c < b$. We define a sign map $S : E \to \{\pm 1\}$ as follows: if $e$ is an edge such that $c \le W(e) \le b$, then $S(e) = 1$; otherwise, if $a \le W(e) < c$, then $S(e) = -1$. Then the graph $G$ equipped with the sign map $S$ becomes a signed network, and the graph $G^*$ with the corresponding sign map $S^*$ as in Definition 1 becomes its dual signed graph. In our experiments, we begin with a weighted graph and its dual weighted graph, and then convert them into signed graphs as above. Throughout this paper, we denote such weighted and signed graphs by $G = (V, E; W, S)$ to indicate that the signed network structure of $G = (V, E; S)$ is induced from the weighted network structure of $G = (V, E; W)$.

In this paper, for a given graph $G = (V, E; S)$ with sign map $S : E \to \{\pm 1\}$, we use the dual graph $G^*$ and a modified SVM method from machine learning to predict the signs of edges in $G$. Our modified SVM method is based on a new notion of neighborhoods for nodes in the dual graph $G^*$ that was introduced by Nguyen, Xing, and Lin (see [9,10]).
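The dual construction of Definition 2 and the thresholding of Remark 2 can be illustrated with a short sketch. This is a minimal illustration on a hypothetical toy graph, not the authors' code: the function names `dual_graph` and `sign_from_weight`, the dictionary representation of edges, and the toy weights are all assumptions made for this example.

```python
from itertools import combinations


def dual_graph(weighted_edges):
    """Build the dual weighted graph of Definition 2.

    weighted_edges: dict mapping an edge (u, v) of G to its weight W(e).
    Returns (dual_nodes, dual_edges, dual_weights).
    """
    # Nodes of the dual graph are exactly the edges of G.
    dual_nodes = list(weighted_edges)
    # Two dual nodes are adjacent when the underlying edges share an endpoint.
    dual_edges = {
        (e1, e2)
        for e1, e2 in combinations(dual_nodes, 2)
        if set(e1) & set(e2)
    }
    # The dual weight map satisfies W*(e) = W(e).
    dual_weights = dict(weighted_edges)
    return dual_nodes, dual_edges, dual_weights


def sign_from_weight(weight, c):
    """Sign map of Remark 2: +1 if c <= weight, otherwise -1."""
    return 1 if weight >= c else -1


# Hypothetical toy graph with three edges and weights in [-10, 10].
W = {("a", "b"): 4.0, ("b", "c"): -2.0, ("c", "d"): 7.0}
nodes, edges, W_star = dual_graph(W)
S_star = {e: sign_from_weight(w, c=0.0) for e, w in W_star.items()}
```

In this toy graph the dual has three nodes and two edges, because `("a", "b")` and `("c", "d")` share no endpoint. The pairwise comparison is quadratic in the number of edges, which is acceptable for a sketch but would likely need an endpoint index for networks of realistic size.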
3 Neighborhoods of Nodes in Dual Graphs
In many real-world datasets, the data are represented as a graph $G = (V, E; S)$, but the sign map is incomplete in the sense that the values of $S$ are only known for a subset $E_0$ of $E$. More precisely, the restriction $S : E_0 \to \{\pm 1\}$ of $S$ to $E_0$ is known and serves as a training set for the signs of edges. The aim in analyzing such a dataset is to predict the sign of every edge $e$ that lies in $E$ but not in $E_0$.

For example, let $V = V_1 \cup V_2$ be a set consisting of two subsets, $V_1$ a group of products and $V_2$ a group of buyers. We consider $V$ as the set of nodes of a graph $G = (V, E; S)$ in which a pair $(v_1, v_2)$ with $v_1 \in V_1$ and $v_2 \in V_2$ forms an edge in $E$ if and only if buyer $v_2$ buys product $v_1$. An example of such a dataset is the collection of products and buyers on Amazon. To define the sign map $S : E \to \{\pm 1\}$, we declare that for each edge $e = (v_1, v_2)$ in $E$, $S(e) = 1$ if buyer $v_2$ evaluates product $v_1$ positively, and $S(e) = -1$ if buyer $v_2$ views product $v_1$ as a bad purchase. In real data, the values of $S$ are only known for a subset of edges $E_0$, obtained for example from a survey of buyers' evaluations of products over a period of time, from year $t_1$ to year $t_2$. The aim in studying such social networks is to predict the evaluations of buyers towards products after year $t_2$; more precisely, to predict the values $S(e)$ with $e \in E \setminus E_0$, where the edges in $E \setminus E_0$ correspond to the pairs $(v_1, v_2)$ of products and buyers that appear in the network after year $t_2$. This sign prediction problem plays an important role in social network analysis.

The signed graphs and their dual graphs under consideration in this paper are induced from weighted graphs. Motivated by the example above, our first aim is to introduce a notion of neighborhoods of nodes in the dual graphs, which in turn provides important information for predicting the signs of edges in the original graphs.

We fix a weighted graph $G = (V, E; W)$, where $W : E \to [a, b]$ is a weight map, and a subset $E_0$ of $E$ such that the values $W(e)$ are known for every $e \in E_0$. We suppose that the values $W(e)$ for $e \in E \setminus E_0$ are unknown. From the graph $G$, we construct the dual graph $G^* = (V^*, E^*; W^*)$ as in Definition 2. We define $V_0^* = E_0$, which is a subset of the nodes of the dual graph $G^*$. Since the values of $W$ are only known for edges in $E_0$, the values of the dual weight map $W^*$ are only known for the nodes in $V_0^*$. We introduce a notion of neighborhoods for nodes in the dual graph $G^*$ as follows.

Definition 3. Let $e = (u, v)$ be a node in the dual graph $G^*$, represented by an edge $e$ in the graph $G$, where $u, v$ are nodes in $G$. The neighborhood of $e$ is defined as
$$N(e; E_0) = \{ f \in V_0^* = E_0 \mid f = (u, w) \text{ or } f = (x, v) \text{ for some nodes } w, x \in V \};$$
in other words, the neighborhood of $e$ consists of the nodes $f \in V_0^*$ that share a vertex with $e$. Each node $f$ in $N(e; E_0)$ is called a neighbor of $e$.

For each node $e$ in the dual graph $G^*$, where $e$ is an edge in the original graph $G$, let $N(e; E_0)$ be the neighborhood of $e$ as above.
Set
$$A(e) = \frac{\sum_{f \in N(e; E_0)} W^*(f)}{\# N(e; E_0)};$$
that is, $A(e)$ is the average weight of all edges in the neighborhood of $e$, where $\# N(e; E_0)$ denotes the number of elements in the neighborhood of $e$. We introduce a filtered neighborhood of nodes in the dual graph $G^*$, which plays a key role in our embedded SVM method.

Definition 4. Let $t > 0$ be a constant that plays the role of a tuning parameter. For each node $e$ in the dual graph $G^*$, we define the filtered neighborhood of $e$ as
$$N^t_{\mathrm{fil}}(e; E_0) = \{ f \in N(e; E_0) : |W^*(f) - A(e)| \le t \}.$$
Remark 3. The original neighborhood $N(e; E_0)$ of a node $e$ in the dual graph $G^*$ may contain some nodes whose weights are much larger or smaller than the average weight $A(e)$; those nodes are viewed as outliers (or noise) in this setting. By introducing a tuning parameter $t > 0$ (which is chosen suitably, depending on the dataset), one can filter the neighbors of $e$ in $N(e; E_0)$ and keep only those nodes whose weights are close to the average weight $A(e)$, which helps remove outliers from the neighborhood of $e$.

Remark 4. Note that the definition of the neighborhood of a node in the dual graph depends on the set $V_0^* = E_0$, on which the values of $S$ and $S^*$ are known. Intuitively, in order to predict the signs of the edges in $G$, we transfer the sign information on $E_0$ to the signs of nodes in the dual graph. Based on the definition proposed above, we use the neighborhood information of each node in the dual graph to predict its sign, which in turn predicts the sign of the corresponding edge in the original graph.
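Definitions 3 and 4 can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names, the toy weights, and the value of the tuning parameter are hypothetical; edges are treated as sharing a vertex whenever their endpoint sets intersect, following the "in other words" reading of Definition 3; and an empty neighborhood, for which the average is undefined, is signalled with `None`.

```python
def neighborhood(e, known_weights):
    """N(e; E0): edges of E0 that share a vertex with e.

    known_weights maps each training edge in E0 to its known weight.
    Read literally, Definition 3 lets a training edge be its own neighbor.
    """
    return [f for f in known_weights if set(f) & set(e)]


def average_weight(e, known_weights):
    """A(e): mean weight over N(e; E0), or None if the neighborhood is empty."""
    nbrs = neighborhood(e, known_weights)
    if not nbrs:
        return None  # assumption: the paper does not say how to treat this case
    return sum(known_weights[f] for f in nbrs) / len(nbrs)


def filtered_neighborhood(e, known_weights, t):
    """Filtered neighborhood: neighbors whose weight is within t of A(e)."""
    avg = average_weight(e, known_weights)
    if avg is None:
        return []
    return [
        f for f in neighborhood(e, known_weights)
        if abs(known_weights[f] - avg) <= t
    ]


# Hypothetical training edges E0 with known weights.
known = {
    ("a", "b"): 2.0,
    ("b", "c"): 3.0,
    ("b", "d"): -9.0,  # far from the other neighbors, so it acts as an outlier
    ("x", "y"): 5.0,   # shares no vertex with the query edge
}
query = ("b", "e")  # an edge outside E0 whose sign is to be predicted
kept = filtered_neighborhood(query, known, t=5.0)
```

Here the three edges through `b` form the neighborhood of `query`, their average weight is $-4/3$, and with $t = 5$ the edge of weight $-9$ is filtered out, so `kept` holds the two remaining neighbors.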
4 Predictive Models for Sign Prediction
Based on the notion of filtered neighborhoods introduced in Sect. 3, we introduce an embedded SVM method to predict the signs of edges in a graph. We begin by fixing the notation used throughout this section. Let $G = (V, E; W)$ be a weighted network, where $W : E \to [a, b]$ is a weight map, and let $E_0 \subset E$ be such that the values $W(e)$ are known for each $e \in E_0$. Let $G^* = (V^*, E^*; W^*)$ be the dual graph of $G$ defined in Sect. 2, and let $V_0^* = E_0$ as in Sect. 3. Let $c \in \mathbb{R}$ be such that $a < c < b$. Let $S : E \to \{\pm 1\}$ and $S^* : V^* \to \{\pm 1\}$ be the sign maps of $E$ and $V^*$, respectively, as in Remark 2. We write $G = (V, E; W, S)$ and $G^* = (V^*, E^*; W^*, S^*)$ for the weighted and signed network $G$ and its dual weighted and signed network $G^*$ that are induced from these sign maps as in Remark 2.

The main aim of this section is to construct a predictive model for the values $S(e)$ with $e \in E \setminus E_0$, or equivalently for the values $S^*(e)$ with $e \in V^* \setminus V_0^*$. The latter is a machine learning problem about nodes instead of edges, which has some advantages, based on the notion of filtered neighborhoods of nodes in dual graphs as in Definition 4. Write $E_0 = \{f_1, \dots, f_r\}$, where $r$ is the number of elements in $E_0$.

4.1 Embedded SVM Method
Let G = (V, E; W, S) be the weighted and signed graph introduced above, and G∗ = (V ∗ , E ∗ ; W ∗ , S ∗ ) be the dual graph of G. Fix a tuning parameter t > 0.
P. D. T. Le et al.
We define a feature map $\mathcal{F} : V^* \to \mathbb{R}_{\ge 0}$ by $\mathcal{F}(e) = \# N^t_{\mathrm{fil}}(e; E_0)$, where $\mathbb{R}_{\ge 0}$ denotes the set of nonnegative real numbers and $N^t_{\mathrm{fil}}(e; E_0)$ is the filtered neighborhood of the node $e \in V^*$ as in Definition 4. Here $\# N^t_{\mathrm{fil}}(e; E_0)$ denotes the number of elements in the set $N^t_{\mathrm{fil}}(e; E_0)$. Using the feature map $\mathcal{F}$, each node in the dual graph is mapped to a nonnegative real number, and thus the nodes of the dual graph $G^*$ can be viewed as points in the reals $\mathbb{R}$. One can regard the geometry of $V^*$ as inherited from that of $\mathbb{R}$: two nodes in the dual graph $G^*$ are considered close to each other if the distance between the corresponding points in $\mathbb{R}$, given by the feature map $\mathcal{F}$, is small. Let $\kappa : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ denote a kernel function. There are many choices of kernel functions, such as linear, polynomial, or Gaussian radial basis kernels (see, for example, [6]). We propose the following predictive model $W^t : E \to \mathbb{R}$ for the weight map $W$:
$$W^t(e) = \alpha + \sum_{i=1}^{r} \beta_i W(f_i)\, \kappa(\mathcal{F}(e), \mathcal{F}(f_i)),$$
where the $f_i$ are the elements of $E_0$, and the coefficients $\alpha$, $\beta_i$ are unknown constants that need to be estimated using the training set $E_0 = \{f_1, \dots, f_r\}$. The predictive model for sign prediction for $e \in E \setminus E_0$ is given by $\tilde{S}^t(e) = \operatorname{sign}(W^t(e))$, where the sign function $\operatorname{sign}(\cdot)$ is defined by
$$\operatorname{sign}(s) = \begin{cases} 1 & \text{if } s > 0, \\ -1 & \text{if } s \le 0. \end{cases}$$
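To make the predictive model concrete, the following minimal Python sketch evaluates $W^t(e)$ and $\tilde{S}^t(e)$ for given coefficients. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian bandwidth and the toy values of $\alpha$, $\beta_i$, features and weights are all hypothetical, and the estimation of $\alpha$ and $\beta_i$ (the SVM fitting step) is not shown.

```python
import math

def gaussian_kernel(u, v, bandwidth=1.0):
    """Gaussian radial basis kernel on the reals (bandwidth is an assumed default)."""
    return math.exp(-((u - v) ** 2) / (2.0 * bandwidth ** 2))

def predict_weight(feat_e, train_feats, train_weights, alpha, betas,
                   kernel=gaussian_kernel):
    """Evaluate W^t(e) = alpha + sum_i beta_i * W(f_i) * kappa(F(e), F(f_i))."""
    return alpha + sum(
        b * w * kernel(feat_e, f)
        for b, w, f in zip(betas, train_weights, train_feats)
    )

def predict_sign(weight):
    """sign(s) = 1 if s > 0, and -1 if s <= 0, as defined in the text."""
    return 1 if weight > 0 else -1

# Hypothetical toy data: filtered-neighborhood sizes F(f_i) and weights W(f_i).
train_feats = [3.0, 1.0, 4.0]
train_weights = [5.0, -8.0, 2.0]
betas = [0.5, 0.5, 0.5]   # assumed already estimated
alpha = 0.0               # assumed already estimated

w_hat = predict_weight(3.0, train_feats, train_weights, alpha, betas)
s_hat = predict_sign(w_hat)
```

With these toy values the training edge whose feature equals that of the query node dominates the sum, so the predicted sign follows its positive weight.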
5 Experimental Results for Real Datasets
In this section, we apply the embedded SVM method of Sect. 4 to predict the signs of edges in two real weighted and signed networks: the Bitcoin OTC network and the Epinions dataset (see Table 1 for their descriptions). Below we first describe each network.
– Bitcoin OTC. This is a weighted signed directed network of people who trade using Bitcoin on a platform called Bitcoin OTC. The dataset is available at http://snap.stanford.edu/data/soc-sign-bitcoin-otc.html (see [7]). Since users are anonymous, it is necessary to maintain a record of users' reputation to prevent transactions with fraudulent and risky users. Members of Bitcoin OTC rate other members' level of trustworthiness on a scale from $a = -10$ (total distrust) to $b = +10$ (total trust). In this network, we choose $c = 0$, and declare that if the weight of an edge is between 0 and 10, its sign is +1, and if the edge's weight is less than 0, its sign is −1.
An Adaptive Geometry and Dual Graph Approach to Sign Prediction
– Epinions. This dataset was collected by Paolo Massa in a 5-week crawl (November/December 2003) from the Epinions.com Website (see the dataset at http://www.trustlet.orgdownloaded epinions.html) (see [8]). In Epinions, each user rates the helpfulness of a review on a 1–5 scale, where $a = 1$ means totally not helpful and $b = 5$ means totally helpful. We choose $c = 3$, and declare that if the weight of an edge is between 3 and 5, its sign is +1, and if the edge's weight is less than 3, its sign is −1.

Table 1. Descriptions of datasets
Network       Nodes     Edges     Edges with sign +1
Bitcoin OTC   5881      35592     31676
Epinions      131828    841372    717667
In each dataset above, we use 80% of the set of edges as the training set, that is, as the subset $E_0$ in Sect. 4, and predict the signs of the remaining 20% of the edges using the embedded SVM method. We try several tuning parameters $t > 0$ and choose the one that provides the smallest prediction error. Similarly, we apply several kernel functions $\kappa$ as in [6], which result in similar errors in sign prediction. To assess the accuracy of our prediction method, we use the mean absolute error (MAE) and the root mean square error (RMSE). Table 2 lists the results of predicting edge signs in the two networks, Bitcoin OTC and Epinions. Each cell in Table 2 reports a pair of numbers (MAE, RMSE). In conclusion, based on the experimental results, our embedded SVM method attains an MAE below 0.05 and an RMSE below 0.15 on both test datasets.

Table 2. Results of predicting edge signs
Network       Embedded SVM (MAE, RMSE)
Bitcoin OTC   (0.044, 0.144)
Epinions      (0.025, 0.069)
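For reference, the two error measures reported in Table 2 can be computed as in the short Python sketch below. The text does not specify which quantities (raw signs, predicted weights, or normalized values) enter the errors, so the toy vectors here are purely hypothetical and only illustrate the standard definitions of MAE and RMSE.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(
        sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
    )

# Hypothetical toy example: one of four signs is predicted incorrectly.
true_signs = [1, -1, 1, 1]
pred_signs = [1, 1, 1, 1]
toy_mae = mae(true_signs, pred_signs)
toy_rmse = rmse(true_signs, pred_signs)
```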
References
1. Tang, J., Chang, Y., Aggarwal, C., Liu, H.: A survey of signed network mining in social media. ACM Comput. Surv. 49(3), 42 (2016)
2. Khodadadi, A., Jalili, M.: Sign prediction in social networks based on tendency rate of equivalent micro-structures. Neurocomputing 257, 175–184 (2017)
3. Lu, C., Yu, J.X., Li, R.-H., Wei, H.: Exploring hierarchies in online social networks. IEEE Trans. Knowl. Data Eng. 28(8), 2086–2100 (2016)
4. Wu, Z., Aggarwal, C.C., Sun, J.: The troll-trust model for ranking in signed networks. In: Proceeding of 9th ACM International Conference on Web Search Data Mining, pp. 447–456 (2016)
5. Cartwright, D., Harary, F.: Structural balance: a generalization of Heider's theory. Psychol. Rev. 63(5), 277 (1956)
6. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. SSS, Springer, New York (2009). https://doi.org/10.1007/978-0-387-84858-7
7. Kumar, S., Spezzano, F., Subrahmanian, V., Faloutsos, C.: Edge weight prediction in weighted signed networks. In: 2016 IEEE 16th International Conference on Data Mining, ICDM, pp. 221–230. IEEE (2016)
8. Massa, P., Avesani, P.: Trust-aware recommender systems. In: Proceedings of the 2007 ACM Conference on Recommender Systems, pp. 7–24 (2007)
9. Nguyen, D.Q.N., Xing, L., Lin, L.: Weight prediction for variants of weighted directed networks. In: 2020 IEEE International Conference on Big Data (2020)
10. Nguyen, D.Q.N., Xing, L., Lin, L.: Community detection, pattern recognition, and hypergraph-based learning: approaches using metric geometry and persistent homology. In: Fuzzy Systems and Data Mining VI, Proceedings of Frontiers in Artificial Intelligence and Applications, pp. 457–473 (2020)
A Stochastic Modified Limited Memory BFGS for Training Deep Neural Networks
Mahsa Yousefi and Ángeles Martínez Calomardo(B)
Department of Mathematics and Geosciences, University of Trieste, Trieste, Italy
[email protected], [email protected]
Abstract. In this work, we study stochastic quasi-Newton methods for solving the non-linear and non-convex optimization problems arising in the training of deep neural networks. We consider the limited memory Broyden-Fletcher-Goldfarb-Shanno (BFGS) update in the framework of a trust-region approach. We provide an almost comprehensive overview of recent improvements in quasi-Newton based training algorithms, such as accurate selection of the initial Hessian approximation, efficient solution of the trust-region subproblem to high accuracy with a direct method, and an overlap sampling strategy that ensures stable quasi-Newton updating by computing gradient differences on this overlap. We compare the standard L-BFGS method with a variant of this algorithm based on a modified secant condition, which is theoretically shown to provide an increased order of accuracy in the approximation of the curvature of the Hessian. In our experiments, both quasi-Newton updates exhibit comparable performance. Our results show that, with a fixed computational time budget, the proposed quasi-Newton methods provide comparable or better testing accuracy than the state-of-the-art first-order Adam optimizer.
Keywords: Quasi-Newton methods · Limited memory BFGS · Trust region · Stochastic optimization · Deep neural networks
1 Introduction
Deep learning has become the leading technique for solving large-scale machine learning problems. After a prolonged slow start, the advent of higher computational power and the introduction of GPU computing have made it possible to train neural networks with a large number of layers, which have shown impressive efficacy in image classification, natural language processing and text analytics, speech recognition and reinforcement learning, among other fields. Deep learning problems are often posed as highly nonlinear and often non-convex unconstrained optimization problems. For instance, in image classification using a training dataset $\{(x_i, y_i)\}_{i=1}^{N}$ in $C$ classes with input $x_i \in \mathbb{R}^d$ and
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 9–28, 2022. https://doi.org/10.1007/978-3-031-10464-0_2
M. Yousefi and Á. Martínez Calomardo
target $y_i \in \mathbb{R}$, training a deep neural network refers to solving an empirical risk minimization (ERM) problem that can be formulated as follows:
$$\min_{w \in \mathbb{R}^n} F(w) := \frac{1}{N} \sum_{i=1}^{N} f_i(w), \tag{1}$$
where $w \in \mathbb{R}^n$ is the vector of trainable parameters, $N$ is the number of observations in the training dataset and $f_i(w) := f(w; x_i, y_i)$ is a loss function quantifying the prediction error for the $i$th observation of the training dataset.
Finding an efficient optimization algorithm for (1) has attracted many researchers, and a number of algorithms have been proposed in both the machine learning and the optimization literature. In large-scale machine learning problems (i.e. large $n$ and $N$), computing the loss function $F(w)$ and the gradient $\nabla F(w)$ is expensive and computing the true Hessian $\nabla^2 F(w)$ is not practical. Stochastic first-order methods have therefore been widely used in many DL applications due to their low per-iteration cost, optimal complexity, easy implementation and proven efficiency in practice. The preferred method is stochastic gradient descent (SGD) [6,37], together with its variance-reduced [12,20,38] and adaptive [13,21] variants. However, because they use only first-order gradient information, these methods come with several issues, such as relatively slow convergence and high sensitivity to the choice of hyper-parameters (e.g., step length and batch size). First-order methods can also have difficulty escaping saddle points [43], and they exhibit limited benefits from parallelism due to their usual implementation with small mini-batches [24].
On the other hand, second-order methods can often find good minima in fewer steps due to their use of curvature information. The main second-order method incorporating the inverse Hessian matrix is Newton's method [34], which computes the next iterate as $w_{k+1} = w_k - \eta \nabla^2 F(w_k)^{-1} \nabla F(w_k)$. However, Newton's method presents serious computational and memory challenges in computing the Hessian.
Moreover, using exact Hessians results in algorithms that produce sequences moving towards saddle points, as Newton's method encourages rapid local convergence towards any stationary point regardless of the curvature [11,23]. Quasi-Newton and Hessian-free methods are two techniques aimed at incorporating second-order information without computing and storing the true Hessian matrix. Hessian-free methods attempt to find an approximate Newton direction $\nabla^2 F(w_k)^{-1} \nabla F(w_k)$ using conjugate gradient (CG) methods [4,28]. Nevertheless, whether true Hessian matrix-vector products or subsampled variants of them (see e.g. [42]) are used, the iteration complexity of a (modified) CG algorithm is significantly greater than that of a limited memory quasi-Newton method. In fact, quasi-Newton methods and their limited memory variants [34] attempt to combine the speed of Newton's method and the scalability of first-order methods. They construct Hessian approximations using only gradient information and exhibit superlinear convergence. Quasi-Newton and stochastic quasi-Newton methods for solving large optimization problems arising in machine learning have recently been extensively
A Stochastic Modified LBFGS-TR for Training DNNs
considered within the context of convex and non-convex optimization. For instance, a stochastic Broyden-Fletcher-Goldfarb-Shanno (BFGS) method and its limited memory variant (L-BFGS) were proposed for online convex optimization in [39]. Another stochastic L-BFGS method for solving strongly convex problems was proposed in [9]; it uses sampled Hessian-vector products rather than gradient differences, and was proved in [33] to be linearly convergent when combined with the variance reduction technique SVRG [20] to alleviate the effect of noisy gradients. A closely related variance-reduced block L-BFGS method was proposed in [19]. A regularized stochastic BFGS method was proposed in [30], and an online L-BFGS method was proposed in [31] for strongly convex problems and extended in [27] to incorporate SVRG variance reduction. For the solution of non-convex optimization problems arising in deep learning, a damped L-BFGS method incorporating SVRG variance reduction was developed and its convergence properties were studied in [40]. Stochastic quasi-Newton methods use a subsampled Hessian approximation and/or a subsampled gradient. Some of these stochastic quasi-Newton algorithms employ fixed-size batches and compute stochastic gradient differences in a stable way, originally proposed in [39], by using the same batch at the beginning and at the end of the iteration. Since this can potentially double the iteration complexity, an overlap batching strategy was proposed in [2] to reduce the computational cost and was also tested in [3]. This strategy was further applied in [14,35]. Other stochastic quasi-Newton methods employ a progressive batching approach in which the sample size is increased as the iteration progresses, see e.g. [5] and references therein. Recently, a Kronecker-factored block diagonal BFGS and L-BFGS method that takes advantage of the structure of feed-forward DNN training problems was proposed in [17].
The main contribution of this work is as follows. Like most of the previously cited related works, we consider a limited memory variant of BFGS (L-BFGS), one of the most popular quasi-Newton updates in Broyden's class. We consider a stochastic variant of L-BFGS obtained by fixed-size subsampling. We also study a modified L-BFGS update obtained through a modified secant condition, which is theoretically shown to provide an increased order of accuracy in the approximation of the curvature of the Hessian. Both the original and the modified L-BFGS quasi-Newton methods are used in a trust-region framework. We provide an almost comprehensive overview of recent improvements in quasi-Newton based training algorithms, such as accurate selection of the initial Hessian approximation, efficient solution of the trust-region subproblem to high accuracy with a direct method, and an overlap sampling strategy that ensures stable quasi-Newton updating by computing gradient differences on this overlap. We examine the behaviour of the studied quasi-Newton methods in the training of deep convolutional neural networks in a supervised learning application, image classification, and provide a comparison with a state-of-the-art first-order method, Adam [21].
This paper is organized as follows. We provide an overview of the (limited memory) BFGS method in Sect. 2. In Sect. 3 we introduce a modified L-BFGS
update obtained by imposing a different secant condition. We describe the use of the modified L-BFGS method in a trust-region framework and its stochastic variant in Sects. 4 and 5, respectively. Numerical results are reported in Sect. 6. Finally, some conclusions of this study are given in Sect. 7.
2 An Overview on the L-BFGS Update
2.1 The BFGS Update
The BFGS update as a Hessian approximation has the following general form
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}, \qquad k = 0, 1, \ldots, \tag{2}$$
and satisfies the standard secant condition
$$B_{k+1} s_k = y_k, \tag{3}$$
where $s_k = p_k$ and $y_k = \nabla F(w_t) - \nabla F(w_k)$. The vector $p_k$ is the search direction at iteration $k$ and can be obtained in many different ways, for instance using a trust-region framework [10], which proposes a trial point
$$w_t = w_k + p_k. \tag{4}$$
The BFGS updates (2), which use only gradient information to incorporate curvature information, generate symmetric positive definite matrices, i.e. $B_{k+1} \succ 0$, whenever the initial approximation $B_0 = \gamma_k I$ has the same property and the curvature condition $s_k^T y_k > 0$ holds. In this work, we skip updating $B_k$ if the following curvature condition is not satisfied for some small value of $\epsilon_2 > 0$:
$$s_k^T y_k > \epsilon_2 \|s_k\|^2. \tag{5}$$
2.2 The L-BFGS Update and Its Compact Form
For large-scale optimization problems, the limited-memory BFGS (L-BFGS) update is more efficient. In fact, for $k \ge r$, the $r$ most recently computed pairs are stored in the matrices
$$S_k := [\, s_{k-r} \ \ldots \ s_{k-1} \,], \qquad Y_k := [\, y_{k-r} \ \ldots \ y_{k-1} \,]. \tag{6}$$
Using (6), the L-BFGS matrix $B_k$ in (2) can be represented in the following compact form [34]
$$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad k = 1, 2, \ldots, \tag{7}$$
where $B_0 = \gamma_k I \succ 0$ and
$$\Psi_k = [\, B_0 S_k \ \ Y_k \,], \qquad M_k = \begin{bmatrix} -S_k^T B_0 S_k & -L_k \\ -L_k^T & D_k \end{bmatrix}^{-1}. \tag{8}$$
In (8), the matrices $L_k$, $U_k$ and $D_k$ are respectively the strictly lower triangular part, the strictly upper triangular part and the diagonal part of the matrix splitting
$$S_k^T Y_k = L_k + D_k + U_k. \tag{9}$$
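As a sanity check on the formulas of this section, the following NumPy sketch applies the recursive BFGS update (2) with the skipping rule (5) to a few pairs $(s_j, y_j)$ and compares the result with the compact form (7)-(8). It is a minimal illustration under assumptions: the function names, the dimensions, the value $\gamma = 1.5$ and the synthetic quadratic used to generate the pairs are hypothetical and not taken from the paper.

```python
import numpy as np

def bfgs_update(B, s, y, eps2=1e-2):
    """One BFGS update (2); skipped if curvature condition (5) fails."""
    sy = s @ y
    if sy <= eps2 * (s @ s):
        return B  # skip the update
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy

def compact_lbfgs(S, Y, gamma):
    """Dense B = B0 + Psi M Psi^T from (7)-(8) with B0 = gamma * I."""
    StY = S.T @ Y
    L = np.tril(StY, k=-1)            # strictly lower triangular part
    D = np.diag(np.diag(StY))         # diagonal part
    Psi = np.hstack([gamma * S, Y])   # [B0 S, Y]
    M_inv = np.block([[-gamma * (S.T @ S), -L],
                      [-L.T, D]])
    M = np.linalg.inv(M_inv)
    return gamma * np.eye(S.shape[0]) + Psi @ M @ Psi.T

# Synthetic pairs from a strictly convex quadratic, so that s^T y > 0.
rng = np.random.default_rng(0)
n, r, gamma = 6, 3, 1.5
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)           # symmetric positive definite "Hessian"
S = rng.standard_normal((n, r))
Y = A @ S

B_rec = gamma * np.eye(n)
for j in range(r):
    B_rec = bfgs_update(B_rec, S[:, j], Y[:, j])
B_cmp = compact_lbfgs(S, Y, gamma)
```

Here the dense matrix is formed only for checking; in a limited-memory setting one would keep $\Psi_k$ and $M_k^{-1}$ and never build $B_k$ explicitly.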
2.3 The Initialization of the L-BFGS Update
The initial matrix $B_0$ is often set to some multiple of the identity matrix. A heuristic and conventional method to choose this multiple is
$$\gamma_k = \frac{y_{k-1}^T y_{k-1}}{y_{k-1}^T s_{k-1}} := \gamma_k^h. \tag{10}$$
The quotient in (10) is an approximation to an eigenvalue of $\nabla^2 F(w_k)$ and appears to be the most successful method, in practice, to generate initial Hessian approximations [34]. However, in non-convex DL optimization, $\gamma_k$ should be chosen carefully to avoid introducing false negative curvature [14,35], that is, to avoid $p_k^T B_k p_k < 0$ while $p_k^T \nabla^2 F(w_k) p_k > 0$. To this end, an extra condition can be imposed on $\gamma_k$: the hyper-parameter $\gamma_k$ is selected in $(0, \hat{\lambda})$, where $\hat{\lambda}$ is the smallest eigenvalue of the generalized eigenvalue problem
$$(L_k + D_k + L_k^T) u = \lambda S_k^T S_k u, \tag{11}$$
with $L_k$ and $D_k$ defined in (9). If $\hat{\lambda} \le 0$, then $\gamma_k$ can be set to $\gamma_k^h$.
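The selection rule for $\gamma_k$ can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the generalized eigenvalue problem (11) is reduced to a standard symmetric one through a Cholesky factor of $S_k^T S_k$ (which assumes $S_k$ has full column rank), and the safeguard `max(1, ...)` with the factor 0.9 is an assumed choice for picking a value related to $\hat{\lambda}$. All names and test data are hypothetical.

```python
import numpy as np

def initial_gamma(S, Y, floor=1.0, shrink=0.9):
    """Return (gamma, lam_hat) following the rule around (10)-(11)."""
    StY = S.T @ Y
    L = np.tril(StY, k=-1)
    D = np.diag(np.diag(StY))
    A = L + D + L.T                      # symmetric left-hand matrix of (11)
    C = np.linalg.cholesky(S.T @ S)      # S^T S = C C^T, needs full column rank
    # C^{-1} A C^{-T} has the same eigenvalues as the pencil (A, S^T S).
    T = np.linalg.solve(C, np.linalg.solve(C, A).T).T
    lam_hat = float(np.linalg.eigvalsh(0.5 * (T + T.T))[0])
    s_last, y_last = S[:, -1], Y[:, -1]  # most recent pair (s_{k-1}, y_{k-1})
    gamma_h = float(y_last @ y_last) / float(y_last @ s_last)   # eq. (10)
    if lam_hat > 0:
        gamma = max(floor, shrink * lam_hat)
    else:
        gamma = max(floor, gamma_h)
    return gamma, lam_hat

# Hypothetical test data: pairs generated by a convex quadratic with Hessian H.
rng = np.random.default_rng(0)
n, r = 8, 3
H = rng.standard_normal((n, n))
H = H @ H.T + 2.0 * np.eye(n)
S = rng.standard_normal((n, r))
Y = H @ S
gamma, lam_hat = initial_gamma(S, Y)
```

For a quadratic, $S_k^T Y_k$ is symmetric, so $\hat{\lambda}$ is a generalized Rayleigh quotient of $H$ and must lie between its extreme eigenvalues, which gives a simple check of the sketch.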
3 A Modified L-BFGS Update
A modified BFGS update, and consequently a modified L-BFGS algorithm, can be obtained by replacing (3) with a modified secant condition
$$B_{k+1} s_k = y_k^*, \tag{12}$$
where $(s_k, y_k^*)$ gives better curvature information than $(s_k, y_k)$ for updating $B_{k+1}$. Therefore, in a similar fashion as described in the previous section, a modified L-BFGS update can be constructed by using $y_k^*$ in place of $y_k$. Let $\psi_k = 2(F_k - F_{k+1}) + (g_k + g_{k+1})^T s_k$. In [41], the vector $y_k^*$ was constructed as
$$y_k^* = y_k + \frac{\psi_k}{\|s_k\|^2} s_k. \tag{13}$$
Definition (13) together with (12) provides more accurate curvature information. In fact, it can be proved that
$$\begin{aligned} s_k^T \left( \nabla^2 F(w_{k+1}) s_k - y_k^* \right) &= \tfrac{1}{3} s_k^T (T_{k+1} s_k) s_k + O(\|s_k\|^4), \\ s_k^T \left( \nabla^2 F(w_{k+1}) s_k - y_k \right) &= \tfrac{1}{2} s_k^T (T_{k+1} s_k) s_k + O(\|s_k\|^4), \end{aligned} \tag{14}$$
where $T_{k+1}$ is the tensor of the objective function $F$ at $w_{k+1}$ in the Taylor series expansion
$$F_k = F_{k+1} - g_{k+1}^T s_k + \tfrac{1}{2} s_k^T \nabla^2 F(w_{k+1}) s_k - \tfrac{1}{6} s_k^T (T_{k+1} s_k) s_k + O(\|s_k\|^4). \tag{15}$$
In [29], a simple modification of (13) was proposed as $y_k^* = y_k + \operatorname{sign}(\psi_k) \frac{\psi_k}{\|s_k\|^2} s_k$ to handle the case $\psi_k < 0$. We show below that this modification does not provide any improvement.
3.1 Sign Correction
Subtracting the first equation in (14) from the second and using $s_k^T y_k^* - s_k^T y_k = \psi_k$ yields
$$\psi_k = \tfrac{1}{6} s_k^T (T_{k+1} s_k) s_k + O(\|s_k\|^4). \tag{16}$$
Let $\psi_k < 0$. Then the sign-corrected vector satisfies $s_k^T y_k^* = s_k^T y_k - \psi_k$, which leads to
$$\begin{aligned} s_k^T \nabla^2 F(w_{k+1}) s_k - s_k^T y_k^* &= s_k^T \nabla^2 F(w_{k+1}) s_k - (s_k^T y_k + \psi_k) + 2\psi_k \\ &= \tfrac{2}{3} s_k^T (T_{k+1} s_k) s_k + O(\|s_k\|^4). \end{aligned} \tag{17}$$
Equation (17) shows that the dominant error term is even larger than those in (14). Therefore, we suggest using $y_k$ whenever $\psi_k < 0$; otherwise we can use $y_k^*$.
3.2 A New Modified Secant Condition
Writing the Taylor series expansion for $g_k$ and premultiplying it by $s_k^T$ leads to
$$s_k^T g_k = s_k^T g_{k+1} - s_k^T \nabla^2 F(w_{k+1}) s_k + \tfrac{1}{2} s_k^T (T_{k+1} s_k) s_k + O(\|s_k\|^4). \tag{18}$$
Combining (15) and (18), the third-order term cancels and
$$\begin{aligned} s_k^T \nabla^2 F(w_{k+1}) s_k &= 6(F_k - F_{k+1}) + 3 s_k^T (g_{k+1} + g_k) + s_k^T y_k + O(\|s_k\|^4) \\ &= 3\psi_k + s_k^T y_k + O(\|s_k\|^4), \end{aligned} \tag{19}$$
which suggests the choice
$$y_k^* = y_k + \frac{3\psi_k}{\|s_k\|^2} s_k. \tag{20}$$
Obviously, the new vector $y_k^*$ in (20) provides a better curvature approximation than the one defined in (13): the error is of order $O(\|s_k\|^4)$ instead of $O(\|s_k\|^3)$.
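The different error orders can be observed numerically. The sketch below, in plain Python, evaluates the curvature error $s_k^T \nabla^2 F(w_{k+1}) s_k - s_k^T y$ for the standard $y_k$, for $y_k^*$ from (13) and for $y_k^*$ from (20). The separable test function $F(w) = \sum_i e^{w_i}$, the point and the step are hypothetical choices made only because the gradient and Hessian are then available in closed form.

```python
import math

def F(w):
    """Test objective F(w) = sum_i exp(w_i) (an assumed example)."""
    return sum(math.exp(x) for x in w)

def grad(w):
    return [math.exp(x) for x in w]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def curvature_errors(w_k, s):
    """Errors s^T H(w_{k+1}) s - s^T y for the three choices of y."""
    w_next = [x + d for x, d in zip(w_k, s)]
    g_k, g_next = grad(w_k), grad(w_next)
    y = [b - a for a, b in zip(g_k, g_next)]
    psi = 2.0 * (F(w_k) - F(w_next)) + dot([a + b for a, b in zip(g_k, g_next)], s)
    exact = sum(math.exp(x) * d * d for x, d in zip(w_next, s))  # diagonal Hessian
    sy = dot(s, y)
    # For y* = y + c * psi / ||s||^2 * s we have s^T y* = s^T y + c * psi.
    return {
        "standard": exact - sy,
        "eq13": exact - (sy + psi),
        "eq20": exact - (sy + 3.0 * psi),
        "psi": psi,
    }

errs = curvature_errors([0.3, -0.2, 0.1], [0.1, 0.05, 0.08])
```

For this step the three errors decrease in the order standard, (13), (20), in line with the leading coefficients $1/2$ and $1/3$ in (14) and the cancellation of the third-order term in (19).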
4 The Modified L-BFGS Trust Region Method
Let L-BFGS-TR denote the L-BFGS trust-region method in which the current parameter vector $w_k$ at iteration $k$ is updated by a search direction obtained using a Hessian approximation $B_k$ for which the standard secant condition (3) holds. In a similar fashion, we describe in this section the modified L-BFGS trust-region method (M-LBFGS-TR), in which the Hessian approximations $B_k$ satisfy
the modified secant condition (12). In solving (1), both trust-region methods, whether the Hessian approximation satisfies the standard (3) or the modified (12) secant condition, generate a sequence of iterates (4) in which $p_k$ is obtained by solving the following trust-region subproblem:
$$p_k = \arg\min_{p \in \mathbb{R}^n} Q_k(p) := \tfrac{1}{2} p^T B_k p + g_k^T p \quad \text{s.t.} \quad \|p\|_2 \le \delta_k, \tag{21}$$
for some trust-region radius $\delta_k > 0$, where $g_k := \nabla F(w_k)$ and $B_k \approx \nabla^2 F(w_k)$. The acceptance of the trial point (4) is based on the ratio between the actual reduction in the objective function of (1) and the reduction predicted by the quadratic model, that is
$$\rho_k = \frac{F(w_k) - F(w_t)}{Q_k(0) - Q_k(p_k)}. \tag{22}$$
Since the denominator in (22) is nonnegative, if $\rho_k$ is positive, the new iterate $w_{k+1}$ is set to the trial point (4), $w_{k+1} := w_t$; otherwise, $w_{k+1} := w_k$. The adjustment of the trust-region radius at each iteration is described in Algorithm 2. According to [7,8], the subproblem (21) can be solved efficiently if $B_k$ is chosen to be a quasi-Newton matrix. Let $B_k$ be a (modified) L-BFGS Hessian approximation in compact form (7). As described in [15,32], the global solution of (21) is characterized by the following theorem.
Theorem 1. Let $\delta_k$ be a given positive constant. A vector $p^*$ is a global solution of the trust-region problem (21) if and only if $\|p^*\|_2 \le \delta_k$ and there exists a unique $\sigma^* \ge 0$ such that $B_k + \sigma^* I$ is positive semi-definite with
$$(B_k + \sigma^* I) p^* = -g_k, \qquad \sigma^* (\delta_k - \|p^*\|_2) = 0. \tag{23}$$
Moreover, if $B_k + \sigma^* I$ is positive definite, then the global minimizer is unique.
Following [1,7,35], the solution of the trust-region subproblem (21) can be computed as
$$p^* := p(\sigma^*) = -\frac{1}{\tau_k} \left[ I - \Psi_k \left( \tau_k M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k, \tag{24}$$
where $\tau_k = \gamma_k + \sigma^*$. This direct formula can be obtained by exploiting the spectral decomposition of the coefficient matrix $B_k + \sigma^* I$ and its inversion using the Sherman-Morrison-Woodbury formula [34]. Algorithm 1 describes the process of solving the trust-region subproblem. It is based on the strategies described in the subsequent paragraphs. For further details see [1,7,35].
Algorithm 1. Trust-Region Subproblem Solution.
Inputs: current $\Psi \triangleq \Psi_k$, $M^{-1} \triangleq M_k^{-1}$, $\gamma \triangleq \gamma_k$, $\delta \triangleq \delta_k$ and $g \triangleq g_k$
Compute the thin QR factorization of $\Psi$ with factors $Q$ and $R$
Compute the spectral decomposition of the matrix $R M R^T = U \hat{\Lambda} U^T$
Set: the reordered matrix $\hat{\Lambda} = \operatorname{diag}(\hat{\lambda}_1, \ldots, \hat{\lambda}_k)$ such that $\hat{\lambda}_1 \le \ldots \le \hat{\lambda}_k$
Compute the spectral decomposition of $B_k$ as $\Lambda_1 = \hat{\Lambda} + \gamma I$
Let: $\lambda_{\min} = \min\{\lambda_1, \gamma\}$
Compute $P_\parallel = Q U$
Compute $g_\parallel = P_\parallel^T g$
Compute $a_j = (g_\parallel)_j$ and $a_{k+1} = \sqrt{\|g\|_2^2 - \|g_\parallel\|_2^2}$
if $\phi(0) \ge 0$ then
    Set: $\sigma^* = 0$
    Compute $p^*$ with (26) as solution of $(B_k + \sigma^* I) p = -g$
else
    Compute a root $\sigma^* \in (0, \infty)$ of (28) by Newton's method
    Compute $p^*$ with (26) as solution of $(B_k + \sigma^* I) p = -g$
end if

Algorithm 2. Trust-Region Radius Adjustment
Inputs: current iteration $k$, $\delta_k$, $\rho_k$, $0 < \tau_2 < 0.5 < \tau_3 < 1$, $0 < \eta_2 \le 0.5$, $0.5 < \eta_3 < 1 < \eta_4$
if $\rho_k > \tau_3$ then
    if $\|p_k\| \le \eta_3 \delta_k$ then
        $\delta_{k+1} = \delta_k$
    else
        $\delta_{k+1} = \eta_4 \delta_k$
    end if
else if $\tau_2 \le \rho_k \le \tau_3$ then
    $\delta_{k+1} = \delta_k$
else
    $\delta_{k+1} = \eta_2 \delta_k$
end if
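Algorithm 2 translates directly into a small function. The Python sketch below follows the branches of the listing; the function and argument names are hypothetical, and the default thresholds are the hyper-parameter values reported in the experiments ($\tau_2 = 0.1$, $\tau_3 = 0.75$, $\eta_2 = 0.5$, $\eta_3 = 0.8$, $\eta_4 = 2$).

```python
def adjust_radius(delta, rho, step_norm,
                  tau2=0.1, tau3=0.75, eta2=0.5, eta3=0.8, eta4=2.0):
    """Return the next trust-region radius following Algorithm 2.

    delta     -- current radius delta_k
    rho       -- ratio of actual to predicted reduction, rho_k
    step_norm -- norm of the computed step p_k
    """
    if rho > tau3:
        # Very successful step: enlarge only if the step is near the boundary.
        if step_norm <= eta3 * delta:
            return delta
        return eta4 * delta
    if tau2 <= rho <= tau3:
        # Acceptable agreement between model and objective: keep the radius.
        return delta
    # Poor agreement: shrink the radius.
    return eta2 * delta

# Usage with hypothetical values.
enlarged = adjust_radius(delta=1.0, rho=0.9, step_norm=1.0)
shrunk = adjust_radius(delta=1.0, rho=0.05, step_norm=1.0)
```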
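The Spectral Decomposition of Matrix $B_k + \sigma^* I$. Computing the thin QR factorization of the matrix $\Psi_k$, $\Psi_k = Q_k R_k$, where, for $k \ge r$, $Q_k \in \mathbb{R}^{n \times 2r}$ and $R_k \in \mathbb{R}^{2r \times 2r}$, and the cheap spectral decomposition of the $2r \times 2r$ matrix $R_k M_k R_k^T$ as $R_k M_k R_k^T = U_k \hat{\Lambda} U_k^T$, where $U_k$ and $\hat{\Lambda} = \operatorname{diag}(\hat{\lambda}_1, \ldots, \hat{\lambda}_{2r})$ are respectively orthogonal and diagonal matrices, leads to
$$B_k = B_0 + Q_k R_k M_k R_k^T Q_k^T = \gamma_k I + Q_k U_k \hat{\Lambda} U_k^T Q_k^T.$$
Now, let $P_\parallel \triangleq Q_k U_k$ and $P_\perp \triangleq (Q_k U_k)^\perp$, where $(\cdot)^\perp$ denotes the orthogonal complement. By Theorem 2.2.1 in [18], we have $P^T P = P P^T = I$, where $P \triangleq [\, P_\parallel \ \ P_\perp \,] \in \mathbb{R}^{n \times n}$ is an orthogonal matrix. Therefore, the spectral decomposition of $B_k$ is obtained as
$$B_k = P \Lambda P^T, \qquad \Lambda \triangleq \begin{bmatrix} \Lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix}, \tag{25}$$
where $\Lambda_1 = \hat{\Lambda} + \gamma_k I = \operatorname{diag}(\hat{\lambda}_1 + \gamma_k, \hat{\lambda}_2 + \gamma_k, \ldots, \hat{\lambda}_{2r} + \gamma_k)$ and $\Lambda_2 = \gamma_k I$. We assume the eigenvalues are increasingly ordered.
The Inversion of $B_k + \sigma^* I$. Let $\tau_k = \gamma_k + \sigma$. Applying the Sherman-Morrison-Woodbury formula [34] to compute the inverse of the coefficient matrix $B_k + \sigma I$ leads to
$$p(\sigma) = -(B_k + \sigma I)^{-1} g_k = -\frac{1}{\tau_k} \left[ I - \Psi_k \left( \tau_k M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k. \tag{26}$$
By (25) and (26), we have
$$\|p(\sigma)\|^2 = \sum_{i=1}^{2r} \frac{(g_\parallel)_i^2}{(\lambda_i + \sigma)^2} + \frac{\|g_\perp\|^2}{(\gamma_k + \sigma)^2}, \tag{27}$$
where $\lambda_i = \hat{\lambda}_i + \gamma_k$ are the diagonal entries of $\Lambda_1$, $g_\parallel = P_\parallel^T g$ and $\|g_\perp\|^2 = \|g\|^2 - \|g_\parallel\|^2$. Let $p_u \triangleq p(0)$ be the solution of the first optimality condition $(B_k + \sigma I) p(\sigma) = -g_k$ for $\sigma = 0$, for which the second optimality condition $\sigma (\delta_k - \|p(\sigma)\|_2) = 0$ holds trivially. If $\|p_u\| \le \delta_k$, using (26) we have $(\sigma^*, p^*) = (0, p_u) = (0, p(0))$. If $\|p_u\| > \delta_k$, then $p^*$ must lie on the boundary of the trust region to make the second optimality condition hold. To impose this, $\sigma^*$ must be the root of the equation
$$\phi(\sigma) \triangleq \frac{1}{\|p(\sigma)\|} - \frac{1}{\delta_k} = 0, \tag{28}$$
which can be determined by Newton's method, e.g. the variant proposed in [7]. The global solution of the trust-region subproblem is then $(\sigma^*, p^*) = (\sigma^*, p(\sigma^*))$.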
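The two cases above (interior solution with $\sigma^* = 0$, or boundary solution with $\sigma^*$ a root of $\phi$) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: it uses a dense eigendecomposition of a positive definite $B$ instead of the compact QR-based decomposition (25), it applies plain Newton iterations to $\phi$ rather than the variant of [7], and all names, tolerances and test data are hypothetical.

```python
import numpy as np

def solve_tr_subproblem(B, g, delta, tol=1e-10, max_iter=50):
    """Solve min 0.5 p^T B p + g^T p s.t. ||p|| <= delta for B positive definite."""
    lam, P = np.linalg.eigh(B)          # B = P diag(lam) P^T
    if lam[0] <= 0:
        raise ValueError("this sketch assumes a positive definite B")
    a = P.T @ g                         # coordinates of g in the eigenbasis

    def step_norm(sigma):
        # ||p(sigma)|| from the eigenvalue form of ||p(sigma)||^2
        return np.sqrt(np.sum(a ** 2 / (lam + sigma) ** 2))

    sigma = 0.0
    if step_norm(0.0) > delta:          # phi(0) < 0: solution on the boundary
        for _ in range(max_iter):
            nrm = step_norm(sigma)
            phi = 1.0 / nrm - 1.0 / delta
            if abs(phi) < tol:
                break
            dphi = np.sum(a ** 2 / (lam + sigma) ** 3) / nrm ** 3
            sigma -= phi / dphi         # Newton step on phi(sigma) = 0
    p = -P @ (a / (lam + sigma))        # p(sigma) = -(B + sigma I)^{-1} g
    return sigma, p

# Hypothetical test problem.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
B = A @ A.T + 5.0 * np.eye(5)
g = rng.standard_normal(5)
p_u = -np.linalg.solve(B, g)            # unconstrained minimizer
delta = 0.5 * np.linalg.norm(p_u)       # radius that makes the constraint active
sigma, p = solve_tr_subproblem(B, g, delta)
```

Because $\phi$ is increasing and concave for $\sigma > -\lambda_{\min}$, Newton's method started at $\sigma = 0$ with $\phi(0) < 0$ approaches the root monotonically from the left in exact arithmetic, which is why no safeguarding is included in this sketch.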
Algorithm 3. Stochastic M-LBFGS-TR
Inputs: $w_0 \in \mathbb{R}^n$, number of multi-batches $\bar{N}$, number of epochs $epoch_{\max}$, overlap set size $os$, $r$, $\epsilon_1$, $\epsilon_2$, $\tau_1 > 0$, $S_0 = Y_0 = [\,.\,]$, $k = 0$, $epoch = 0$
while $k \ge 0$ do
    if $k = 0$ then
        Take subsets $O_{-1}$ and $O_0$ of size $os$ for the initial multi-batch $J_0$
        Compute $F_0^{O_{-1}}$, $g_0^{O_{-1}}$ and $F_0^{O_0}$, $g_0^{O_0}$ by (29) and then $F_0^{J_0}$, $g_0^{J_0}$ by (30)
    else
        Take the second subset $O_k$ of size $os$ for the multi-batch $J_k$
        Compute $F_k^{O_k}$, $g_k^{O_k}$ by (29), and then $F_k^{J_k}$, $g_k^{J_k}$ by (30)
        if $\operatorname{mod}(k + 1, \bar{N}) = 0$ then
            Shuffle the data without replacement for the next epoch and set $epoch = epoch + 1$
        end if
    end if
    if $\|g_k^{J_k}\| \le \epsilon_1$ or $epoch > epoch_{\max}$ then
        return
    end if
    if $k = 0$ or $S_k = [\,.\,]$ then
        Compute $p_k = -\delta_k \, g_k^{J_k} / \|g_k^{J_k}\|$
    else
        Compute $p_k$ using Algorithm 1
    end if
    Compute the trial point $w_t = w_k + p_k$ and $F_t^{O_k}$, $g_t^{O_k}$ by (29)
    Compute $(s_k, y_k) = (w_t - w_k, \; g_t^{O_k} - g_k^{O_k})$ and $\rho_k = (F_t^{O_k} - F_k^{O_k}) / Q(p_k)$
    Compute $\psi_k = 2(F_k^{O_k} - F_t^{O_k}) + s_k^T (g_k^{O_k} + g_t^{O_k})$
    if $\operatorname{sign}(\psi_k) > 0$ then
        $y_k = y_k + \frac{3\psi_k}{\|s_k\|^2} s_k$
    end if
    if $\rho_k \ge \tau_1$ then
        $w_{k+1} = w_t$
    else
        $w_{k+1} = w_k$
    end if
    Update $\delta_k$ using Algorithm 2
    if $s_k^T y_k > \epsilon_2 \|s_k\|^2$ then
        if $k \le r$ then
            Store: $s_k$ and $y_k$ as new columns in $S_{k+1}$ and $Y_{k+1}$
        else
            Keep: only the $r$ most recent pairs $\{s_j, y_j\}_{j=k-r+1}^{k}$ in $S_{k+1}$ and $Y_{k+1}$
        end if
        Compute the smallest eigenvalue $\hat{\lambda}$ of the problem (11)
        if $\hat{\lambda} > 0$ then
            $\gamma_{k+1} = \max\{1, 0.9\hat{\lambda}\} \in (0, \hat{\lambda})$
        else
            Compute $\gamma_k^h$ as $\frac{y_{k-1}^T y_{k-1}}{y_{k-1}^T s_{k-1}}$ and set $\gamma_{k+1} = \max\{1, \gamma_k^h\}$
        end if
        Compute $\Psi_{k+1}$ and $M_{k+1}^{-1}$ using (8)
    else
        Set $\gamma_{k+1} = \gamma_k$, $\Psi_{k+1} = \Psi_k$ and $M_{k+1}^{-1} = M_k^{-1}$
    end if
    $k = k + 1$
end while
5 Stochastic M-LBFGS-TR
In the stochastic setting, the training set is divided into multiple subsets called batches. Selecting a single batch, computing the subsampled loss and gradient for it, and then updating the parameters constitutes one iteration of a stochastic algorithm. This process is repeated for each batch until one epoch, that is, one pass through all data samples, is completed. After each epoch, the dataset is shuffled and new batches are generated. Let $J_k$ be a random subset of data at iteration $k$, whose size and sample index set are denoted by $|J_k|$ and $J_k^{idx}$, respectively. In this work, samples are drawn without replacement for batches of fixed size. The subsampled loss and gradient are computed as follows:
$$F_k^{J_k} := F^{J_k}(w_k) = \frac{1}{|J_k|} \sum_{i \in J_k^{idx}} f_i(w_k), \qquad g_k^{J_k} := \nabla F^{J_k}(w_k) = \frac{1}{|J_k|} \sum_{i \in J_k^{idx}} \nabla f_i(w_k). \tag{29}$$
In the stochastic L-BFGS-TR (sL-BFGS-TR) algorithm, when the batch $J_k$ changes from one iteration to the next, the updates might be unstable, since different data points are used to evaluate the gradient at the beginning (at $w_k$) and at the end of the iteration (at $w_t$), so that the gradient difference employed to update the Hessian approximation is computed as $y_k = g_t^{J_{k+1}} - g_k^{J_k}$. To overcome this problem, a remedy suggested in [39] consists in using the same multi-batch $J_k$ for computing $y_k = g_t^{J_k} - g_k^{J_k}$, which requires double function and gradient evaluations at $w_k$ and $w_t$. Another sampling strategy was proposed in [2] to compute $y_k = g_t^{O_k} - g_k^{O_k}$, where $O_k = J_k \cap J_{k+1} \neq \emptyset$ and the overlap set $O_k$ should not be insignificant in size. Similarly, in the stochastic M-LBFGS-TR (sM-LBFGS-TR) algorithm, when $\psi_k > 0$, the modified vector $y_k^*$ is computed as in (20) with $\psi_k = 2(F_k^{O_k} - F_t^{O_k}) + (g_k^{O_k} + g_t^{O_k})^T s_k$, in agreement with the definition of $\psi_k$ in Sect. 3. In this work, we take a particular variant of this approach, referred to as half overlap sampling, where $J_k = O_{k-1} \cup O_k$ and $|J_k| = 2|O_k|$. With this sampling strategy, the overall loss and gradient in (29) are computed as
$$F_k^{J_k} = \tfrac{1}{2} \left( F_k^{O_{k-1}} + F_k^{O_k} \right), \qquad g_k^{J_k} = \tfrac{1}{2} \left( g_k^{O_{k-1}} + g_k^{O_k} \right). \tag{30}$$
This requires two function and gradient evaluations on the overlap set of the current batch. The stochastic M-LBFGS-TR training algorithm is outlined in Algorithm 3. Besides the previously indicated function and gradient evaluations, which constitute the predominant cost, the per-iteration complexity of both the sL-BFGS-TR and sM-LBFGS-TR algorithms consists of $2rn + O(r^3)$ operations needed to update $B_k$ and, in the trust-region framework, $2(4r + 1)n + O(r^2)$ flops to compute $Q(p)$, needed for evaluating $\rho$, and to obtain the search direction $p(\sigma)$ using the direct formula (24). We also have the cost of computing a QR factorization and a cheap eigenvalue decomposition, requiring $O(nr^2)$ and $O(r^3)$ operations, respectively.
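The half overlap sampling scheme can be illustrated with a few lines of plain Python. The sketch below shuffles the sample indices once per epoch, cuts them into overlap blocks $O_k$ of equal size and forms multi-batches $J_k = O_{k-1} \cup O_k$; it then checks the averaging identity (30) on per-sample losses. The helper names, the seed handling and the toy loss values are hypothetical, and any leftover samples that do not fill a block are simply dropped in this sketch.

```python
import random

def half_overlap_batches(num_samples, overlap_size, seed=0):
    """Return a list of (O_prev, O_cur) index pairs; J_k is their union."""
    idx = list(range(num_samples))
    random.Random(seed).shuffle(idx)            # shuffle without replacement
    blocks = [idx[i:i + overlap_size]
              for i in range(0, num_samples - overlap_size + 1, overlap_size)]
    return [(blocks[k - 1], blocks[k]) for k in range(1, len(blocks))]

def mean_loss(losses, indices):
    """Subsampled loss (29) given precomputed per-sample losses f_i(w)."""
    return sum(losses[i] for i in indices) / len(indices)

# Hypothetical per-sample losses at a fixed parameter vector.
losses = [0.1 * (i % 7) for i in range(12)]
batches = half_overlap_batches(num_samples=12, overlap_size=3)

for o_prev, o_cur in batches:
    f_batch = mean_loss(losses, o_prev + o_cur)                 # F^{J_k}
    f_avg = 0.5 * (mean_loss(losses, o_prev) + mean_loss(losses, o_cur))
    assert abs(f_batch - f_avg) < 1e-12                         # identity (30)
```

Consecutive multi-batches share exactly one block, so the loss and gradient on $O_k$ computed for $J_k$ can be reused when forming $J_{k+1}$.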
Computing the numerator in (22) using the subsampled function difference $F_t^{J_k} - F_k^{J_k}$ requires a double function evaluation at the beginning and at the end of the iteration. Experimentally, we observed that using the overlap $O_k$ in place of $J_k$ provides a more affordable cost per iteration without any detriment to the attainable training accuracy. In support of this statement we have included Fig. 5. We note that computing $\psi_k$ in sM-LBFGS-TR does not impose any additional cost, because it uses subsampled loss and gradient values corresponding to $O_k$ which have already been evaluated in the previous iteration.
6 Experiments
In this section we summarize the behaviour of the described quasi-Newton optimization algorithms sL-BFGS-TR [35] and sM-LBFGS-TR when training two deep neural networks with different architectures for image classification on the benchmark datasets MNIST and CIFAR10 (see [22,25]). We used the Glorot (Xavier) approach [16] to initialize the learning parameters. The architecture of the networks, which contain batch normalization layers, is described below.

– LeNet-5. A well-known convolutional neural network designed for handwritten and machine-printed character recognition [26]. LeNet-5 with the following architecture is trained on the MNIST dataset by solving an optimization problem for $w \in \mathbb{R}^{431{,}080}$:
  • Input layer with a 28 × 28 × 1 image.
  • Convolutional layer with 20 filters of 5 × 5 size, stride 1 followed by ReLU.
  • Max pooling layer with a 2 × 2 window and stride 2.
  • Convolutional layer with 50 filters of 5 × 5 size, stride 1 followed by ReLU.
  • Max pooling layer with a 2 × 2 window and stride 2.
  • Fully connected layer with 500 neurons followed by ReLU.
  • Fully connected layer with 10 neurons followed by softmax.
– ConvNet3FC2. Motivated by [36], we define a CNN with 3 intermediate convolutional networks (ConvNet) and 2 fully connected networks (FC). This network, with the structure defined below, is trained on CIFAR10 by solving an optimization problem for $w \in \mathbb{R}^{3{,}525{,}162}$:
  • Input layer with a 32 × 32 × 3 image with Z-score normalization¹.
  • Convolutional layer with 32 filters of 5 × 5 size, stride 1 and padding 2.
  • Batch normalization layer followed by ReLU.
  • Max pooling layer with a 2 × 2 window and stride 1.
  • Convolutional layer with 32 filters of 5 × 5 size, stride 1 and padding 2.
  • Batch normalization layer followed by ReLU.
  • Max pooling layer with a 2 × 2 window and stride 1.
  • Convolutional layer with 64 filters of 5 × 5 size, stride 1 and padding 2.
  • Batch normalization layer followed by ReLU.
  • Max pooling layer with a 2 × 2 window and stride 1.
  • Fully connected layer with 64 neurons.
  • Batch normalization layer followed by ReLU.
  • Fully connected layer with 10 neurons followed by softmax.

¹ Z-score normalization produces a dataset whose mean and standard deviation are zero and one, respectively.
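The problem dimensions quoted for the two networks can be checked by counting the learnable parameters layer by layer. The sketch below is only a sanity check under stated assumptions that are not spelled out in the text: the LeNet-5 convolutions and all pooling layers use no padding, every convolutional and fully connected layer has a bias, and each batch normalization layer contributes one scale and one offset per channel. All function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_in * n_out + n_out


def bn_params(channels):
    """Assumed: one learnable scale and one offset per channel."""
    return 2 * channels


def lenet5_params():
    size = 28
    total = conv_params(5, 1, 20)
    size = conv_out(size, 5)              # 28 -> 24 (assumed no padding)
    size = conv_out(size, 2, stride=2)    # 24 -> 12
    total += conv_params(5, 20, 50)
    size = conv_out(size, 5)              # 12 -> 8
    size = conv_out(size, 2, stride=2)    # 8 -> 4
    total += fc_params(size * size * 50, 500)
    total += fc_params(500, 10)
    return total


def convnet3fc2_params():
    size, c_in, total = 32, 3, 0
    for c_out in (32, 32, 64):
        total += conv_params(5, c_in, c_out) + bn_params(c_out)
        size = conv_out(size, 5, padding=2)   # padding 2 keeps the size
        size = conv_out(size, 2, stride=1)    # stride-1 pooling shrinks it by 1
        c_in = c_out
    total += fc_params(size * size * c_in, 64) + bn_params(64)
    total += fc_params(64, 10)
    return total


if __name__ == "__main__":
    print(lenet5_params())       # 431080
    print(convnet3fc2_params())  # 3525162
```

Under these assumptions the counts reproduce the stated dimensions 431,080 and 3,525,162. Most of the ConvNet3FC2 parameters sit in the first fully connected layer, because the stride-1 pooling layers leave a 29 × 29 × 64 feature map.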
All experiments were run on an Ubuntu Linux server virtual machine with 32 CPUs and 128 GB RAM using MATLAB and its Deep Learning Toolbox. We provide a comparison with the most popular first-order method, Adam, implemented using the MATLAB built-in function adamupdate and tuned by a grid search over learning rates and batch sizes. The best learning rate for all batch sizes was found to be $10^{-3}$. The limited memory parameter for both quasi-Newton methods was set to $r = 20$. We obtained comparable results using different values of $r \in \{5, 10, 15, 20\}$, but we did not include these results here due to space limitations. Other hyperparameters for the L-BFGS-TR and M-LBFGS-TR algorithms are $\epsilon_1 = 10^{-5}$, $\epsilon_2 = 10^{-2}$, $\gamma_0 = 1$, $\tau_1 = 10^{-6}$, $\tau_2 = 0.1$, $\tau_3 = 0.75$, $\eta_2 = 0.5$, $\eta_3 = 0.8$, $\eta_4 = 2$.

We have investigated the effect of the batch size on the performance of the different training algorithms. The networks were trained for a maximum number of epochs. The program stops before that limit if 100% accuracy has been reached. Figures 1 and 2 show the evolution of loss and accuracy for different batch sizes $|J_k| \in \{100, 500, 2000, 5000\}$ in the classification of MNIST and CIFAR10, respectively. The results corresponding to the smallest batch size for the MNIST dataset are reported within the first epoch only to facilitate the comparison. All the loss and accuracy evolution curves have been filtered by a fixed display frequency. This frequency, when indicated, corresponds to how many iterations per epoch have not been displayed.

We observe from Figs. 1 and 2 that, for both problems, both sL-BFGS-TR and sM-LBFGS-TR perform better than tuned Adam independently of the batch size. In all the experiments, sM-LBFGS-TR exhibits performance comparable to that of sL-BFGS-TR. Neither sL-BFGS-TR nor sM-LBFGS-TR is strongly influenced by the batch size. Large multi-batch sizes can be employed without a considerable loss of accuracy, even though the performance of both methods decreases when larger batch sizes are used, due to the smaller number of iterations per epoch (smaller number of parameter updates). Adam performs very well in both problems, providing accuracies comparable to the ones yielded by second-order methods, even if it is less accurate when large batch sizes are used.

Figure 4 displays the variability of the obtained test accuracy computed over five runs with random seeds. It can be seen that the results are reliable and that first-order methods exhibit larger variability than the two quasi-Newton algorithms. In agreement with the complexity analysis performed in the previous section, we found that the training time of both second-order methods is larger than that of the first-order one (see Table 1). Nevertheless, we underline the fact that, as Fig. 3 illustrates, with a fixed computational time budget the proposed quasi-Newton methods provide comparable or better testing accuracy than the first-order Adam optimizer.
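The remark on the number of parameter updates can be made concrete with a short calculation. The sketch below assumes the standard training set sizes (60,000 images for MNIST and 50,000 for CIFAR10), which are not stated in this section, and it assumes disjoint batches; a sampling strategy with overlap changes the exact count, so the numbers are indicative only.

```python
import math

# Assumed standard training set sizes; not taken from the text above.
TRAIN_SIZES = {"MNIST": 60_000, "CIFAR10": 50_000}
BATCH_SIZES = (100, 500, 2000, 5000)


def iterations_per_epoch(n_samples, batch_size):
    """Number of parameter updates per epoch for disjoint batches."""
    return math.ceil(n_samples / batch_size)


if __name__ == "__main__":
    for name, n_samples in TRAIN_SIZES.items():
        counts = [iterations_per_epoch(n_samples, bs) for bs in BATCH_SIZES]
        print(name, dict(zip(BATCH_SIZES, counts)))
```

Under these assumptions, moving from a batch size of 100 to 5000 reduces the number of updates per epoch by a factor of 50 on both datasets.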
M. Yousefi and Á. Martínez Calomardo
Fig. 1. MNIST: Evolution of the training and testing loss and accuracy using stochastic quasi-Newton based methods and tuned Adam for different batch sizes.
A Stochastic Modified LBFGS-TR for Training DNNs
Fig. 2. CIFAR10: Evolution of the training and testing loss and accuracy using stochastic quasi-Newton based methods and tuned Adam for different batch sizes.
Fig. 3. Testing accuracy of stochastic quasi-Newton based methods and tuned Adam versus training CPU time (in seconds).
Table 1. Training time (hh:mm:ss) of the methods for kmax iterations.

               CIFAR10 (kmax = 100)      MNIST (kmax = 200)
Method         bs = 500    bs = 5000     bs = 500    bs = 5000
Adam           00:18:30    00:41:24      00:03:30    00:07:46
sL-BFGS-TR     00:32:04    00:55:04      00:09:35    00:13:45
sM-LBFGS-TR    00:32:06    00:54:46      00:09:42    00:13:40
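To read the training times in Table 1 as relative overheads, the entries can be converted to seconds and divided by the corresponding Adam time. The sketch below assumes that the entries are in hh:mm:ss format and that the columns appear in the order in which they are listed in the table (CIFAR10 first, then MNIST); the helper names are hypothetical.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return 3600 * hours + 60 * minutes + seconds


# Column order assumed to follow Table 1.
COLUMNS = ("CIFAR10 bs=500", "CIFAR10 bs=5000", "MNIST bs=500", "MNIST bs=5000")
TIMES = {
    "Adam": ["00:18:30", "00:41:24", "00:03:30", "00:07:46"],
    "sL-BFGS-TR": ["00:32:04", "00:55:04", "00:09:35", "00:13:45"],
    "sM-LBFGS-TR": ["00:32:06", "00:54:46", "00:09:42", "00:13:40"],
}


def slowdown_vs_adam(method):
    """Training time of `method` divided by the Adam time, per column."""
    adam = [to_seconds(t) for t in TIMES["Adam"]]
    other = [to_seconds(t) for t in TIMES[method]]
    return [o / a for o, a in zip(other, adam)]


if __name__ == "__main__":
    for method in ("sL-BFGS-TR", "sM-LBFGS-TR"):
        ratios = slowdown_vs_adam(method)
        print(method, ", ".join(f"{c}: {r:.2f}" for c, r in zip(COLUMNS, ratios)))
```

For these entries the quasi-Newton methods take roughly 1.3 to 2.8 times as long as Adam for the same number of iterations, and the overhead is smaller at the larger batch size.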
Fig. 4. Error bars of stochastic quasi-Newton based methods and tuned Adam: variability of the test accuracy in the format “mean ± standard deviation” computed over five runs with random seeds.
Fig. 5. CIFAR10: Evolution of the training and testing loss and accuracy using quasi-Newton methods with different sample sets ($O_k$ or $J_k$) to compute the subsampled loss function differences needed to compute $\rho_k$ in (22).
7 Conclusions
In this work, we have considered stochastic limited memory BFGS quasi-Newton methods to solve nonlinear and non-convex optimization problems arising in the training of deep neural networks. Our implementation incorporates an accurate selection of the initial Hessian approximation, and stable quasi-Newton updates are obtained by a sampling strategy with overlap. We have provided a comparison of the standard L-BFGS method with a variant of this algorithm based on a modified secant condition, which is theoretically shown to provide an increased order of accuracy in the approximation of the curvature of the Hessian. In our experiments on image classification problems with the MNIST and CIFAR10 datasets, sL-BFGS-TR and sM-LBFGS-TR exhibit comparable performance. Moreover, the results included in this paper illustrate that these methods converge faster than tuned Adam and perform better for larger batch sizes, which are favorable for parallel computing. Restricted to the experiments with the largest considered batch size, the results show that with a fixed computational time budget the proposed quasi-Newton methods provide comparable
or better testing accuracy than the first-order Adam optimizer. Nevertheless, despite their better convergence properties and the advantage of not requiring the time-consuming tuning effort that Adam needs, the iteration complexity is high, since two loss and gradient evaluations are required at each iteration. Future research will be devoted to devising sampling strategies that reduce the number of loss and gradient evaluations per iteration. Future work will also consist in comparing the efficiency of the proposed stochastic L-BFGS optimizers with the recently introduced Kronecker-factored block diagonal L-BFGS described in [17] for feed-forward networks. Finally, another line of research that we are currently pursuing is the analysis of whether symmetric rank one (SR1) updates, which allow for indefinite Hessian approximations, could potentially outperform L-BFGS in the task of high dimensional optimization in deep neural network training.
References

1. Adhikari, L., DeGuchy, O., Erway, J.B., Lockhart, S., Marcia, R.F.: Limited-memory trust-region methods for sparse relaxation. In: Wavelets and Sparsity XVII, vol. 10394, p. 103940J. International Society for Optics and Photonics (2017)
2. Berahas, A.S., Nocedal, J., Takáč, M.: A multi-batch L-BFGS method for machine learning. In: Advances in Neural Information Processing Systems, pp. 1055–1063 (2016)
3. Berahas, A.S., Takáč, M.: A robust multi-batch L-BFGS method for machine learning. Optim. Methods Softw. 35(1), 191–219 (2020)
4. Bollapragada, R., Byrd, R.H., Nocedal, J.: Exact and inexact subsampled Newton methods for optimization. IMA J. Numer. Anal. 39(2), 545–578 (2019)
5. Bollapragada, R., Nocedal, J., Mudigere, D., Shi, H.-J., Peter Tang, P.T.: A progressive batching L-BFGS method for machine learning. In: International Conference on Machine Learning, PMLR, pp. 620–629 (2018)
6. Bottou, L., LeCun, Y.: Large scale online learning. Adv. Neural. Inf. Process. Syst. 16, 217–224 (2004)
7. Brust, J., Erway, J.B., Marcia, R.F.: On solving L-SR1 trust-region subproblems. Comput. Optim. Appl. 66(2), 245–266 (2016)
8. Burdakov, O., Gong, L., Zikrin, S., Yuan, Y.: On efficiently combining limited-memory and trust-region techniques. Math. Program. Comput. 9(1), 101–134 (2016)
9. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)
10. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust region methods. SIAM (2000)
11. Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems, vol. 4, pp. 2933–2941 (2014)
12. Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
13. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7), 2121–2159 (2011)
14. Erway, J.B., Griffin, J., Marcia, R.F., Omheni, R.: Trust-region algorithms for training responses: machine learning methods using indefinite Hessian approximations. Optim. Methods Softw. 35(3), 460–487 (2020)
15. Gay, D.M.: Computing optimal locally constrained steps. SIAM J. Sci. Stat. Comput. 2(2), 186–197 (1981)
16. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, pp. 249–256 (2010)
17. Goldfarb, D., Ren, Y., Bahamou, A.: Practical quasi-Newton methods for training deep neural networks (2020). arXiv preprint, arXiv:2006.08877
18. Golub, G.H., Van Loan, C.F.: Matrix computations, 4th edn. Johns Hopkins University Press (2013)
19. Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: squeezing more curvature out of data. In: International Conference on Machine Learning, pp. 1869–1878 (2016)
20. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural. Inf. Process. Syst. 26, 315–323 (2013)
21. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2015)
22. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009). https://www.cs.toronto.edu/~kriz/cifar.html
23. Kungurtsev, V., Pevny, T.: Algorithms for solving optimization problems arising from deep neural net models: smooth problems (2018). arXiv preprint, arXiv:1807.00172
24. Kylasa, S., Roosta, F., Mahoney, M.W., Grama, A.: GPU accelerated sub-sampled Newton's method for convex classification problems. In: Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 702–710. SIAM (2019)
25. LeCun, Y.: The MNIST database of handwritten digits (1998). http://yann.lecun.com/exdb/mnist/
26. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
27. Lucchi, A., McWilliams, B., Hofmann, T.: A variance reduced stochastic Newton method (2015). arXiv preprint, arXiv:1503.08316
28. Martens, J., Sutskever, I.: Training deep and recurrent networks with Hessian-free optimization. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 479–535. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_27
29. Modarres, F., Malik, A.H., Leong, W.J.: Improved Hessian approximation with modified secant equations for symmetric rank-one method. J. Comput. Appl. Math. 235(8), 2423–2431 (2011)
30. Mokhtari, A., Ribeiro, A.: Res: regularized stochastic BFGS algorithm. IEEE Trans. Signal Process. 62(23), 6089–6104 (2014)
31. Mokhtari, A., Ribeiro, A.: Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 16(1), 3151–3181 (2015)
32. Moré, J.J., Sorensen, D.C.: Computing a trust region step. SIAM J. Sci. Stat. Comput. 4(3), 553–572 (1983)
33. Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Artificial Intelligence and Statistics, pp. 249–258 (2016)
34. Nocedal, J., Wright, S.: Numerical optimization. Springer Science & Business Media (2006). https://doi.org/10.1007/978-0-387-40065-5
35. Rafati, J., Marcia, R.F.: Improving L-BFGS initialization for trust-region methods in deep learning. In: 2018 17th IEEE International Conference on Machine Learning and Applications, ICMLA, pp. 501–508. IEEE (2018)
36. Ramamurthy, V., Duffy, N.: L-SR1: a second order optimization method for deep learning (2016)
37. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
38. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
39. Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online convex optimization. In: Artificial Intelligence and Statistics, PMLR, pp. 436–443 (2007)
40. Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017)
41. Wei, Z., Li, G., Qi, L.: New quasi-Newton methods for unconstrained optimization problems. Appl. Math. Comput. 175(2), 1156–1188 (2006)
42. Xu, P., Roosta, F., Mahoney, M.W.: Second-order optimization for non-convex machine learning: an empirical study. In: Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 199–207. SIAM (2020)
43. Ziyin, L., Li, B., Ueda, M.: SGD may never escape saddle points (2021). arXiv preprint, arXiv:2107.11774v1
Enhanced Deep Learning Framework for Fine-Grained Segmentation of Fashion and Apparel

Usman Ahmad Usmani¹, Ari Happonen²(B), and Junzo Watada³

¹ Universiti Teknologi Petronas, UTP, Seri Iskandar, 32610 Perak, Malaysia
[email protected]
² LUT University, Yliopistonkatu 34, 53850 Lappeenranta, Finland
[email protected]
³ 1 Chome-104 Totsukamachi, Shinjuku City, Tokyo 169-8050, Japan
Abstract. 3D clothing data models have been learned from real clothing data, but it is difficult to predict the exact segmentation mask of a garment because it varies with size. Accurate segmentation of clothes has become an important problem over the last few years, driven by automatic product detection for enhancing the consumer shopping experience. The ability to recognize clothing products and their associated attributes will improve that experience. In the fashion domain, the computer vision literature of the last five years has focused on seeking solutions for the recognition of clothes. Still, there has been a gap between the efforts of the fashion design and computer vision communities. This work proposes a deep learning framework that learns to detect and segment clothing objects accurately. We propose a clothing segmentation framework with novel feature extraction and fusion modules. The feature extraction module extracts the low-level feature data using Inception V3 and the high-level semantic data using the Mask Region Convolutional Neural Network (RCNN) segmentation branches. The feature fusion module then fuses the two types of image feature data with a new reference vector for each image. Consequently, the feature includes both high-level and low-level image semantic information, boosting the overall performance of our segmentation framework. We use the Polyvore and DeepFashion2 databases to test our algorithm, since these are the standard datasets used by current methods for running simulations. When compared to the current state-of-the-art segmentation methods, our results perform better with an accuracy of 17.3% and AUC of 4%.

Keywords: Deep learning · Neural network · Learning framework · Segmentation methodology · Data fusion · Fashion · Feature mapping · Digitalization · Clothing Industry · Industry 4.0
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 29–44, 2022. https://doi.org/10.1007/978-3-031-10464-0_3

1 Introduction

Clothing is significant in modern society because the right outfit can improve a person's look and attitude [1]. The rise of clothing e-commerce platforms and the emerging age of
big data [2–4] have resulted in massive volumes of clothing data, which have accumulated over the last few years on the Internet and social media. As a result, meeting user expectations and stimulating purchase intent are vital for improving the user experience and sales of clothing e-commerce platforms. In computer vision, deep learning algorithms have been widely used for object detection, image retrieval, image classification, ReID1–3, and image fusion [2, 3]. The primary areas of current deep learning research on clothes include clothing classification, clothing attribute prediction, clothing segmentation, clothing retrieval, and virtual try-on. Because of the excellent feature extraction capabilities of deep learning networks, apparel recommendation has become a prominent subject with potential for adoption by e-commerce platforms. Figure 1 shows some clothing images from the DeepFashion datasets with different clothing segments.

Clothing designs must, in general, take into account the compatibility of the products. Veit et al. [4, 5] employed Siamese networks to map clothing items from image space to style space. McAuley et al. [6] used a low-rank transformation into the style space for mapping clothing products. Several studies advocated mapping items into discrete style spaces to represent object compatibility [7, 8]. These methods, on the other hand, take extensive computational time and are incapable of handling vast amounts of data. To overcome these issues, Li et al. [9] recently proposed treating an outfit as a sequence and modeling compatibility using a recurrent neural network (RNN) [10]. Han et al. [11] developed the MM-Fashion approach, which represents an outfit as a bidirectional sequence in a specific order (from accessories to top to bottom, or in reverse order) and predicts the next item in the outfit using bidirectional long short-term memory networks (Bi-LSTMs). MM-Fashion builds semantic feature vectors from high-level semantic tag information in order to learn compatibility through a visual-semantic embedding. This strategy, however, has several drawbacks. First, the photo tags are obtained directly from the Internet and were assigned by people from different companies, so the consistency of the tag information cannot be taken for granted. Second, a word mapping matrix must be trained to convert the label information into a feature vector. Finally, because there is no clear convergence target for training, it is difficult to validate the word mapping matrix used to generate the semantic feature vector. As a result, this strategy does not work with images that do not include any relevant text [12, 13].

To address the concerns stated above, this research extracts the high-level semantic information of images by using the feature maps in the classification branch of the Mask RCNN network. Mask RCNN first maps the relevant feature layer of the Feature Pyramid Network (FPN) to the new target frame, so that the target object is represented by its feature map. The resulting feature map is fed to a fully connected layer and a SoftMax function to select the target object's category. Consequently, the feature map provides high-level semantic information for clothing segmentation and, in certain circumstances, can function as semantic labels. Unlike semantic tag information, this high-level semantic information does not need further annotation, which results in greater stability and is important for boosting the efficiency of the learning modules. Clothing segmentation and retrieval have enormous
potential for internet shopping due to its sales growth and popularity. Several computer vision results have been applied successfully and have increased clothing sales in online shopping [15, 16]. Pixel-wise annotation of clothing products in images enables fast and accurate recognition but is laborious and costly. On the other hand, online user data may be used to create clothing tags for high-quality images. Clothes on display differ in several ways from those shown in traditional images, and clothing images are difficult to segment because of their irregular regions and complex shapes. Within the fashion industry, researchers have made great strides in data collection across fashion databases. However, interpreting fashion images remains challenging because clothing appears distorted in different commercial and consumer images. This paper aims to segment an image into clothing regions and then extract the useful features of the image. The significant contribution of this paper is a model that can be used to interpret the details in a large batch of clothing images. The variation in poses and the self-occlusions of different people pose a big problem for the segmentation of clothing images. The number of clothing categories is large, and the fashion industry uses many additional classification schemes. To overcome the difficulties mentioned above, consecutive stages of inference over the collection of clothing images are required, namely segmentation of the clothing image to extract the distinct clothing sections, and inference over the clothing image field to identify the various clothing items [17]. Recent work on object/scene context has influenced clothing configuration contexts, which are based on location specifications and the mutual connections between them. Some algorithms use a support vector machine to refine the regions observed across all images for image segmentation. They extract regions from each image and classify them using varying dress appearance and human-specific features. Only the coherent regions fulfill all of the criteria (e.g., size and location). They train a series of top-down models for regions of interest and then classify regions with similar structures across all images. Results are presented by creating clusters that are more coherent than the original data. According to specific research, clothing of equal quality has similar patterns (i.e., shapes and structures). Hasan et al. [18] recommend a HOG-based approach. The vast number of categories and their considerable variability make automated detection via supervised learning problematic, because of the random categorization of images. They use single-molecule sequencing to produce data one step at a time. They construct a coherent model by combining sections within the same view and correlating images with comparable content. Similar areas in multiple photos may be extracted for statistical analysis, and grades can be assigned simultaneously. In our experiments, we demonstrate the effectiveness of our method for several other research purposes.

We also provide a landmark-driven attention method that uses the projected landmark heatmap to improve attribute prediction and category classification accuracy. Using both local and global landmark positions, our network can concentrate on the most important areas of the garments for categorization and attribute prediction. This is done by combining convolutional features and landmark locations to produce a new
attention map. This kind of attention mechanism highlights the most crucial information for fashion analysis while filtering out irrelevant elements, improving category and attribute prediction accuracy. Fashion image analysis could be used in industry but, unfortunately, it is not yet. There are issues related to the gap between benchmarks and real-world situations. For example, DeepFashion [19], the most prominent fashion dataset, has several drawbacks: it contains a single clothing item per image, it has sparse landmark and pose descriptions, and it has no per-pixel mask annotations. Every day, thousands of hours are spent looking at different clothing images. Figure 2 shows various segments plotted on a garment. As online clothing shopping spreads, the use of predictions would increase a company's sales. Because only a small percentage of clothing images have metadata, recognition algorithms are an efficient alternative to expensive manual annotation. A cross-scenario retrieval problem arises when the query is a real-world object, whereas similar objects are usually presented in a clean and isolated environment. To handle a database with millions of items, large-scale image retrieval methods are needed. Considering the variety of the data and the similarity between queries and images, we find that existing image retrieval techniques are unsuccessful. Because the predicted heatmaps have low resolution after several pooling operations, the majority of methods fall short of further improving fashion analysis accuracy. Sharp corners or cloth edges are often used as fashion landmarks, which influences the prediction accuracy. The brand images collected from online shopping are similar to those on the right, i.e., the clothes are worn by models or merely shown on a white backdrop. A rating mark is usually included with the product images. We employ imaging methods to determine which items belong to which class after grouping images into meaningful clusters based on how similar they appear. Our significant contributions are as follows:

• We propose an enhanced Mask R-CNN based deep learning framework for generating segmentation masks of clothing objects based on continuous clothing data.
• We include unique feature extraction and fusion modules in this apparel segmentation system. The feature extraction module extracts low-level feature data with Inception V3 and high-level semantic data with the Mask Region Convolutional Neural Network (RCNN) segmentation branches.
• The feature fusion module fuses the two types of image feature data with a new reference vector for each image. Using a binary spatial representation, we achieve fast, accurate classification without needing an actual learning point. The system is quick and adaptable, allowing it to handle large-scale product categorization with ease.
• The statistical results indicate that our method performs better than the current state-of-the-art methods.

The remainder of the paper is structured as follows. Section 2 reviews related work, Sect. 3 presents our deep learning method for the automated segmentation of clothing regions, and Sect. 4 concludes the paper.
Fig. 1. Plotting some images in the datasets with given segments.
2 Related Work

In the current literature on clothing segmentation, the main emphasis has been on developing expressive models for several styles of clothing and appearance. Some methods [4, 15, 20, 37] use a graph to depict the clothing configurations of the garments. The latest progress in segmentation concerns occluded group images [20], and enhanced segmentation methods have been presented for objects [21]. Many efforts have been made to establish a biological marker for the brain and estimate areas in the brain. These studies have not yet been extended to the topic of garment segmentation, despite their technical findings and analyses. They employ regression techniques for landmark localization, in which convolutional features are fed directly into a fully connected layer to regress the landmark coordinates. However, this kind of regression is complicated and non-linear, making the parameters difficult to comprehend. Clothing is often associated with object co-clustering, which is the grouping of images that include related items. In [22], a representation was obtained using unsupervised learning of a single model, whereas Al [23] integrated automated image segmentation with an unsupervised multi-class image. On the other hand, these image retrieval approaches are computationally costly and so are not suitable for all image retrieval jobs. Many recent works have solved the supervised image segmentation problem by training a classifier and then propagating the labels to the test images. Pioneering work by Liu et al. [24] showed how to encode the entire image with a sparse collection of dots. Nevertheless, these strategies are also costly and time-consuming. Researchers can find quantitative correspondences between different clothing styles. Much research has been done on the various factors influencing people's clothing choices. Much research also concentrates on detecting irregular objects in clothing rather than their placement. Lin et al. [25] jointly studied fashion category detection and recognition. They focus on the torso area, which is segmented using graphs, with a clothing model assumed to belong to the same person. All wear a bar-coded jacket that can
be checked. For the complete body and the training images, the work computes a global clothing probability map. Chen et al. [26] investigated how individuals learn the meanings of attributes. Modeling clothes involves using a conditional random field, images created by an individual, and a set of images to predict the clothing style of a person or an event. Simonyan et al. [27] state that clothes indicate an individual's social class. They used images to reflect both the smaller and the more complex patterns. The two types of instruments used to examine the torso region differ in system and application.
Fig. 2. Plotting segments: ClassId: “Shoe”, “Pants”, “top, t-shirt, sweatshirt”, “pocket”, “sleeve”, “neckline”.
Wang et al. [28] studied the classification of images that depict multiple individuals and developed a novel technique for segmenting multiple individuals in clothing. Bo et al. [29] focus on the influence of color and pattern. As in the works listed earlier, the anatomy of the upper/torso portion of the human body is identified. Wang et al. [30] suggested a method for retrieving garments across scenarios. They employ an intermediate annotated auxiliary set to reconstruct the query picture and extract cross-item similarities. Then they learn the similarity transfer matrix from the auxiliary set to the shopping item images. Their approach is straightforward, but it requires a precise computation of the number of shared characteristics between two articles of apparel.
There has been a lot of study on the issue of garment recognition in recent years. A recent study [31] looked at general outfit identification. Several papers [32–36] focus on different types of clothes and their uses. Human identification and posture evaluation were employed in this study, as in the great majority of prior investigations. In [5], Gallagher and Chen [37] work together to overcome the problems of apparel segmentation and identification. Graph cuts, based on a clothing model from one or more images of the same person dressed identically, segment the torso. An automatically extracted clothing mask covers the torso area of each person’s outfit model. Using all of the training photographs and clothing classes in our scenario, we create a global clothing prior probability map for the whole body in this research. Leibe et al. [38] concentrate on attribute learning or the acquisition of semantic clothing data. They demonstrate a one-of-a-kind method that uses a conditional random field. It overlays on the top of individual attribute classification predictions to forecast a person’s or an event’s dress style. According to Winn et al. [39], human professions may be distinguished by their clothing. They relate visual patches to semantic-level patterns like clothes and hairstyle trends using sparse coding approaches. Because [2] and [14], like [5] are thoracic, they are unable to understand generic clothes. Wang et al. [40] are interested in multi-person pictures and provide a unique multi-person clothing segmentation approach for highly occluded images. The distinctions between monochromatic and printed/stitched textiles are investigated by Yang et al. [41]. Both sculptures, like their predecessors, focus on the human torso and upper body. Nowozin et al. [42] were the first to suggest the concept of crossscenario clothing retrieval. 
Using an intermediate annotated auxiliary set, they design a similarity transfer matrix from the auxiliary set to the internet shopping set to find cross-scenario similarities. The auxiliary set is used to generate sparse reconstructions of the human parts in the matched query picture. Their method is quick, but it only works for matching upper- and lower-body apparel, and similarity is measured by how many features two clothing items have in common. Wang et al. [43] recently presented the magic closet strategy for wardrobe suggestions. They use garment features as latent variables in a latent Support Vector Machine-based recommendation model to provide occasion-oriented clothing recommendations. Both methods can miss clothes and therefore call for more precise garment segmentation. Szegedy et al. [44] recently released a work that closely resembles ours. The authors start from articulated pose estimates to predict the apparel classes in a real-world picture. They discuss the Fashionista dataset, which we used in our research, and how garment estimates might aid pose estimation. A fashionista is someone who is focused on style, clothing, and fashion shopping; a celebrity stylist is one example. Because it covers the more general problem of clothing categorization, this is the approach we examine in the experiments section. As indicated in [16], an accurate pose estimate is a good starting point for our method. Incorrect pose estimates are one of the most common causes of failure.
3 Method
At present, the Mask R-CNN instance segmentation framework is the most popular. The Mask R-CNN mask head has a fully convolutional structure, with four
U. A. Usmani et al.
convolutional layers and two deconvolutional layers, but no pooling layers, to maximize efficiency. Feature maps with a fully convolutional structure disregard the different scales of semantic input. As a result, fine visual detail is predicted poorly. According to [45], higher-level neurons are required for global information, whereas lower-level neurons are more readily activated by local texture and patterns. The FPN structure is therefore added to the mask head, allowing higher-level information to be propagated to speed up segmentation. By combining lower-level data with higher-level semantic information, our strategy improves the segmentation performance of the whole feature hierarchy. The feature pyramid structure consists of a propagation path from the bottom layer to the top layer and lateral connections that add feature maps of the same resolution.
3.1 Network Structure
This work proposes a clothing segmentation network with three modules: a feature extraction module that extracts low-level feature information and high-level semantic information from clothing images, a feature fusion module that merges the feature vectors into one dimension to express image information fully, and a style compatibility learning module. Figure 3 shows our proposed clothing segmentation framework. We include unique feature extraction and fusion modules in this clothing segmentation system. The feature extraction module extracts low-level feature data with Inception V3, and high-level semantic data are extracted with the segmentation branches of the Mask Region-based Convolutional Neural Network (Mask R-CNN).
3.2 Feature Extraction Module: Inception V3 and Mask R-CNN
The low-level feature information is extracted using the Inception V3 model. It follows the same method as MM-Fashion and uses a model trained on ImageNet to extract low-level feature information.
The ImageNet collection comprises roughly 14 million images for image categorization. Low-level feature information, such as texture and edge information, can be extracted from images using ImageNet-trained classification models. This information includes both the image's details and the low-level features required to discriminate between different kinds of clothing. At the high level, the Mask R-CNN branch identifies the clothing category of the target. Even though category and style data complement each other, the feature map includes the least category data. Consequently, the feature map can be used to offer high-level semantic information for clothing segmentation, and in certain circumstances it could even act as semantic labels. This high-level semantic information requires no additional annotations, suffers less interference, and is more stable than label information supplied by other people or derived from clothing attributes. It is conducive to training the subsequent style compatibility learning module because it extracts picture features using the same set of parameters. Each module has its own distinct characteristics. To combine image and word information, our model provides low-level feature information by using the Inception V3 network. It uses a technique for translating text and image information into an embedding space. The main problem with
the MM-Fashion technique is the considerable semantic gap between image feature information and text information, which makes development and improvement difficult. In practice, the image's feature vector still depends heavily on the features provided by the Inception V3 model rather than on the fused information. To address this problem, we integrate high-level semantic information from Mask R-CNN with low-level feature information. We do so by creating a feature fusion module that uses the semantic information from Mask R-CNN, which eliminates the need for humans to label image description data. We introduce a reference vector into the framework as the image's feature vector to decrease the semantic gap between the two features and simplify optimization. This vector combines the image's low-level and high-level information.
Fig. 3. Our proposed clothing segmentation framework. We include unique feature extraction and fusion modules in this clothing segmentation system. The feature extraction module extracts low-level feature data with Inception V3, and high-level semantic data are extracted with the segmentation branches of the Mask Region-based Convolutional Neural Network (Mask R-CNN).
This goal led to the development of a new loss function that increases the similarity of the three feature vectors of an image and the discrimination of the reference vector across images. Low-level feature information $V_i^l$ and high-level semantic information $V_i^h$ are obtained when the image of clothing object $i$ is fed into Inception V3 and Mask R-CNN, respectively. We then map $V_i^l$ and $V_i^h$ into a low-level feature vector $f_i^l$ and a high-level semantic vector $f_i^h$ of the same dimension, using the weight matrices $W^l$ and $W^h$. The reference vector $f_i^e$ of the image is the average of $f_i^l$ and $f_i^h$. We calculate the distances $d_i^l$ and $d_i^h$ between the reference vector and $f_i^l$ and $f_i^h$, as well as the distances between $f_i^e$ and the reference vectors $f_j^e$ of the other images $M_j$, $j \in C$, $j \neq i$, in the clothing object image set $C$.

Fig. 4. Masks of images of different parts of clothing and apparel.
Fig. 5. Some more masks of images in the clothing dataset of different parts of clothing and apparel.
Fig. 6. Fine-grained segmentation masks of images.

We define $L_{pull}$ in Eq. (1) from the distances $d_i^l$ and $d_i^h$, and $L_{push}$ in Eq. (2), as follows:

$$L_{pull} = \frac{1}{N} \sum_{i=1}^{N} \left( \left[ f_i^l - f_i^e \right]^2 + \left[ f_i^h - f_i^e \right]^2 \right) \tag{1}$$

$$L_{push} = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq i}^{N} \min\left(0, -\left| f_i^e - f_j^e \right|\right) \tag{2}$$
Because the low-level feature vector, the high-level semantic vector, and the reference vector all belong to the same image, optimizing $L_{pull}$ reduces the distance between them. Because the reference vectors are taken from different images, optimizing $L_{push}$ widens the gap between them. Furthermore, since more information is supplied, the reference vector has a greater capacity to represent images. We use the weight matrices $W^l$ and $W^h$ supplied by Han et al. [11].
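The pull and push terms described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names (`linear_map`, `reference_vector`, `pull_loss`, `push_loss`) are hypothetical, vectors are plain lists, and the distance between reference vectors is assumed to be Euclidean.

```python
import math

def linear_map(weights, vector):
    """Apply a weight matrix (list of rows) to a vector; stands in for W^l or W^h."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def reference_vector(f_low, f_high):
    """Reference vector of one image: element-wise average of its two mapped vectors."""
    return [(a + b) / 2.0 for a, b in zip(f_low, f_high)]

def sq_dist(u, v):
    """Squared Euclidean distance (assumed reading of the squared brackets in Eq. (1))."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pull_loss(f_lows, f_highs, f_refs):
    """Mean over images of the squared distances from both vectors to the reference vector."""
    n = len(f_refs)
    total = sum(sq_dist(fl, fe) + sq_dist(fh, fe)
                for fl, fh, fe in zip(f_lows, f_highs, f_refs))
    return total / n

def push_loss(f_refs):
    """Mean over ordered pairs i != j of min(0, -distance) between reference vectors."""
    n = len(f_refs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += min(0.0, -math.sqrt(sq_dist(f_refs[i], f_refs[j])))
    return total / (n * (n - 1))

# Two toy images with 2-dimensional mapped vectors (made-up numbers).
f_lows = [[0.0, 0.0], [4.0, 0.0]]
f_highs = [[2.0, 0.0], [4.0, 2.0]]
f_refs = [reference_vector(fl, fh) for fl, fh in zip(f_lows, f_highs)]
print(pull_loss(f_lows, f_highs, f_refs), push_loss(f_refs))
```

Minimizing the first value draws the vectors of one image together, while minimizing the second (which becomes more negative as reference vectors move apart) separates different images.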
4 Experiments
4.1 Datasets
Polyvore is a well-known fashion website with a large selection of images and descriptions of clothing items. For each outfit, the number of clothing objects it contains and its style tags are provided. Researchers have used the site's data for various fashion tasks, and images from the website served as the data for this study. There are 21,889 sets, with 17,316 used for training, 1,497 for validation, and 3,076 for testing. Implementation details of the Mask R-CNN classification branch: Using a Mask R-CNN model, we extract high-level features for the distinct clothing categories of DeepFashion [46]. One issue is that the model provides target boxes only for trousers and shirts, not for shoes or accessories. To solve this issue, we let the Mask R-CNN network take the whole image area as the target box for shoes and accessories and output the complete feature map as high-level semantic information. The FPN level P5 is the output feature layer, and RoIAlign is used to scale it to a spatial size of 7 × 7. The final output size is 7 × 7 × 256. A 2048-dimensional feature vector extracted with Inception V3 serves as the image's low-level feature information, and this feature vector is mapped to a 512-dimensional space using $W^l$. The Mask R-CNN classification branch flattens the 7 × 7 × 256 feature map into a 12,544-dimensional vector so that $W^h$ can map it to the same 712-dimensional space. The training configuration is otherwise similar to MM-Fashion, with 712 hidden units in the forward and backward directions and a dropout probability of 0.4. The optimizer's initial learning rate is set to 0.3 and is halved after every two epochs. Because memory capacity is limited, we set the batch size to 12, resulting in each batch comprising eight pairs of matching images for a total of 56 images.
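Two of the numbers in this configuration are easy to check with a short sketch: the length of a flattened 7 × 7 × 256 feature map, and a learning rate that starts at 0.3 and is halved after every two epochs. The helper names `flattened_size` and `step_decay` are hypothetical, and the 0-indexed epoch convention is an assumption rather than something the text specifies.

```python
def flattened_size(height, width, channels):
    """Number of values obtained by flattening a height x width x channels feature map."""
    return height * width * channels

def step_decay(initial_lr, epoch, step=2, factor=0.5):
    """Learning rate at a 0-indexed epoch when it is multiplied by `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

print(flattened_size(7, 7, 256))                          # 7 * 7 * 256
print([step_decay(0.3, epoch) for epoch in range(6)])     # halves at epochs 2 and 4
```

Under these assumptions the first two epochs train at 0.3, the next two at 0.15, and so on.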
Usually, the larger the batch size, the faster the training, but the more memory it takes. The exact mini-batch size is generally determined through trial and error. Several tests with sizes ranging from a few tens to a few thousand should be run on a sample of the dataset, and the quickest one chosen. Batch sizes in that range are relatively common in the literature. ImageNet is used to pre-train the
parameters of Inception V3. In this part, we use the compatibility prediction Area Under Curve (AUC) and fill-in-the-blank (FITB) performance to compare the proposed work with existing state-of-the-art approaches [54]. Figure 4 shows the masks of images of different parts of clothing and apparel. Figure 5 shows some more masks of images in the clothing dataset of different parts of clothing and apparel. Figure 6 shows the fine-grained segmentation masks of images. Table 1 compares the proposed method with other state-of-the-art methods on the compatibility prediction AUC and FITB. Four sets of comparison tests are required in the following application scenarios to assess the effect of the feature extraction and feature fusion modules on the results: MM-Fashion, MM-Fashion with the feature fusion module (v1), MM-Fashion with the feature extraction module (v2), and the feature extraction module together with the feature fusion module (v3). The suggested variants v1, v2, and v3 are used in the last three cases. We compare our method with the following techniques. Random assigns the garment compatibility ratings at random. SetRNN takes a series of fashion images and uses an RNN model to predict the compatibility score. LMT uses a projection matrix to represent the compatibility between objects in a style space. SiameseCNNs is a Siamese network-based pairwise compatibility learning method. MM-Fashion uses bidirectional LSTMs to predict the next object and the outfit compatibility score.

Table 1. Performance comparison between the proposed method and other state-of-the-art methods on the compatibility prediction AUC and FITB.

Method                                                    FITB Acc   AUC
Random [47]                                               24.1%      0.510
SetRNN [9]                                                29.5%      0.530
LMT [48]                                                  49.6%      0.677
SiameseCNNs [49]                                          48.1%      0.709
MM-Fashion [50]                                           68.6%      0.861
MM-Fashion + feature fusion module [51]                   69.3%      0.882
MM-Fashion + feature extraction module [52]               69.5%      0.874
Feature extraction module + feature fusion module [53]    70.1%      0.902
Our algorithm                                             87.3%      0.940
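For readers unfamiliar with the two metrics in Table 1, the following sketch shows one common way to compute them. It is an illustration under stated assumptions rather than the evaluation code used in the paper: `auc` treats the AUC as the fraction of (compatible, incompatible) outfit pairs that are ranked correctly, `fitb_accuracy` picks the highest-scoring candidate for each blank, and all scores below are made up.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def fitb_accuracy(questions):
    """questions: list of (candidate_scores, index_of_correct_candidate) tuples."""
    correct = 0
    for scores, answer in questions:
        predicted = max(range(len(scores)), key=lambda k: scores[k])
        if predicted == answer:
            correct += 1
    return correct / len(questions)

# Made-up compatibility scores for two compatible and two incompatible outfits.
print(auc([0.9, 0.6], [0.7, 0.2]))
# Two made-up fill-in-the-blank questions with four candidates each.
print(fitb_accuracy([([0.1, 0.8, 0.3, 0.2], 1), ([0.5, 0.4, 0.3, 0.2], 2)]))
```

A scorer that ranks at random would be expected to give an AUC near 0.5, which is in line with the Random row of the table.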
5 Conclusions
This paper proposes a novel image segmentation process that produces a product-level clothing segmentation for objects. Research on using AI for the benefit of the fashion industry has recently seen new developments, such as studies on optimizing the design of clothes for sustainability and on using IoT, Industry 4.0 technologies, and AI together to support design decisions for fashion garments [55–59]. Still, there is a gap in the efforts by the fashion
designers and the computer vision communities to come together and improve the way this sector operates. Our study proposes a deep learning framework that can learn to detect and segment clothing objects accurately. By combining convolutional features and mark locations to produce a new attention map, our network can concentrate on the most important areas of the garments for categorization and attribute prediction. This kind of attention mechanism highlights the most crucial information for fashion analysis while filtering out irrelevant elements, improving category and attribute prediction accuracy. We propose a method to segment images and to apply field marks. Clothing segmentation is an efficient technique that can be applied to retrieve clothing based on parsing data. We hope to use our algorithm for virtual shopping experiences and for estimating clothing attributes in the near future; these could be used to collect case-by-case fit data and then manage area-specific development at the fleet level, as has been done in other industries [60–62]. Our research improves current segmentation networks by adding feature fusion and feature extraction modules. It outperforms the other state-of-the-art methods by about 17 percentage points in accuracy and by about 0.04 in AUC (Table 1). The feature fusion module constructs a new reference vector for each image. It also includes a loss function that reduces the distance between the vectors of a single image while increasing the distance between the reference vectors of different images.
References
1. Malisiewicz, T., Gupta, A., Efros, A.A.: Ensemble of exemplar-SVMs for object detection and beyond. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 89–96 (2011)
2. Silva, E.S., Hassani, H., Madsen, D.Ø.: Big Data in fashion: transforming the retail sector. J. Bus. Strategy 41(4), 21–27 (2019). https://doi.org/10.1108/JBS-04-2019-0062
3. Obschonka, M., Audretsch, D.B.: Artificial intelligence and big data in entrepreneurship: a new era has begun. Small Bus. Econ. 55(3), 529–539 (2019)
4. Awaysheh, F.M., Alazab, M., Garg, S., Niyato, D., Verikoukis, C.: Big data resource management & networks: Taxonomy, survey, and future directions. IEEE Commun. Surv. Tutorials 23(4), 2098–2130 (2021)
5. Veit, A., Kovacs, B., Bell, S., McAuley, J., Bala, K., Belongie, S.: Learning visual clothing style with heterogeneous dyadic co-occurrences. In: Proc. IEEE Int. Conf. Comput. Vision, pp. 4642–4650 (2015)
6. McAuley, J., Targett, C., Shi, Q., Van Den Hengel, A.: Image-based recommendations on styles and substitutes. In: Proc. 38th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, pp. 43–52 (2015)
7. Kalantidis, Y., Kennedy, L., Li, L.-J.: Getting the look: Clothing recognition and segmentation for automatic product suggestions in everyday photos. In: Proc. 3rd ACM Int. Conf. Multimedia Retrieval, pp. 105–112 (2013)
8. Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans. Image Process. 24(12), 4766–4779 (2015)
9. Li, Y., Cao, L., Zhu, J., Luo, J.: Mining fashion outfit composition using an end-to-end deep learning approach on set data. IEEE Trans. Multimedia 19(8), 1946–1955 (2017)
10. Dong, J., Chen, Q., Xia, W., Huang, Z., Yan, S.: A deformable mixture parsing model with parselets. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 3408–3415 (2013)
11. Han, X., Wu, Z., Jiang, Y.-G., Davis, L.S.: Learning fashion compatibility with bidirectional LSTMs. In: Proc. 25th ACM Int. Conf. Multimedia, pp. 1078–1086 (2017)
12. Simo-Serra, E., Fidler, S., Moreno-Noguer, F., Urtasun, R.: A high performance CRF model for clothes parsing. In: Proc. Asian Conf. Comput. Vis., pp. 64–81 (2015)
13. Kim, G., Xing, E., Fei-Fei, L., Kanade, T.: Distributed cosegmentation via submodular optimization on anisotropic diffusion. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 169–176 (2011)
14. Gallagher, A., Chen, T.: Clothing cosegmentation for recognizing people. In: Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 1–8 (2008)
15. Lin, L., Wu, T., Porway, J., Xu, Z.: A stochastic graph grammar for compositional object representation and recognition. Pattern Recog. 42, 1297–1307 (2009)
16. Liu, X., Lin, L., Yan, S., Jin, H., Tao, W.: Integrating spatio-temporal context with multiview representation for object recognition in visual surveillance. IEEE Trans. Circuits Syst. Video Technol. 21(4), 393–407 (2011)
17. Lin, L., Liu, X., Peng, S., Chao, H., Wang, Y., Jiang, B.: Object categorization with sketch representation and generalized samples. Pattern Recogn. 45(10), 3648–3660 (2012)
18. Kuettel, D., Guillaumin, M., Ferrari, V.: Segmentation propagation in imagenet. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) Computer Vision – ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VII, pp. 459–473. Springer Berlin Heidelberg, Berlin, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33786-4_34
19. Vezhnevets, A., Ferrari, V., Buhmann, J.: Weakly supervised semantic segmentation with a multi-image model. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 643–650 (2011)
20. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Recog. Image Anal. 96(1), 1–27 (2001)
21. Borràs, A., Tous, F., Lladós, J., Vanrell, M.: High-level clothes description based on colour-texture and structural features. In: Perales, F.J., Campilho, A.J.C., de la Blanca, N.P., Sanfeliu, A. (eds.) Pattern Recognition and Image Analysis: First Iberian Conference, IbPRIA 2003, Puerto de Andratx, Mallorca, Spain, June 4–6, 2003. Proceedings, pp. 108–116. Springer Berlin Heidelberg, Berlin, Heidelberg (2003). https://doi.org/10.1007/978-3-540-44871-6_13
22. Wang, X., Zhang, T.: Clothes search in consumer photos via color matching and attribute learning. In: Proc. ACM Int. Conf. Multimedia, pp. 1353–1356 (2011)
23. Liang, X., et al.: Deep human parsing with active template regression. IEEE Trans. Pattern Anal. Mach. Intell. 37(2), 2402–2414 (2015)
24. Liu, S., et al.: Matching-CNN meets KNN: Quasi-parametric human parsing. In: Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 1419–1427 (2015)
25. Chen, H., Xu, Z.J., Liu, Z.Q., Zhu, S.C.: Composite templates for cloth modeling and sketching. In: Proc. IEEE Comput. Sci. Conf. Comput. Vis. Pattern Recog., vol. 1, pp. 943–950 (2006)
26. Wang, X., Zhang, T., Tretter, D.R., Lin, Q.: Personal clothing retrieval on photo collections by color and attributes. IEEE Trans. Multimedia 15(8), 2035–2045 (2013)
27. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition, CoRR. Available: http://arxiv.org/abs/1409.1556 (2014)
28. Wang, W., Xu, Y., Shen, J., Zhu, S.-C.: Attentive fashion grammar network for fashion landmark detection and clothing category classification. In: Proc. IEEE Conf. Comput. Vision Pattern Recognit., pp. 4271–4280 (2018)
29. Bo, Y., Fowlkes, C.C.: Shape-based pedestrian parsing. In: Proc. IEEE Conf. Comput. Vision Pattern Recog., pp. 2265–2272 (2011)
30. Wang, N., Ai, H.: Who blocks who: Simultaneous clothing segmentation for grouping images. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 1535–1542 (2011)
31. Luo, P., Wang, X., Tang, X.: Hierarchical face parsing via deep learning. In: Proc. Comput. Vis. Pattern Recog., pp. 2480–2487 (2012)
32. Luo, P., Wang, X., Tang, X.: Pedestrian parsing via deep decompositional network. In: Proc. IEEE Int. Conf. Comput. Vis., pp. 2648–2655 (2013)
33. Liu, S., et al.: Fashion parsing with video context. IEEE Trans. Multimedia 17(8), 1347–1358 (2015)
34. Liu, S., et al.: Fashion parsing with weak color-category labels. IEEE Trans. Multimedia 16(1), 253–265 (2014)
35. Liu, X., et al.: Label to region by bi-layer sparsity priors. In: Proc. 17th ACM Int. Conf. Multimedia, pp. 115–124 (2009)
36. Luo, P., Wang, X., Lin, L., Tang, X.: Joint semantic segmentation by searching for compatible-competitive references. In: Proc. ACM Int. Conf. Multimedia, pp. 777–780 (2012)
37. Lin, L., Liu, X., Zhu, S.-C.: Layered graph matching with composite cluster sampling. IEEE Trans. Pattern Anal. Mach. Intell. 32(8), 1426–1442 (2010)
38. Leibe, B., Leonardis, A., Schiele, B.: Combined object categorization and segmentation with an implicit shape model. In: Proc. Workshop Statist. Learn. Comput. Vis., Eur. Conf. Comput. Vis., pp. 17–32 (2004)
39. Winn, J., Criminisi, A., Minka, T.: Object categorization by learned universal visual dictionary. In: Proc. 10th IEEE Int. Conf. Comput. Vis., vol. 2, pp. 1800–1807 (2005)
40. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image retrieval. In: Proc. Eur. Conf. Comput. Vis., pp. 584–599 (2014)
41. Yang, W., Luo, P., Lin, L.: Clothing co-parsing by joint image segmentation and labeling. In: Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 3182–3189 (2014)
42. Nowozin, S., Jegelka, S.: Solution stability in linear programming relaxations: graph partitioning and unsupervised learning. In: Proc. 26th Annu. Int. Conf. Mach. Learn., pp. 769–776 (2009)
43. Wang, K., Lin, L., Lu, J., Li, C., Shi, K.: Pisa: pixelwise image saliency by aggregating complementary appearance contrast measures with edge preserving coherence. IEEE Trans. Image Process. 24(10), 1057–7149 (2015)
44. Szegedy, C., et al.: Going deeper with convolutions. In: Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 1–9 (2015)
45. Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected CRFs. In: Proc. Int. Conf. Representation Learn. (2014)
46. Zhang, N., Paluri, M., Ranzato, M., Darrell, T., Bourdev, L.: PANDA: Pose aligned networks for deep attribute modeling. In: Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 1637–1644 (2014)
47. Donahue, J., et al.: Decaf: a deep convolutional activation feature for generic visual recognition. In: Proc. Int. Conf. Representation Learn. (2013)
48. Shotton, J., Johnson, M., Cipolla, R.: Semantic texton forests for image categorization and segmentation. In: Proc. IEEE Conf. Comput. Vis. Pattern Recog., pp. 1–8 (2008)
49. Jia, Y., et al.: Caffe: Convolutional architecture for fast feature embedding. In: Proc. ACM Int. Conf. Multimedia, pp. 675–678 (2014)
50. Kim, B.-G., Hong, G.S., Kim, J.-H., Choi, Y.-J.: An efficient vision-based object detection and tracking using online learning. J. Multimed. Inf. Syst. 4, 285–288 (2017)
51. Kim, J.-H., Kim, B.G., Roy, P.-P., Jeong, D.-M.: Efficient facial expression recognition algorithm based on hierarchical deep neural network structure. IEEE Access 7, 41273–41285 (2019)
52. Kahaki, S.M.M., Nordin, M.J., Ahmad, N.S., Arzoky, M., Ismail, W.: Deep convolutional neural network designed for age assessment based on orthopantomography data. Neural Comput. Appl. 32(13), 9357–9368 (2019)
53. Lee, Y.W., Kim, J.H., Choi, Y.J., Kim, B.G.: CNN-based approach for visual quality improvement on HEVC. In: Proceedings of the IEEE Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 12–14 Jan 2018
54. Luan, B., et al.: MUR-CNN: a two-dimensional code instance segmentation network based on deep learning. Future Internet 11, 197 (2019)
55. Ghoreishi, M., Happonen, A.: New promises AI brings into circular economy accelerated product design: a review on supporting literature. In: E3S Web Conf., vol. 158, pp. 1–10 (2020)
56. Lehikoinen, E., Viljakainen, E.: Robotic Process Automation in Financial Management. In: Proceedings, vol. 2233, issue 1, pp. 1–19 (2020)
57. Ghoreishi, M., Happonen, A.: The case of fabric and textile industry: the emerging role of digitalization, internet-of-things and industry 4.0 for circularity. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International Congress on Information and Communication Technology. LNNS, vol. 216, pp. 189–200. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-1781-2_18
58. Happonen, A., Ghoreishi, M.: A mapping study of the current literature on digitalization and industry 4.0 technologies utilization for sustainability and circular economy in textile industries. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International Congress on Information and Communication Technology. LNNS, vol. 217, pp. 697–711. Springer, Singapore (2022). https://doi.org/10.1007/978-981-16-2102-4_63
59. Ghoreishi, M., Happonen, A., Pynnönen, M.: Exploring industry 4.0 technologies to enhance circularity in textile industry: role of internet of things. In: Twenty-first International Working Seminar on Production Economics, pp. 1–16 (2020)
60. Kärri, T., et al.: Fleet-based industrial data symbiosis, Title of parent publication: S4Fleet - Service Solutions for Fleet Management, DIMECC Publications Series No. 19, 06/2017, pp. 124–169 (2017)
61. Kinnunen, S.-K., Happonen, A., Marttonen-Arola, S., Kärri, T.: Traditional and extended fleets in literature and practice: definition and untapped potential. Int. J. Strateg. Eng. Asset Manag. 3(3), 239–261 (2019)
62. Kortelainen, H., Happonen, A., Kinnunen, S.-K.: Fleet service generation - challenges in corporate asset management. In: Koskinen, K.T., Kortelainen, H., Aaltonen, J., Uusitalo, T., Komonen, K., Mathew, J., Laitinen, J. (eds.) Proceedings of the 10th World Congress on Engineering Asset Management (WCEAM 2015). LNME, pp. 373–380. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-27064-7_35
Towards Tackling QSAT Problems with Deep Learning and Monte Carlo Tree Search
Ruiyang Xu and Karl Lieberherr
Northeastern University, Boston, United States
Abstract. Recent achievements of AlphaZero using self-play have shown remarkable performance on several board games. It is plausible that self-play, starting from zero knowledge, can gradually approximate a winning strategy for certain two-player games after a sufficient amount of training. In this paper, we present a proof-of-concept that solves small instances of Quantified Boolean Formula Satisfaction (QSAT) problems by leveraging the computational power of neural Monte Carlo Tree Search (neural MCTS). QSAT is a PSPACE-complete problem with many practical applications. We propose a way to encode Quantified Boolean Formulas (QBFs) as graphs and apply a graph neural network (GNN) to embed the QBFs into the neural MCTS. After training, an off-the-shelf QSAT solver is used to evaluate the performance of the algorithm. Our results show that, for problems within a limited size, the algorithm learns to solve the problem correctly merely from self-play. It is encouraging that neural MCTS succeeds on small QSAT problems, but further research is needed to better understand the algorithm and its parameters.
Keywords: Neural MCTS · Graph neural network · QSAT
1 Introduction
The past several years have witnessed the progress and success of self-play. The combination of classical MCTS [5] algorithms with newly developed deep learning techniques gives stunning performance on complex board games like Go and Chess [17,18,20]. One common but outstanding feature of such algorithms is the tabula-rasa style of learning. In other words, the algorithm learns to play the game with zero knowledge (except the game rules). Such tabula-rasa learning is regarded as an approach to general artificial intelligence. Given such an achievement, it is interesting to see whether the algorithm's superhuman capability can be used to solve problems in other domains. Specifically, we apply neural MCTS [1,18] to solve the QSAT problem through self-play
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 45–58, 2022. https://doi.org/10.1007/978-3-031-10464-0_4
R. Xu and K. Lieberherr
on the QSAT game. Our experiment shows that, even though the QSAT game is different from traditional board games (see Sect. 4.2), the algorithm is still able to determine the truth of the corresponding QSAT problem through the dominant player. Furthermore, the trained algorithm can be used to reveal the solution (or show the non-existence of a solution) of the original problem through competitions against an enumerator. However, our objective is not necessarily to improve the state of the art of hand-crafted problem solvers in specific areas but to illustrate a proof-of-concept that there is a generic algorithm (neural MCTS) which can solve well-known PSPACE problems tabula-rasa. In this work, as a proof-of-concept, we make two main contributions:
1. We propose a way to turn QBFs (Quantified Boolean Formulas) into graphs so that they can be embedded with a graph neural network;
2. We implement a variant of the neural MCTS algorithm that has two independent neural networks (designed explicitly for the QSAT games), one for each of the players.
Our results show that the algorithm can solve small QSAT instances correctly. Although the algorithm is time-inefficient, it explores only a small portion of the state space to learn the optimal strategy. The remainder of this paper is organized as follows. Section 2 reviews related papers which inspired our work. Section 3 presents essential preliminaries on neural MCTS, the QSAT problem, and graph neural networks. Section 4 introduces our approach to encoding QBFs as graphs and the architecture of our implementation. Section 5 gives our correctness measurement and presents experimental results. Sections 6 and 7 cover discussion and conclusions.
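To make the game reading of QSAT concrete, the following sketch decides a small prenex-CNF QBF by exhaustive search: the existential player wins if some choice leads to a true formula, the universal player if some choice leads to a false one. This is a brute-force illustration, not the neural MCTS approach of the paper. The representation (a prefix of `("E", var)` / `("A", var)` pairs and clauses as lists of signed integers) and the function names are assumptions made for this example.

```python
def eval_cnf(clauses, assignment):
    """True if every clause has at least one literal satisfied by the assignment."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

def existential_wins(prefix, clauses, assignment=None):
    """Return True if the existential player has a winning strategy (the QBF is true)."""
    assignment = dict(assignment or {})
    if not prefix:
        return eval_cnf(clauses, assignment)
    (quantifier, var), rest = prefix[0], prefix[1:]
    outcomes = []
    for value in (False, True):
        assignment[var] = value
        outcomes.append(existential_wins(rest, clauses, assignment))
    # "E": one good move suffices; "A": every move of the opponent must be answered.
    return any(outcomes) if quantifier == "E" else all(outcomes)

# Matrix (x1 or x2) and (not x1 or not x2), i.e. x2 must differ from x1.
matrix = [[1, 2], [-1, -2]]
print(existential_wins([("A", 1), ("E", 2)], matrix))  # forall x1 exists x2
print(existential_wins([("E", 2), ("A", 1)], matrix))  # exists x2 forall x1
```

The two calls differ only in quantifier order: when the existential player moves second it can always answer with the opposite value, whereas moving first it cannot. The search visits every assignment, which is why it only scales to very small instances.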
2 Related Work
In terms of combining a QSAT solver with machine learning, Janota built a competitive QSAT solver, QFUN [8], based on counterexample-guided refinement and machine learning. Although, as in our work, the QSAT problem is treated as a game, their learning does not depend on the game state (i.e., the QBF) but focuses on the assignment pairs from the two players in two consecutive moves (i.e., a move by the existential player and a countermove by the universal player). By training a decision tree classifier with supervised learning, the learning algorithm categorizes the move-countermove pairs into two classes: feasible countermove and infeasible countermove. QFUN progressively solves a QBF by learning moves for the existential player so that there are no feasible countermoves for the universal player. While the performance is compelling, their solver is largely based on a counterexample-guided abstraction refinement algorithm [9], whose design requires human insight and hence cannot be regarded as tabula-rasa. As an alternative methodology, NeuroSAT [16] provides another insight into applying machine learning to such problems. By leveraging graph neural networks [3] and the message passing process [7], they developed a single-bit supervised SAT solver. The algorithm depends on zero knowledge and learns purely from the input formula. In NeuroSAT, Boolean formulas are encoded as graphs so that a specially designed graph neural network can be applied to those
graphs. The target value of the graph neural network is a single bit representing the satisfiability of the input SAT problem. It has been shown that NeuroSAT performs adequately on SAT problems within a reasonable size. When it comes to applying neural MCTS to solve problems in other domains, Xu et al. use a technique called Zermelo Gamification to turn specific combinatorial optimization problems into games so that they can be solved through AlphaZero-like algorithms [21]. They applied their method to a particular combinatorial optimization problem, the Highest Safe Rung. Their result shows that the algorithm can accurately solve such a problem within a given size. Although they applied their method to only one specific problem, their experimental result endorses the idea that there is a generic algorithm (neural MCTS) that can solve well-known problems tabula-rasa. To this extent, our work can be seen as an extension of theirs. To be specific, the Highest Safe Rung problem is polynomial, while QSAT is much harder: it is PSPACE-complete. Our work is also closely related to General Game Playing (GGP) [6]. In terms of using MCTS as a strategy search technique, CADIAPLAYER [4] has proven to be an efficient and competitive GGP player. However, CADIAPLAYER only uses traditional MCTS without a neural network. It turns out that only a few papers in GGP combine MCTS with a neural network. We found only one work, by Rezende et al. [14], which is quite similar to ours. Nevertheless, the neural network used in their research has to be dynamically generated during training, which makes it hard to train and quite different from most neural network architectures in use nowadays.
3 Preliminaries
3.1 Neural Monte Carlo Tree Search
The PUCT algorithm (see footnote 1) implemented in AlphaZero [19,20] is essentially a neural MCTS algorithm that uses a polynomial upper confidence bound [2] and the neural prediction $P_\phi(a|s)$ as the predictor [15]. The algorithm runs multiple search iterations to decide the optimal action for the current state. Each iteration has four phases:

1. SELECT: At the beginning of each iteration, the algorithm selects a path from the root (the current game state) to a leaf (either a terminal state or an unvisited state) in the tree according to the PUCB (see [17] for a detailed explanation of the terms used in the formula). Specifically, suppose the root is $s_0$; we have (see footnote 2):

$$
\begin{aligned}
a_i &= \arg\max_{a} \left[ Q(s_i, a) + c\, P_\phi(a|s_i)\, \frac{\sqrt{\sum_{a'} N(s_i, a')}}{N(s_i, a) + 1} \right] \\
Q(s_i, a) &= \frac{W(s_i, a)}{N(s_i, a) + 1} \\
s_{i+1} &= \mathrm{move}(s_i, a_i)
\end{aligned}
\tag{1}
$$

Footnote 1: There are two ways to interpret this acronym: 1) the Polynomial Upper Confidence Tree [2], as the exploration term under the square root is polynomial (it was usually a logarithmic function in other UCB formulas); 2) the Predictor Upper Confidence Tree [15], because the probability prediction from the neural network is used to penalize the exploration term.
2. EXPAND: Once the select phase ends at a non-terminal leaf, the leaf is fully expanded and marked as an internal node of the current tree. All its children are considered leaf nodes during the next iteration of selection.

3. ROLL-OUT: Normally, starting from the expanded leaf node chosen in the previous phases, the MCTS algorithm uses a random policy to roll out the rest of the game [5]. The algorithm simulates the actions of each player randomly until it arrives at a terminal state, which means the game has ended. The algorithm then uses the outcome of the game as the result evaluation for the expanded leaf node. However, a random roll-out usually becomes time-consuming when the tree is deep. A neural MCTS algorithm instead uses a neural network $V_\phi$ to predict the result evaluation, which saves the time spent on rolling out.

4. BACKUP: This is the last phase of an iteration, in which the algorithm recursively backs up the result evaluation along the tree edges. Specifically, suppose the path found in the SELECT phase is $\{(s_0, a_0), \ldots, (s_{l-1}, a_{l-1}), (s_l, \_)\}$. Then, for each edge $(s_i, a_i)$ in the path, we update the statistics as:

$$
\begin{aligned}
W(s_i, a_i) &:= W(s_i, a_i) + V_\phi(s_l) \\
N(s_i, a_i) &:= N(s_i, a_i) + 1
\end{aligned}
\tag{2}
$$
However, in practice, considering a Laplace smoothing in the expression of $Q$, the following update is actually applied:

$$
Q(s_i, a_i) := \frac{Q(s_i, a_i) \times N(s_i, a_i) + V_\phi(s_l)}{N(s_i, a_i) + 1}
\tag{3}
$$
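The selection rule of Eq. (1) and the running-mean update of Eq. (3) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the `Edge` class, the function names, and the default constant `c=1.0` are hypothetical, and the sketch ignores the sign change of the value between the two players.

```python
import math

class Edge:
    """Statistics stored on one (state, action) edge of the search tree."""
    def __init__(self, prior):
        self.prior = prior  # P_phi(a | s), predicted by the network
        self.n = 0          # visit count N(s, a)
        self.q = 0.0        # running mean value Q(s, a)

def select_action(edges, c=1.0):
    """Return argmax_a Q + c * P * sqrt(sum_a' N(s, a')) / (N(s, a) + 1)."""
    total = sum(e.n for e in edges.values())
    def pucb(e):
        return e.q + c * e.prior * math.sqrt(total) / (e.n + 1)
    return max(edges, key=lambda a: pucb(edges[a]))

def backup(path, value):
    """Apply the update of Eq. (3), then increment N, on every edge of the path."""
    for e in path:
        e.q = (e.q * e.n + value) / (e.n + 1)
        e.n += 1
```

With two equally likely actions, a single bad evaluation of the first action is enough for the exploration term to steer the next selection toward the second one.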
Once the given number of iterations has been reached, the algorithm returns a vector of action probabilities for the current state (root $s_0$). Each action probability is computed as $\pi(a|s_0) = \frac{N(s_0, a)}{\sum_{a'} N(s_0, a')}$. The actual action played by the neural MCTS is then sampled from the action probability vector $\pi$. In this way, neural MCTS simulates the action for each player alternately until the game ends. This process is called neural MCTS simulation, which is the core of self-play.

Footnote 2: Theoretically, the exploratory term should be $\sqrt{\frac{\sum_{a'} N(s_{i-1}, a')}{N(s_{i-1}, a) + 1}}$; however, AlphaZero used the variant $\frac{\sqrt{\sum_{a'} N(s_{i-1}, a')}}{N(s_{i-1}, a) + 1}$ without any explanation. We tried both in our implementation, and it turns out that the AlphaZero one performs much better.
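The root policy $\pi(a|s_0)$ of Sect. 3.1 is simply the normalized visit count, and the move is sampled from it. A minimal sketch follows; the function names are hypothetical, and the uniform fallback for an unvisited root is an assumption that is not stated in the text.

```python
import random

def visit_count_policy(visits):
    """Map visit counts N(s0, a) to probabilities pi(a | s0)."""
    total = sum(visits.values())
    if total == 0:
        # Assumption: fall back to a uniform policy when nothing was visited.
        return {a: 1.0 / len(visits) for a in visits}
    return {a: n / total for a, n in visits.items()}

def sample_action(policy, rng=random):
    """Sample one action according to the probabilities in `policy`."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]
```

For a QSAT game each state has two actions (assign True or False), so after 25 iterations a count such as 20 versus 5 yields the policy 0.8 versus 0.2.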
3.2 QSAT Problems and QSAT Games
A quantified Boolean formula (QBF) is a formula of the following form: $\exists x_1 \forall x_2 \exists x_3 \forall x_4 \ldots \exists x_n . \Phi(x_1, \ldots, x_n)$, where the $x_i$ are distinct Boolean variables. The sequence of quantifiers and variables is called the prefix of a QBF. The propositional formula $\Phi$ is called the matrix of a QBF, and it only uses the variables in $\{x_i\}$. A QBF evaluates to either true or false since there are no free variables; it is satisfiable if it evaluates to true and unsatisfiable otherwise. The problem of determining the truthfulness of a QBF is called the QSAT problem, which is known to be PSPACE-complete [11]. A QSAT problem can be seen as a game between two players: the existential player (the Proponent (P)), who assigns values to the existentially quantified variables, and the universal player (the Opponent (OP)), who assigns values to the universally quantified variables. The two players make moves by assigning values to the variables alternately, following the sequence of quantifiers in the prefix. The existential player (P) wins if the formula evaluates to True, and the universal player (OP) wins if it evaluates to False. There is a connection between the optimal strategy of the QSAT game and the satisfiability of the QSAT problem. The optimal strategy of a QSAT game is a strategy that always keeps the player in a winning position, no matter what move the opponent takes. The QSAT problem is satisfiable (unsatisfiable) if and only if the existential (universal) player has the optimal strategy. Such a strategy (that is, the solution to a QSAT problem, whether it is satisfiable or unsatisfiable) is a Skolem function, which can be represented as a truth table or a binary tree. In our case, the neural MCTS is actually learning the tree representation of the Skolem function and using it as its policy tree.
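The game view of a QBF can be made concrete with a small brute-force evaluator. The sketch below is illustrative only and is not the solver used in this work. It assumes a DIMACS-like encoding (a clause is a list of nonzero integers, where a negative integer denotes a negated variable), assumes that variables are assigned in prefix order, and takes exponential time, so it is suitable only for tiny formulas.

```python
def evaluate_qbf(prefix, clauses, assignment=None):
    """Return True iff the existential player (P) wins under optimal play.

    prefix  : list of (quantifier, variable) pairs, quantifier in {"E", "A"}
    clauses : CNF matrix, a list of clauses, each a list of nonzero ints
    """
    assignment = dict(assignment or {})
    if len(assignment) == len(prefix):
        # Terminal state: P wins iff every clause contains a true literal.
        return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
                   for clause in clauses)
    quantifier, var = prefix[len(assignment)]
    outcomes = (evaluate_qbf(prefix, clauses, {**assignment, var: value})
                for value in (False, True))
    # P needs one winning move; OP wins unless every move keeps P winning.
    return any(outcomes) if quantifier == "E" else all(outcomes)
```

For example, $\forall x_1 \exists x_2 . (x_1 \vee x_2) \wedge (\neg x_1 \vee \neg x_2)$ is true because P can answer with $x_2 = \neg x_1$, whereas swapping the two quantifiers makes the formula false.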
3.3 Gated Graph Neural Networks
In this work, QBFs are encoded as graphs, and a Gated Graph Neural Network (GGNN) [7,13] is applied to embed the QBFs into the neural MCTS framework. Notice that the GGNN is not the only option, and there are alternatives [3,7]. We choose the GGNN because it is easy to implement. The forward pass of the GGNN can be described as follows:

$$
\begin{aligned}
m_v^{t+1} &= \sum_{e} \sum_{w \in N(v)} A_{e_{wv}} h_w^t, \quad t = 0..T \\
h_v^{t+1} &= \mathrm{GRU}(h_v^t, m_v^{t+1}), \quad t = 0..T \\
R &= \sum_{v \in V} \sigma(f(h_v^T, h_v^0)) \odot g(h_v^T)
\end{aligned}
\tag{4}
$$
where $e$ is the edge type in a multigraph, $A_e$ is the edge-weight matrix to be learned during training, $h_v^t$ is the hidden representation of node $v$ at message passing iteration $t$, and $m_v^t$ is the message of node $v$ at iteration $t$. $R$ is the read-out, which aggregates information from each node to generate a global feature target ($\sigma$ is the sigmoid activation function, $f$ and $g$ are MLPs, and $\odot$ denotes the element-wise product). The message passing process iterates $T$ times. During each iteration, each node $v$ computes its message from the hidden representations of its neighbor nodes $N(v)$. After that, a Gated Recurrent Unit (GRU) is used to update the hidden representation of node $v$. The message passing process allows each node's hidden representation to capture the global structure of the entire input graph. Finally, the read-out $R$ is applied to all the nodes to compute the global target of the input graph. The GGNN is invariant to graph isomorphism, which makes it well suited to capturing the symmetry properties among QBFs.
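The forward pass of Eq. (4) can be sketched in plain Python. The sketch rests on several assumptions that the text does not fix: the GRU is the standard gated cell without bias terms, the MLPs $f$ and $g$ are replaced by single linear layers `F` and `G`, all parameter names are hypothetical, and an undirected edge is listed once per direction.

```python
import math

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def vadd(*vs):
    return [sum(t) for t in zip(*vs)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru(h, m, p):
    """Standard GRU cell: new state of one node from state h and message m."""
    z = [sigmoid(t) for t in vadd(matvec(p["Wz"], m), matvec(p["Uz"], h))]
    r = [sigmoid(t) for t in vadd(matvec(p["Wr"], m), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    c = [math.tanh(t) for t in vadd(matvec(p["Wh"], m), matvec(p["Uh"], rh))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, c)]

def ggnn_readout(h0, edges, A, p, F, G, T):
    """h0: {node: vector}; edges: list of (w, v, edge_type), w sends to v;
    A: {edge_type: matrix}; F, G: linear stand-ins for the MLPs f and g."""
    d = len(next(iter(h0.values())))
    h = dict(h0)
    for _ in range(T):
        m = {v: [0.0] * d for v in h}
        for w, v, e in edges:                    # m_v += A_e h_w
            m[v] = vadd(m[v], matvec(A[e], h[w]))
        h = {v: gru(h[v], m[v], p) for v in h}
    out = [0.0] * len(G)
    for v in h:                                  # sum over all nodes
        gate = [sigmoid(t) for t in matvec(F, h[v] + h0[v])]
        out = vadd(out, [a * b for a, b in zip(gate, matvec(G, h[v]))])
    return out
```

Because the read-out sums over nodes and the messages sum over neighbors, relabeling the nodes of the input graph leaves the result unchanged up to floating-point rounding, which is the isomorphism invariance mentioned above.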
4 Implementation
4.1 QBF Graphs
Although the QSAT problem has a simple syntactic structure, symmetries induced by the semantics of propositional logic should not be ignored [10]. The fact that symmetric QBFs are equivalent can improve learning efficiency. In this work, we specially designed a graph encoding of QBFs that helps us capture those symmetries through graph isomorphism. After using the Tseitin transformation to rewrite $\Phi$ in conjunctive normal form (CNF), a QBF is represented as an undirected multigraph (Fig. 1) with two nodes for every variable (one for the literal and one for its negation) and one node for every clause. There are four types of edges in this multigraph: 1) E2A edge, an edge between every consecutive existential literal and universal literal; 2) A2E edge, an edge between every consecutive universal literal and existential literal; 3) L2C edge, an edge between every literal and every clause it appears in; 4) reflexive edge, an edge between each literal and its negation. This design is motivated by three considerations: 1) The sequential information of the prefix is essential to identify the solution of a QBF. Even if two QBFs have the same matrix $\Phi$, a different variable sequence in the prefix might lead to a massive difference in the solution. Therefore, we use the E2A edges and A2E edges to track such sequential information. 2) In a QBF, variables only appear as positive literals in the prefix; however, they can be both positive and negative in the matrix $\Phi$. Hence we naturally represent any variable as two nodes, meaning a pair of complementary literals. 3) Since any literal and its complement are coupled, we use a reflexive edge to capture such entanglement.
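The encoding can be sketched as a function that returns typed nodes and typed edges. Several details are assumptions made for illustration, since the text does not fix them: literal nodes are integers (`v` and `-v`), clause nodes are tuples `("C", index)`, the reflexive edge type is named `"REF"`, and each E2A or A2E edge links literals of the same polarity in consecutive prefix positions.

```python
def encode_qbf(prefix, clauses):
    """Build the multigraph of a prenex-CNF QBF.

    prefix  : list of (quantifier, variable) pairs, quantifier in {"E", "A"}
    clauses : list of clauses, each a list of nonzero ints (DIMACS-like)
    Returns (nodes, edges): nodes maps node -> type, edges is a list of
    (node, node, edge_type) triples, to be read as undirected.
    """
    nodes, edges = {}, []
    for quantifier, var in prefix:
        nodes[var] = nodes[-var] = quantifier      # "E" or "A" literal node
        edges.append((var, -var, "REF"))           # literal <-> its negation
    for (q1, v1), (q2, v2) in zip(prefix, prefix[1:]):
        if q1 != q2:                               # quantifier alternation
            etype = "E2A" if q1 == "E" else "A2E"
            edges.append((v1, v2, etype))          # assumed: same polarity
            edges.append((-v1, -v2, etype))
    for i, clause in enumerate(clauses):
        nodes[("C", i)] = "C"
        for lit in clause:
            edges.append((lit, ("C", i), "L2C"))   # literal <-> clause
    return nodes, edges
```

For a prefix with three variables and a matrix with three clauses, as in Fig. 1, this yields six literal nodes and three clause nodes.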
4.2 Architecture
In our design, the policy-evaluation neural network of the neural MCTS becomes two GGNNs (see Sect. 3.3), one for each player. The reason why we use two
Fig. 1. An example of graph encoding for the QBF: ∃x1 ∀x2 ∃x3 (x1 ∨ x2 ∨ ¬x3 ) ∧ (x2 ∨ x3 ) ∧ (x1 ∨ x3 ). Notice that there are four types of edges, and two types of nodes.
independent neural networks instead of one is that the QSAT game is asymmetric in terms of the winning condition (see footnote 3). As introduced in Sect. 3.2, P wins the game if and only if the QBF evaluates to true, while OP wins the game if and only if the QBF evaluates to false. On the other hand, when it comes to the outcome of the GGNN for two consecutive moves by different players, we noticed that the hidden representation sometimes shows no significant difference between the two players. Hence a single GGNN becomes confused by the input graphs. This issue can be resolved only by separating the neural networks so that both players can learn and progress mutually and consistently. Another fact to notice is that we treat every single QSAT problem as an independent game. During the self-play phase, the neural MCTS algorithm (Sect. 3.1) simulates the move for each player based on the player's GGNN. The neural MCTS takes in the current game state (the QBF graph) and uses the current player's GGNN to do the selection and rollout. After a certain number (25 in our case) of iterations, neural MCTS returns the action probability distribution for the current state. The player samples her next move from this distribution. The simulation alternates between the two players until the game ends, at which point the game outcome is evaluated and stored for the training phase. To call the neural network, the hidden representation $h_v^0$ of each node $v$ is initialized with the type of the node. Specifically, for an existential literal node, the hidden representation is [1, 0, 0, ..., 0]; for a universal literal node, the hidden representation is [0, 1, 0, ..., 0]; and for a CNF clause node, the hidden representation is [0, 0, 1, ..., 0]. Notice that we use 0's to pad the vector to a given length. Another fact to notice is that there are two read-out tasks ($P_\phi$ and $V_\phi$). Hence we use two different sets of aggregation MLPs, one for each task:

$$
R_i = \sum_{v \in V} \sigma(f_i(h_v^T, h_v^0)) \odot g_i(h_v^T), \qquad P_\phi = R_1, \quad V_\phi = R_2
\tag{5}
$$

Footnote 3: The reason why we treat the two players asymmetrically is that our graph encoding of the QBF is based on the CNF of the matrix. Therefore, a negation of the matrix will result in a different CNF, which will change the encoding.

After each self-play simulation, we store the game trace of each player separately as a set of tuples of the form $(s, \pi, v)$, where $s$ is the game state (the QBF graph), $\pi$ is the action probability distribution generated by neural MCTS based on the current state, and $v$ is the game result from the perspective of the current player based on the game outcome. We run such a simulation several times (in our case, ten times) to retrieve enough training data. After that, we train the GGNN independently for each of the players using the training data collected during self-play. After training, we use the newly trained GGNNs to play against each other for 20 rounds and collect the performance data for evaluation and analysis; this is called the arena phase.
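Two bookkeeping steps of this section, the type-based initialization of $h_v^0$ and the per-player storage of $(s, \pi, v)$ tuples, can be sketched as follows. The function names are hypothetical, and encoding the result $v$ as +1 for a win and -1 for a loss is an assumption; the text only says that $v$ is the result from the current player's perspective.

```python
NODE_TYPES = ("existential", "universal", "clause")

def initial_hidden(node_type, size=128):
    """One-hot encoding of the node type, zero-padded to `size` entries."""
    h = [0.0] * size
    h[NODE_TYPES.index(node_type)] = 1.0
    return h

def split_traces(game, p_won):
    """Split one finished game into per-player training tuples.

    game  : list of (player, state, pi) in move order, player in {"P", "OP"}
    p_won : True if the QBF evaluated to true at the end of the game
    """
    traces = {"P": [], "OP": []}
    for player, state, pi in game:
        won = p_won if player == "P" else not p_won
        traces[player].append((state, pi, 1 if won else -1))  # assumed +/-1
    return traces
```

Each player's GGNN is then trained only on its own list, which mirrors the separation of the two networks.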
5 Experiment
5.1 Experiment Setup
The hyperparameters are set as follows: the number of search iterations for neural MCTS is set to 25, and the number of self-play games is set to 100; the number of message passing iterations $T$ is set to 10 for the GGNN; the size of the hidden representation of the GGNN is set to 128. Considering the capacity and computation power of our machine, we generate 100 random QBFs (50 satisfiable and 50 unsatisfiable), each of which has 51 nodes after encoding as a graph (the prefix has 21 quantified variables and the matrix has 9 clauses, so there are 42 literal nodes and 9 clause nodes). Each QBF is regarded as a single game to be played and learned by the neural MCTS. We run the learning iteration (i.e., self-play, training, and arena) for 32 epochs and collect the performance data in the arena phase during each iteration. All the experiments have been run on a MacBook Pro laptop with a 2.9 GHz Intel Core i7 processor and 16 GB of 1600 MHz DDR3 memory.
5.2 Performance Measurement
To measure the performance of the algorithm, we use two metrics: the local correctness ratio and the global correctness ratio. We compute the local correctness ratio of the two players during the arena phase, where the two players compete with each other for 20 rounds. An action is locally correct if it preserves a winning position. It is straightforward to check the local correctness of actions by using an off-the-shelf QSAT solver: GhostQ [12]. The local correctness ratios of the two players after each match in the arena phase are collected. Then we take the average value of their local correctness ratio as the performance measurement for the current training iteration.
Definition 1. Local Correctness for P. Given a QBF $\exists x_1 \forall x_2 \ldots \exists x_n . \Phi$, an action $x^*$ is locally correct if and only if $\forall x_2 \ldots \exists x_n . \Phi[x_1 \backslash x^*]$ evaluates to True.

Definition 2. Local Correctness for OP. Given a QBF $\forall x_1 \exists x_2 \ldots \exists x_n . \Phi$, an action $x^*$ is locally correct if and only if $\exists x_2 \ldots \exists x_n . \Phi[x_1 \backslash x^*]$ evaluates to False.

Since the two neural networks might be inductively biased toward each other, a locally correct solution could still be incorrect, and it could be the case that the OP player is too weak to defeat even a non-optimal P player. That also means that, even if the P player can always defeat his opponent after some iterations of training, it is still unknown to us whether the P player has finally found the optimal strategy. To see whether the neural MCTS learns the correct solution, we measure the global correctness ratio by testing the algorithm against an enumerator. To be specific, if a QBF is satisfiable, then we enumerate all possible moves for the OP (the universal player) and use the enumerator to play against P's neural network. Vice versa for an unsatisfiable QBF. Theoretically, OP's neural network fails to solve the QBF if there is any chance that P's enumerator can win the game. We count the number of winning games for each player and use it to measure the global correctness ratio. A 100% global correctness means not only that the neural MCTS has found the correct solution but also that it provides full support (represented as a winning strategy encoded in the neural network) for that solution. On the other hand, a global correctness below 100% can be treated as a measure of how well the algorithm approximates the solution.
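Both metrics can be sketched on top of a brute-force game evaluator. In this sketch the evaluator `p_wins` stands in for GhostQ, the learned P network is replaced by a deterministic callable `p_policy`, and clauses use a DIMACS-like integer encoding; all of these are assumptions made for illustration, and the enumeration is exponential in the number of universal variables.

```python
from itertools import product

def matrix_true(clauses, assignment):
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

def p_wins(prefix, clauses, assignment):
    """True iff P wins from this partial assignment under optimal play."""
    if len(assignment) == len(prefix):
        return matrix_true(clauses, assignment)
    q, var = prefix[len(assignment)]
    results = (p_wins(prefix, clauses, {**assignment, var: b})
               for b in (False, True))
    return any(results) if q == "E" else all(results)

def locally_correct(prefix, clauses, assignment, value):
    """Definitions 1 and 2: the move must leave a position the mover wins."""
    q, var = prefix[len(assignment)]
    p_win_after = p_wins(prefix, clauses, {**assignment, var: value})
    return p_win_after if q == "E" else not p_win_after

def global_correctness(prefix, clauses, p_policy):
    """Fraction of enumerated OP move sequences that p_policy defeats."""
    universals = [v for q, v in prefix if q == "A"]
    wins = total = 0
    for choice in product((False, True), repeat=len(universals)):
        op_moves = dict(zip(universals, choice))
        assignment = {}
        for q, var in prefix:
            assignment[var] = p_policy(assignment) if q == "E" else op_moves[var]
        wins += matrix_true(clauses, assignment)
        total += 1
    return wins / total
```

On $\forall x_1 \exists x_2 . (x_1 \vee x_2) \wedge (\neg x_1 \vee \neg x_2)$, the policy $x_2 = \neg x_1$ reaches a global correctness of 1.0, whereas a policy that always plays True wins only one of the two enumerated games.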
5.3 General Result
On average, the algorithm took 70 s to finish one complete match, and since there are 100 self-plays during each epoch, the time used to run one epoch is roughly 2.5 h. This is really time-consuming. On the other hand, our experiment shows that the algorithm can correctly determine the truthfulness of all 100 test QBFs. We notice that, for a satisfiable QBF, the existential player can quickly dominate the game and win most of the time, and vice versa for the universal player in an unsatisfiable case. The result indicates that for a satisfiable (unsatisfiable) QSAT problem, the existential (universal) player has a higher chance of winning the corresponding QSAT game against the universal (existential) player. We also measured the algorithm's global correctness ratio for all test cases, and we noticed an asymmetry between the satisfiable and unsatisfiable cases. To be specific, we computed the average global correctness ratio (AGC) for all satisfiable and unsatisfiable QBFs, respectively, and it turns out that the AGC for satisfiable cases is 100%, while the AGC for unsatisfiable cases is 100%. This fact indicates that neural MCTS can solve all small instances perfectly.
5.4 Two Examples
In this section, for illustration, we show the experimental results for a satisfiable QSAT instance and an unsatisfiable one (described in Fig. 2 and Fig. 3, where, due to limited space, we only show the matrix of the QBF). One can see that, in Fig. 2, the local correctness ratio of the existential player (P) soars after the first epoch, while in Fig. 3, the local correctness ratio of the universal player (OP) increases rapidly. Even though there are fluctuations, one of the players always dominates the game; this phenomenon is treated as an indication of the truthfulness of the QSAT instance. Also, notice that the curves in the unsatisfiable case fluctuate more violently than the ones in the satisfiable case. This means that even though the player can dominate the game, dominating an unsatisfiable QSAT game might be harder than dominating a satisfiable one. In terms of the global correctness ratio, both instances reached 100% correctness, which means the neural MCTS not only makes the correct decision but also constructively supports its decision.
Fig. 2. Local correctness ratio measured for a satisfiable QBF during self-play. The matrix of the QBF is listed on the right side in QDIMACS format. The X-Axis is the number of Epochs.
6 Discussion
6.1 Exploration vs. Exploitation
One of the known issues of self-play is that the two players will always mutually bias their strategy to fit with the other’s strategy through exploiting their experiences. This mutually inductive bias facilitates the learning process of the players when they proceed at the same pace and have about the same skill level. However, once the learning speeds are unbalanced, the mutually inductive bias foils the improvement of players’ performance by stagnating their strategies in a local optimum. To understand this issue, one can think about a game between an expert and a novice. Since the expert can easily find a strategy to win against the novice, the novice will always lose the game. And because there is no positive feedback at all, the novice will build a biased belief that there is no way to win
Fig. 3. Local correctness ratio measured for an unsatisfiable QBF during self-play. The matrix of the QBF is listed on the right side in QDIMACS format. The X-axis is the number of epochs.
the game. Such a belief can be strengthened during self-play, and finally, it leads to some fixed losing strategy. On the other side, since the opponent is not so challenging, the expert will also stay with the current strategy without any attempt to improve it. Nevertheless, we notice that neural MCTS is resilient to mutually inductive bias. Whenever the learning paces are unbalanced, the weaker player's decisions become indifferent (i.e., no matter what moves she takes, she will always lose the game). Neural MCTS then pushes those indifferent actions into a uniform distribution, thereby encouraging exploration through random moves. Consequently, neural MCTS adaptively balances exploration and exploitation, thus jumping out of the local optimum.
6.2 State Space Coverage
Neural MCTS is capable of handling a large state space [20]. Such an algorithm must search only a small portion of the state space and make its decisions from those limited observations. To measure the state space coverage, we recorded the number of states accessed during the experiment. For each QSAT game, we count the total number of states accessed during each game in self-play, and we compute the 10-game moving average of states accessed over all self-plays (Fig. 4). This result indicates an implicit adaptive pruning mechanism behind the neural MCTS algorithm, which can be regarded as a justification for its capability of handling large state spaces.
6.3 Limitation
We notice that neural MCTS is substantially time-consuming (Fig. 5). This problem is caused by accessing the graph neural network during each simulation. And computing on graph neural networks has turned out to be time-consuming. We think this is a common issue for such an algorithm. But the real power of neural
Fig. 4. Average states accessed during self-play for QSAT problem described in Fig. 2. As a comparison, there are 226599 states in total.
MCTS comes from its accumulated knowledge acquired during each self-play. That means after the neural network has been trained, it can be applied to any new problem instances immediately. Another issue is that our test cases are restricted to a limited size. Because QSAT is known to be PSPACE-complete, verifying the correctness of the algorithm is time-consuming (the verification time increases exponentially with the number of variables in the QBF). On the other hand, the strategy learned by the neural MCTS algorithm is implicitly encoded inside the neural network, and there is no way to extract such a strategy so that it can be explicitly verified by a more efficient approach from Formal Methods. Therefore, using an enumerator to verify the correctness is inevitable for the time being. As a result, even though neural MCTS can handle a deep game tree and hence a large number of variables, it is still hard or even impossible to verify the learning outcome.
Fig. 5. Time used to finish one match. The measurement is based on the instance from Fig. 2.
7 Conclusion
Intrigued by the astonishing achievements of AlphaZero, we attempt to use the neural MCTS algorithm to solve a practical problem: QSAT. In our proof-of-concept work, we make two main contributions. First, we propose a way to encode QBFs as undirected multigraphs, which bridges the logic formula representation of QBFs with the graph neural network input. Second, we use two separate graph neural networks to build our neural MCTS variant. Such a design can significantly reduce the learning confusion caused by the asymmetry between the two players. Our evaluation is based on two metrics: the local and global correctness ratios. The local metric, which relies on an off-the-shelf QSAT solver, only focuses on the correctness in a single game, yet it imposes no constraints on the number of variables in the formula. The global metric, which relies on an enumerator, can determine the exact correctness of the learned neural MCTS, but it is sensitive to the number of variables in the formula. Our experimental results are positive on the given limited-size test cases, which justifies the feasibility of our idea to some extent. We demonstrate that neural MCTS explores only a very small fraction of the search space on our test case, yet it finds the correct solution. For future work, it may be worthwhile to figure out how to explain the learned neural MCTS or how to extract the generated strategy from the neural network. It would also be useful to study how to optimize the current algorithm so that it can handle larger cases. Our objective is not necessarily to improve the state of the art of hand-crafted problem solvers in specific areas but to illustrate that there is a generic algorithm (neural MCTS) that can solve well-known problems tabula-rasa. The hope is that neural MCTS will help solve future algorithmic problems that have not yet been solved by humans. We view neural MCTS as a helper for humans solving algorithmic problems in the future.
We also hope our research sheds some light on the remarkable but mysterious learning ability of the neural MCTS algorithm from AlphaZero.
References
1. Anthony, T., Tian, Z., Barber, D.: Thinking fast and slow with deep learning and tree search. In: Advances in Neural Information Processing Systems, pp. 5360–5370 (2017)
2. Auger, D., Couëtoux, A., Teytaud, O.: Continuous upper confidence trees with polynomial exploration – consistency. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds.) ECML PKDD 2013. LNCS (LNAI), vol. 8188, pp. 194–209. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40988-2_13
3. Battaglia, P.W., et al.: Relational inductive biases, deep learning, and graph networks (2018). arXiv preprint arXiv:1806.01261
4. Bjornsson, Y., Finnsson, H.: Cadiaplayer: a simulation-based general game player. IEEE Trans. Comput. Intell. AI Games 1(1), 4–15 (2009)
5. Browne, C., et al.: A survey of monte carlo tree search methods. IEEE Trans. Comput. Intell. AI Games 4(1), 1–43 (2012)
6. Genesereth, M., Love, N., Pell, B.: General game playing: overview of the AAAI competition. AI Mag. 26(2), 62–62 (2005)
7. Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 1263–1272 (2017), JMLR.org
8. Janota, M.: Towards generalization in QBF solving via machine learning. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 6607–6614 (2018)
9. Janota, M., Klieber, W., Marques-Silva, J., Clarke, E.: Solving QBF with counterexample guided refinement. In: Cimatti, A., Sebastiani, R. (eds.) SAT 2012. LNCS, vol. 7317, pp. 114–128. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31612-8_10
10. Kauers, M., Seidl, M.: Symmetries of quantified boolean formulas (2018). arXiv, abs/1802.03993
11. Kleinberg, J., Tardos, E.: Algorithm Design. Addison-Wesley Longman Publishing Co. Inc., Boston (2005)
12. Klieber, W., Sapra, S., Gao, S., Clarke, E.: A non-prenex, non-clausal QBF solver with game-state learning. In: Strichman, O., Szeider, S. (eds.) SAT 2010. LNCS, vol. 6175, pp. 128–142. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14186-7_12
13. Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.S.: Gated graph sequence neural networks. In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016, Conference Track Proceedings (2016)
14. Rezende, M., Chaimowicz, L.: A methodology for creating generic game playing agents for board games. In: 2017 16th Brazilian Symposium on Computer Games and Digital Entertainment, SBGames, pp. 19–28. IEEE (2017)
15. Rosin, C.D.: Multi-armed bandits with episode context. Ann. Math. Artif. Intell. 61(3), 203–230 (2011)
16. Selsam, D., Lamm, M., Bunz, B., Liang, P., de Moura, L., Dill, D.L.: Learning a sat solver from single-bit supervision (2018). arXiv preprint, arXiv:1802.03685
17. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484 (2016)
18. Silver, D., et al.: A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)
19. Silver, D.: Mastering chess and shogi by self-play with a general reinforcement learning algorithm (2017). CoRR, abs/1712.01815
20. Silver, D., et al.: Mastering the game of Go without human knowledge. Nature 550, 354 (2017)
21. Xu, R., Lieberherr, K.: Learning self-game-play agents for combinatorial optimization problems. In: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2019, pp. 2276–2278. IFAAMS (2019)
Laplacian Pyramid-like Autoencoder
Sangjun Han (1), Taeil Hur (2), and Youngmi Hur (3, corresponding author)
1 School of Mathematics and Computing (Mathematics), Yonsei University, Seoul, South Korea [email protected]
2 JENTI Inc., Seoul, South Korea [email protected]
3 Department of Mathematics, Yonsei University, Seoul, South Korea [email protected]
Abstract. In this paper, we develop the Laplacian pyramid-like autoencoder (LPAE) by adding the Laplacian pyramid (LP) concept, which is widely used to analyze images in signal processing. LPAE decomposes an image into an approximation image and a detail image in the encoder part and then tries to reconstruct the original image in the decoder part using the two components. We use LPAE for experiments in classification and super-resolution. Using the detail image and the smaller-sized approximation image as inputs of a classification network, our LPAE makes the model lighter. Moreover, we show that the performance of the connected classification networks remains substantially high. In super-resolution, we show that the decoder part produces a high-quality reconstruction image when it is set to resemble the structure of the LP. Consequently, LPAE improves the original results by combining the decoder part of the autoencoder and the super-resolution network.
Keywords: Deep learning · Autoencoder · Laplacian pyramid · Classification · Acceleration · Super-resolution
1 Introduction
Deep neural networks are standard machine learning methods for diverse image processing tasks such as object classification, image transformation, and image recognition. The networks vary greatly in architectures, algorithms, and processes. The autoencoder is one of these varieties. The autoencoder encodes given data into a representation in a latent space, usually compressed from the input data, using a few layers. Then it decodes this representation, using different layers, into a reconstruction that has the desired properties. The encoder has the advantage of analyzing the data in a low-dimensional space, in a way similar to Principal Component Analysis. Also, the simplicity of the model structure makes it easy to modify the structure for various purposes such as unsupervised pre-training, denoising, and image restoration.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 59–78, 2022. https://doi.org/10.1007/978-3-031-10464-0_5
Our paper is motivated by [6], whose authors develop the wavelet-like autoencoder (WAE) and use it for acceleration in classification networks. WAE decomposes the input image into two down-scaled images, low-frequency information $I_L$ and high-frequency information $I_H$, through the encoder and reconstructs the original image at the decoder. Their use of the prefix “wavelet-like” is due to the condition imposed on $I_H$ to be sparse and to the reconstruction process, which adds the convolution-filtered versions of $I_L$ and $I_H$. In WAE, to accelerate the classification model, they feed $I_L$ as the mainstream input and $I_H$ as a helper to the classification networks (e.g., VGG16, ResNet50) instead of the original image. This change gives the network a smaller computational complexity; hence the entire process takes less time. Besides, the complementary analysis using the helper $I_H$ keeps the network competitive in terms of accuracy. Although WAE is good at accelerating the basic classification networks, it is not satisfactory in some crucial aspects. First, contrary to the wavelet, WAE does not impose any condition on the low-frequency information. This missing condition makes it hard for the approximation $I_L$ to reflect the original image, which can degrade classification performance. Second, such a low-frequency image can lower the quality of the reconstruction image. The WAE paper does not pay attention to the reconstructed result because its primary concern is the acceleration problem, which requires only the two decomposed images in the classification networks. This limited architecture makes it hard to use in other areas that require a reconstruction. Third, the name of the autoencoder is “wavelet-like,” but the model is missing a critical feature of the wavelet, namely the multi-scale property. Consequently, WAE has difficulty decomposing multiple times, which restricts the model to the acceleration task.
Considering the preceding, we propose a new model named the Laplacian pyramid-like autoencoder (LPAE). We impose an extra condition on the low-frequency part of WAE and obtain an autoencoder with a hierarchy, similar in shape to the Laplacian pyramid (LP) introduced in [5]. As a result, LPAE produces an approximation image of better quality. Using this approximation image, we not only obtain higher classification performance but also extend the model to super-resolution problems with $2^k$ magnification for various $k$, since LPAE can decompose and reconstruct an image multiple times. The datasets used for classification are ImageNet2012 (ImageNet) from [33] and Intel Image Classification (Natural Scene) from [18]. We combine two base networks, VGG16 (VGG) and ResNet50 (ResNet), with WAE and LPAE. LPAE shows better classification performance than WAE. LPAE still accelerates test times sufficiently, although its speed-up is slightly smaller than that of WAE. In some cases, LPAE even has a shorter total training time than WAE. For the super-resolution problem, we use three datasets, CelebA [26], DIV2K [35], and Set5 [4]. After training on CelebA, the test is done on CelebA. After training on DIV2K, the test is carried out on Set5, following the convention in the field. The base network is WaveletSRNet (WaveSR) introduced in [14]. Since WaveSR uses the wavelet packet transform for image reconstruction, we
Laplacian Pyramid-like Autoencoder
can directly see the effectiveness of LPAE for super-resolution by replacing the wavelet part with LPAE. This result in super-resolution shows the potential usefulness of the proposed LPAE for solving other problems in deep learning because the classification and the super-resolution are different types of problems.
2 Related Works
2.1 Autoencoder
Recently, most research on the autoencoder concentrates on its connection to other applications rather than on developing it independently. Generative autoencoders [10,31] occupy a significant portion as base models for applications. In [17], a variational autoencoder (VAE) solves the problem of insufficient data by producing synthetic data from given sample data. The author in [27] develops a reference-based SR approach using VAE, which can take any image as a reference. The author in [36] suggests a novel generative model with a hierarchical structure based on VAE. By exploiting depthwise separable convolution and regular convolution, the authors achieve high performance without modifying the loss function of VAE. Other existing autoencoders, such as the sparse autoencoder and the denoising autoencoder, are used to solve specific problems. For example, [15] composes a stacked sparse denoising autoencoder for detecting electricity theft, and [13] exploits a sparse autoencoder for landslide susceptibility prediction. Because the autoencoder extracts features efficiently and reconstructs data with a simple structure, it is a popular tool for developing frameworks combined with machine learning. Our model, LPAE, possesses the properties of LP, so it can provide new approaches to diverse problems where LP properties are helpful.
2.2 Network Acceleration
After the substantial progress of the convolutional neural network (CNN), much research focuses on accelerating networks while keeping high performance. There are several approaches to accelerating networks. The authors in [3,6] propose methods that modify an architecture or the overall structure. The author in [3] suggests an accelerator based on the observation that a convolution layer can be computed similarly to a fully connected layer. Moreover, to accelerate model training, some studies present a new training framework. The author in [41] builds an assistant model alongside the main model to remove trivial instances during training, and then obtains the results with the new training algorithm. The author in [25] suggests a new training plan for fine-tuning, which accelerates fine-tuning by adaptively freezing the nearly converged portion of the network. Since many natural language processing and classification models rely on fine-tuning a pretrained network, this method is appealing. Furthermore, studies that seek harmony between hardware and software are actively pursued [16,30]. Considering rapidly advancing hardware, both propose a hardware-based approach to speed up a CNN and reduce the energy required for computation.
2.3 Single Image Super-Resolution
In the super-resolution area, many studies try to build a deep convolution-based model using up-sampling. For example, [19] introduces a deep network for super-resolution trained with residual learning and a high initial learning rate, while [22] suggests enhancing the power of the deep network by removing batch normalization and setting up a training pipeline. The author in [12] builds a model that iteratively repeats up-scaling and down-scaling and reflects the error feedback. In addition to the construction of deep models, some researchers focus on other points. By learning the feature correlations of hidden layers through second-order attention, [8] improves the expressive power of CNN. The author in [34] trains the content-adaptive resampling (CAR) model together with the main super-resolution (SR) network to create a low-resolution image. Since unsupervised training is conducted on CAR, CAR produces a down-scaled image that keeps important information for super-resolution. The primary concern of [7] and [20] is the speed of SR. While both of them construct a new network structure and framework, neural architecture search is a notable characteristic of [7]. The LPAE that we propose in this paper is an assistant model for reconstructing images that is trained with the main SR model, similar to the approach in [34]. Also, LPAE learns the correlation between the components for super-resolution and can improve the reconstructing power of the main SR model, similar to [8]. However, LPAE is an autoencoder with a straightforward structure and is easy to connect with diverse architectures without much modification.
3 Laplacian Pyramid in Neural Network
3.1 Idea and Structure of LP
The Laplacian pyramid (LP) is introduced in [5] as a technique for compact image encoding. This technique has its root in subtracting a low-pass filtered image from the original image. Such subtraction reduces redundant information by decreasing the correlation between neighboring pixels in an image; in other words, it compresses the data. Moreover, this compression process can be repeated on low-pass filtered images having different scales. The repetitions then form a pyramid-like structure and accelerate the reduction. Since this process is similar to sampling the Laplacian operator on diverse scales, the pyramid is named the Laplacian pyramid. The following is the overall process of establishing LP. For an input image $I_0$, a filtered image $I_1$ with decreased resolution is obtained by some low-pass filter. A filtered image $I_2$ is similarly obtained from $I_1$, and proceeding successively results in a sequence $\{I_k\}_{k=0}^{K}$. For each fixed $I_k$ with $k \ge 1$, the image $\tilde{I}_k$ with the same size as $I_{k-1}$ is defined by expanding $I_k$ by interpolation. Subtracting $\tilde{I}_k$ from $I_{k-1}$ produces $d_k$ with the same size as $I_{k-1}$ but with compressed image information. The sequence $\{d_k\}_{k=1}^{K}$ of such differences corresponds to the pyramid of LP.
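The construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: 2 × 2 block averaging stands in for the low-pass filter, nearest-neighbour repetition stands in for the interpolation, and the function names (`downsample`, `expand`, `build_pyramid`, `reconstruct`) are hypothetical rather than taken from [5] or from the authors' code.

```python
import numpy as np

def downsample(img):
    """Low-pass filter and halve the resolution via 2x2 block averaging (assumed filter)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def expand(img):
    """Double the size by nearest-neighbour repetition (assumed interpolation)."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

def build_pyramid(i0, levels):
    """Return the differences [d_1, ..., d_K] and the coarsest image I_K."""
    details, current = [], i0
    for _ in range(levels):
        smaller = downsample(current)               # I_k
        details.append(current - expand(smaller))   # d_k = I_{k-1} - expanded I_k
        current = smaller
    return details, current

def reconstruct(details, coarsest):
    """Invert the pyramid: I_{k-1} = expanded I_k + d_k, from coarse to fine."""
    current = coarsest
    for d in reversed(details):
        current = expand(current) + d
    return current

rng = np.random.default_rng(0)
image = rng.random((16, 16))    # side length must be divisible by 2**levels
details, coarsest = build_pyramid(image, levels=3)
restored = reconstruct(details, coarsest)
```

Because each difference is computed against the same expansion that is used during reconstruction, `restored` matches `image` up to floating-point error, whatever filter and interpolation are chosen.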
3.2 Strengths of LP and Applications on Neural Net
According to the above procedure, LP decomposes an image $I$ into a low-pass filtered image $I_c$ and a difference $I_d$ after one stage. The decomposition result of LP is redundant compared to the wavelet transform, which is often used as an analysis tool, because the $I_d$ part of LP has the same size as the original image. But some advantages originate from the structure of LP. For example, the implementation is straightforward because it requires only low-pass filtering and subtraction. Another strength of LP is that there is no scrambled frequency [9] derived from high-pass filtering. The organizational style of LP also enables the perfect reconstruction of the original image. Such advantages encourage many uses of LP as a tool in traditional image processing. In addition, there have been many efforts in machine learning to use LP, both independently and as a concept in a model. For the super-resolution problem, [2] and [43] obtain competitive results by introducing the style of LP in the attention module and by considering a generator of the conditional generative adversarial network, respectively. On the other hand, [21,23] use LP itself to make input images. After generating the image pyramid from the original image, [23] feeds the components into two modules for high-quality transfer: a drafting module that transfers artistic style from the low-resolution image and a revision module that refines the transferred image with a high-resolution image. On top of these, there are applications to diverse problems such as object detection [42] and image compression [37]. These studies show the possibility of a connection between LP and deep learning, presenting satisfactory results. As will be seen later in this paper, our LPAE presents convincing results, which add to this possibility.
4 Laplacian Pyramid-Like Autoencoder
We propose a simple autoencoder model, the Laplacian pyramid-like autoencoder (LPAE), which has properties of LP such as a hierarchical structure and separate analysis and reconstruction parts. LPAE is then connected to classification networks and super-resolution networks to show its effectiveness.
4.1 Proposed Model
LPAE has two parts, an encoding (analysis) part and a decoding (reconstruction) part. Figure 1 shows the overall structure of LPAE. In the analysis part, LPAE decomposes the original image $I$ into images $I_c$ and $I_d$. $I_c$ is the down-sampled approximation image that is low-pass filtered, and $I_d$ is the detail image representing the difference between the original image and the prediction from the approximation. Both outputs pass through 4 convolution layers. When we downsample images, we use a convolution layer with stride 2 instead of a pooling layer. In the reconstruction part, the output image $\hat{I}$ is reconstructed by an element-wise sum of the detail image $I_d$ and the prediction image $\phi(I_c)$, where $\phi$ is a
Fig. 1. Overall structure of LPAE.
deconvolution process. The process $\phi$ consists of up-sampling with a 4 × 4 transposed convolution layer and filtering with 3 convolution layers. In our LPAE, except for the output layers, we use convolution layers with filter size 3 × 3 and 16 output channels for simplicity and efficiency, but the network can be made deeper or more redundant.
4.2 Loss for LPAE
Our loss function consists of three components that make the autoencoder similar to LP. Since we want LPAE to reconstruct the original image as closely as possible, we define our first loss function to be the reconstruction loss,

$$l_r = \frac{1}{|I|} \|I - \hat{I}\|_1,$$

where $\hat{I}$ denotes the reconstructed image. For the approximation image to represent the low-frequency channel of the original image well, we prepare a reference approximation image $I_\downarrow$ by using the bicubic interpolation and apply the mean square error (MSE). We then define our second loss function as the energy (or approximation) loss,

$$l_e = \frac{1}{|I_c|} \|I_c - I_\downarrow\|_2^2.$$
Fig. 2. Results of LPAE and WAE for two images from ImageNet. Original images are in (a). Panels: (b) approximation image, (c) detail image, (d) reconstruction image. In (b)-(d), left images are LPAE results, and right images are WAE results.
For the approximation image $I_\downarrow$, many other methods, including the wavelet transform by CDF 9/7 filters, can also be used, but we find no significant difference in the loss when using different approximation methods. Hence in this paper, we fix the bicubic interpolation to get $I_\downarrow$. The bicubic interpolation makes a natural connection with the super-resolution problem. To constrain the detail image to be sparse, we set our last loss function to be the sparsity loss,

$$l_s = \frac{1}{|I_d|} \|I_d\|_2^2.$$
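As a minimal sketch, the three loss terms of this section and their weighted sum can be written with NumPy as follows. The function names are hypothetical, the down-scaled reference is passed in directly rather than computed by bicubic interpolation, and the toy arrays are made up for illustration; the weights 1, 0.8, and 1 are the values reported in this section.

```python
import numpy as np

def reconstruction_loss(original, reconstructed):
    """l_r: L1 norm of the reconstruction error divided by the number of elements."""
    return np.abs(original - reconstructed).mean()

def energy_loss(approx, downscaled_reference):
    """l_e: mean squared error between I_c and the down-scaled reference image."""
    return ((approx - downscaled_reference) ** 2).mean()

def sparsity_loss(detail):
    """l_s: squared L2 norm of the detail image divided by the number of elements."""
    return (detail ** 2).mean()

def total_loss(l_r, l_e, l_s, alpha=1.0, beta=0.8, gamma=1.0):
    """Weighted sum of the three terms."""
    return alpha * l_r + beta * l_e + gamma * l_s

# Toy arrays (not real images) chosen so the values are easy to check by hand.
original = np.array([[1.0, 2.0], [3.0, 4.0]])
reconstructed = original + 0.5
approx = np.array([[1.0]])
reference = np.array([[3.0]])
detail = np.array([[0.0, 2.0]])

l_r = reconstruction_loss(original, reconstructed)   # 0.5
l_e = energy_loss(approx, reference)                 # 4.0
l_s = sparsity_loss(detail)                          # 2.0
l_total = total_loss(l_r, l_e, l_s)
```

In a training framework the same expressions would be applied to tensors so that gradients can flow through them; the arithmetic is unchanged.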
This sparsity loss makes the detail image carry the high-frequency channel of the original image and provide textures to the connected network. The overall loss of LPAE is defined as the weighted sum of the three losses: $l_{total} = \alpha l_r + \beta l_e + \gamma l_s$, where $\alpha = \gamma = 1$ and $\beta = 0.8$, which gives less weight to the approximation loss than to the other losses. Figure 2 shows the results of LPAE and WAE applied to two different images from ImageNet. For the approximation image in Fig. 2(b), the LPAE result is clearly more vivid and closer to the original. Besides, the detail images in Fig. 2(c) demonstrate that this precise approximation makes the detail image sparse. Although sparse, the detail contains much information on the original texture since, unlike in WAE, it has the same spatial size as the original image. Eventually, LPAE produces a sharper reconstruction, as seen from the images in Fig. 2(d). These differences support the validity of our loss function.
4.3 Image Classification Problem
In [6], the authors show that WAE can accelerate classification networks while keeping the accuracy almost the same by using the two outputs of the encoding part. We show that LPAE can accelerate classification networks to a similar level as WAE, with a slight improvement in accuracy over WAE. We think that using both the approximation and the detail, similar to [6], contributes to comparable accuracy, even though the input to VGG is not the original image. In this sense, we speculate that accuracy improves for LPAE because LPAE produces a better approximation and a higher-resolution detail than WAE. To be comparable with the experiments in [6], we use the same structure except for the autoencoder part, replacing WAE with LPAE. Below we describe the use of the autoencoder only for VGG, as the process for ResNet is similar. Recall that the encoder part of LPAE decomposes an original image $I$ into an approximation image $I_c$ and a detail image $I_d$. The features $f_c$ for $I_c$ are extracted by the feature extraction part of VGG. The features $f_d$ for $I_d$ are extracted by another, lighter feature extraction part consisting of convolution layers with only a quarter of the output channels used for $f_c$. Using fully connected layers, we obtain the classification score $s_c$ from $f_c$, and $s_d$ from the concatenation of $f_c$ and $f_d$. The final score $s$ is the average of the two scores $s_c$ and $s_d$. From the acceleration perspective, according to [6], the total complexity of all the convolutional layers can be represented by $O(N)$, where

$$N = \sum_{l=1}^{d} n_{l-1} \cdot s_l^2 \cdot n_l \cdot m_l^2. \qquad (1)$$

Here, $d$ is the number of the convolution layers. For the $l$-th layer, $n_l$ is the number of the output channels, $s_l$ is the size of the kernels, and $m_l \times m_l$ is the spatial size of the output features.
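Formula (1) is straightforward to evaluate in code. The sketch below uses a hypothetical two-layer stack (not the actual VGG16 or ResNet50 configuration) and a hypothetical helper `conv_complexity`; it only illustrates that halving the output size $m_l$ of every layer quarters every term of the sum.

```python
def conv_complexity(layers):
    """Evaluate N = sum over layers of n_{l-1} * s_l^2 * n_l * m_l^2.

    Each layer is a tuple (in_channels, kernel_size, out_channels, output_size).
    """
    return sum(n_in * s * s * n_out * m * m for n_in, s, n_out, m in layers)

# Hypothetical layer stack, chosen only for illustration.
full = [(3, 3, 64, 224), (64, 3, 64, 224)]

# Same layers applied to feature maps of half the spatial size.
half_res = [(n_in, s, n_out, m // 2) for n_in, s, n_out, m in full]

n_full = conv_complexity(full)
n_half = conv_complexity(half_res)
ratio = n_half / n_full   # every term scales by (1/2)**2
```

The same helper can be used with other scalings of channel counts or spatial sizes to compare configurations.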
Since LPAE has feature maps of $\frac{1}{2}$ size for the approximation, the complexity $N$ in (1) is $\frac{1}{4}$ of that of the original network. However, unlike WAE, there is no change in the size of feature maps for the detail. Still following the setup of WAE for the detail, the number of the output channels for the LPAE's detail becomes $\frac{1}{4}$ compared to the approximation case, so the complexity becomes $\frac{1}{16}$ of the original network. As a result, LPAE has $\frac{1}{4} + \frac{1}{16} = \frac{5}{16}$ total complexity compared to the original. Thus the acceleration rate is about $1/\frac{5}{16} = 3.2$. This number is comparable to 3.76, the acceleration rate of WAE.
4.4 Applications to Super-Resolution Problem
As observed earlier, LPAE's approximation image in Fig. 2 tends to be similar to other low-pass filtered images, such as the approximation image of the wavelet transform or the bicubic interpolation. This approximation image helps the detail to be sparser. Besides, LPAE can form a hierarchical structure because the approximation carries sufficient low-frequency information. Based on these observations, we try to expand the application domain of LPAE to super-resolution. There are many algorithms for the super-resolution problem, and some of them are hard to connect to LPAE directly in a natural way. So we choose models having room for substitution, and WaveletSRNet (WaveSR) in [14] is one such model. The basic concept of the network is the use of the wavelet packet transform (WPT) as a reconstructor. For example, to train WaveSR for magnification of 4, an input image is decomposed into two levels using WPT. Then the network takes the approximation image of $\frac{1}{4}$ size and extracts features from it through the embedding part. From the features, the network forms as many new details as the reconstruction process needs (15 for magnification of 4) and one approximation image. At last, through WPT, the network creates the high-resolution image with magnification of 4. In this procedure, the authors set the loss function of WaveSR to make the new details and approximation similar to the decomposition results of WPT. In this algorithm, we insert LPAE as a substitute for WPT. The separate encoding and decoding parts take the roles of the existing decomposition and reconstruction, respectively. Moreover, the hierarchical structure of LPAE does not restrict its use to a magnification of 2 but allows an arbitrary magnification of $2^k$. Although LPAE produces larger details than WPT, the redundancy of information is helpful for reconstruction. And the smaller number of detail images makes the network efficient.
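The trade-off between the two decompositions can be made concrete by counting components. The sketch below assumes a full wavelet packet decomposition (every sub-band split at every level) and dyadic image sizes; the helper names and the 128 × 128 example size are hypothetical and serve only to compare the number and size of the images each scheme produces.

```python
def wpt_components(height, width, levels):
    """Full wavelet packet transform: 4**levels sub-images, all of the same size."""
    size = (height >> levels, width >> levels)
    return {"approx": size, "details": [size] * (4 ** levels - 1)}

def lp_components(height, width, levels):
    """Laplacian-pyramid-style decomposition: one detail per level, each at that level's input size."""
    details = [(height >> j, width >> j) for j in range(levels)]
    return {"approx": (height >> levels, width >> levels), "details": details}

def total_pixels(components):
    """Total number of pixels over the approximation and all detail images."""
    shapes = [components["approx"]] + components["details"]
    return sum(h * w for h, w in shapes)

# Two levels correspond to magnification 4.
wpt = wpt_components(128, 128, levels=2)   # 15 details of 32 x 32
lp = lp_components(128, 128, levels=2)     # details of 128 x 128 and 64 x 64
```

Under these assumptions the packet transform keeps exactly as many pixels as the input, while the pyramid keeps more pixels in far fewer images, which is the redundancy and efficiency trade-off noted above.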
In addition, we expect the substitution to improve the high-resolution result because, in contrast with WPT, the reconstruction of LPAE can be improved by fine-tuning on various datasets. We modify the loss function given in [14] to fit LPAE's situation when we train WaveSR substituted with LPAE (named LPSR). The reconstruction loss is defined by using the $L_1$ distance between the reconstruction $\hat{I}$ and the original high-resolution image $I$,

$$l_{rec} = \frac{1}{|I|} \|I - \hat{I}\|_1.$$
The use of $L_1$ loss for the whole procedure can reduce the smoothing effect and enhance the quality of the image [44]. Based on this observation, we define the pyramid loss by

$$l_p = \frac{1}{|I_c|} \|I_c - \hat{I}_c\|_1 + \sum_i \lambda_i \frac{1}{|I_{d_i}|} \|I_{d_i} - \hat{I}_{d_i}\|_1,$$
where $\{\lambda_i\}$ are weights for the detail part. For instance, we take $\lambda_1 = 0.8$, $\lambda_2 = 1.2$ in the case of magnification of 4. Finally, the loss function of LPSR is the weighted sum of the two losses: $l_{total} = \gamma l_{rec} + \delta l_p$. If the new details $\hat{I}_{d_i}$ and approximation $\hat{I}_c$ produced by LPSR are sufficiently close to the outputs $\{I_{d_i}, I_c\}$ of LPAE, the reconstruction using the new ones will have high quality at the high-resolution stage. From this point of view, we focus on the closeness between the outputs of LPAE and LPSR. Hence in all of our experiments, weights $\gamma = 1$ and $\delta = 10$ are used.
4.5 Details on Experiments
In classification, as mentioned earlier, we use two backbones, VGG16 and ResNet50, and two datasets, ImageNet2012 (ImageNet) and Intel Image Classification (Natural Scene). ImageNet is a large dataset with approximately 1.2 million training images and 50k validation images. The images are labeled with 1000 classes and have varying sizes. In comparison to ImageNet, Natural Scene is a tiny dataset of six categories. There are 14k images for training and 10k for validation and test. Moreover, the spatial size of each image is fixed at 150 × 150. We mostly follow the training strategies in [6] to train the autoencoders for comparison with WAE. In particular, for both WAE and LPAE, the Xavier algorithm is used for the initialization of parameters, and the SGD algorithm with momentum 0.9 and weight decay of 0.0005 is used. However, some options are chosen differently because of the dataset size. Although we keep a batch size of 4 as in [6] for Natural Scene, we choose 256 for ImageNet to shorten the training time. Also, our training epochs are fixed at 100 for Natural Scene and 20 for ImageNet. Considering the difference between the two autoencoders, we choose the initial learning rate differently. For WAE, it is set to 0.000001 with a decay factor of 0.1 after every 10 epochs as in [6], but for LPAE it is set to 0.01 with the same decaying strategy.
To train the connected classification network, for Natural Scene images, we randomly crop to 128 × 128, and for ImageNet images, we resize to 256 × 256 and randomly crop to 224 × 224. The only data augmentation we select is the random horizontal flip. We choose the batch size to be 256 regardless of the dataset, and choose the SGD algorithm with the same options. But the decay strategy and the number of epochs differ with the dataset. For ImageNet, training is performed for 20 epochs, and the learning rate is multiplied by 0.1 after 10 epochs. For Natural Scene, the learning rate is multiplied by 0.5 after every 10 epochs during 100 training epochs. In super-resolution, we train on the DIV2K dataset and test on the Set5 dataset. Additionally, we use the CelebA dataset for comparison with the original task of WaveSR. DIV2K has 800 diverse large images of 2k resolution, but CelebA has 162,770 images of (to some degree) center-aligned faces with a size of 178 × 218. Because of this substantial dissimilarity, we perform different data augmentations on the two datasets. The autoencoders' training options are chosen similarly to the classification. For DIV2K, we randomly crop the high-resolution image to 192 × 192 with random horizontal/vertical flips. We train LPAE using the Adam algorithm with a batch size of 4 and an initial learning rate of 0.001 that is divided by 2 after every 50 epochs during 400 epochs. To train LPSR, we set a batch size of 8 and select the Adam algorithm with an initial learning rate of 0.001 that decays to half after every 50 epochs for 300 training epochs. For the other dataset, CelebA, the training image is resized to 144 × 144 and then randomly cropped to 128 × 128 with a random horizontal flip only. LPSR is trained for 40 epochs using the Adam algorithm with a batch size of 256 and an initial learning rate of 0.01 that is multiplied by 0.1 after every 10 epochs. Options for other networks, WaveSR and WSR, are chosen to be identical to LPSR. All of our codes are available at https://github.com/sangjun7/LPAE.
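The step-wise learning-rate schedules quoted in this section all follow the same pattern, which can be sketched as below. The helper `step_decay` is hypothetical and assumes epochs are counted from 0, so the first decay takes effect at epoch `step`; it is not taken from the released code.

```python
def step_decay(initial_lr, factor, step, epoch):
    """Learning rate at a given epoch when it is multiplied by `factor` once every `step` epochs."""
    return initial_lr * factor ** (epoch // step)

# LPAE for classification: initial rate 0.01, multiplied by 0.1 after every 10 epochs.
lpae_classification = [step_decay(0.01, 0.1, 10, e) for e in (0, 9, 10, 19)]

# LPAE on DIV2K: initial rate 0.001, halved after every 50 epochs during 400 epochs.
lpae_div2k = [step_decay(0.001, 0.5, 50, e) for e in (0, 49, 50, 399)]
```

With this convention the DIV2K schedule applies seven halvings by the final epoch.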
5 Experimental Results
In this section, we examine the performance of LPAE. We first compare LPAE and WAE using the PSNR value in Table 1. Then we join the two autoencoders to the classification networks, VGG and ResNet. These networks show the efficiency and the power of LPAE through inference time and accuracy. The subsequent super-resolution results show the versatility of LPAE. By checking PSNR and SSIM values, we show that replacing the original reconstruction parts with LPAE enhances super-resolution abilities. All of the experiments are conducted on six GeForce RTX 3090 GPUs except for measuring test time. We measure the test time of classification networks on a Tesla T4 GPU on Google Colab.
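For reference, PSNR between an original and a reconstructed image is commonly computed from the mean squared error as sketched below. The peak value of 1.0 (images scaled to [0, 1]) and the function name `psnr` are assumptions made for this illustration; the exact evaluation protocol behind the reported numbers (value range, color channels) is not specified in this section.

```python
import numpy as np

def psnr(original, reconstructed, peak=1.0):
    """Peak signal-to-noise ratio in dB; `peak` is the assumed maximum pixel value."""
    mse = np.mean((original - reconstructed) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy example: a constant error of 0.1 gives an MSE of 0.01.
original = np.zeros((4, 4))
noisy = original + 0.1
value = psnr(original, noisy)
```

Since PSNR is logarithmic in the error, a gap of about 20 dB corresponds to roughly a hundredfold difference in mean squared error.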
Table 1. Comparison of power for restoring the original image after encoding and decoding between WAE, LPAE and Bicubic interpolation (Bicubic).

Dataset  | Metric     | LPAE   | WAE    | Bicubic
ImageNet | Train loss | 0.0041 | 0.0019 | –
ImageNet | Test loss  | 0.0042 | 0.0019 | –
ImageNet | PSNR (dB)  | 47.89  | 28.57  | 28.44
DIV2K    | Train loss | 0.0023 | 0.0114 | –
DIV2K    | Test loss  | 0.0024 | 0.0109 | –
DIV2K    | PSNR (dB)  | 54.73  | 19.90  | 26.19

5.1 Autoencoder
LPAE is motivated by WAE. However, there is a considerable difference between the two autoencoders in their network structure and loss formulation. As we can see in Fig. 2, LPAE produces a high-resolution detail image compared to the detail image of WAE. The larger detail image carries more abundant information on the textures and high-frequency data of the original image. Furthermore, LPAE produces a more precise approximation image than WAE, as seen in Fig. 2. This comes from the constraint on the approximation image to resemble the bicubic interpolated image. Also, changing the loss between the input image and the reconstructed image from the MSE loss to the $L_1$ loss with a large learning rate yields a high-quality reconstruction [44]. Table 1 shows the difference between the two autoencoders. The PSNR values in the table are calculated between the original image and the reconstructed image. Although PSNR alone cannot determine the quality of reconstruction, many researchers consider a large PSNR value an essential indication of better reconstruction. For Bicubic, we reduce the spatial size to half and then restore the original size using bicubic interpolation. For ImageNet, LPAE achieves a PSNR of 47.89 dB, far superior to the others, i.e., 28.57 dB for WAE and 28.44 dB for Bicubic. On the DIV2K dataset, the reconstruction of LPAE is even closer to the original than on ImageNet. The PSNR value of LPAE is 54.73 dB, and that of WAE is 19.90 dB, a significant difference. LPAE produces one compressed image and one sparse image, and based on the above results, we see that LPAE restores the original image much better. Thus LPAE can serve as an accelerator comparable to WAE and can be extended to the super-resolution problem.
5.2 Classification
As mentioned in Sect. 4, we expect that the link between LPAE and a classification network reduces the computational cost and accelerates algorithms with a slight drop in accuracy. To show this, we train the basic classification networks (VGG, ResNet) and those connected with WAE and LPAE, using two datasets
Table 2. Comparison of accuracy and training times between classification networks connected with WAE or LPAE.

Dataset       | Network | Model | Top 5 accuracy (%) | Train time (hr) | Trainable parameters
ImageNet      | VGG     | Basic | 86.94              | 30.11           | 138,365,992
ImageNet      | VGG     | WAE   | 84.96              | 27.48           | 132,914,278
ImageNet      | VGG     | LPAE  | 85.32              | 27.12           | 150,220,870
ImageNet      | ResNet  | Basic | 84.01              | 29.23           | 25,575,784
ImageNet      | ResNet  | WAE   | 79.95              | 27.68           | 29,628,406
ImageNet      | ResNet  | LPAE  | 80.31              | 27.69           | 29,633,494
Natural scene | VGG     | Basic | 89.97              | 0.55            | 138,365,992
Natural scene | VGG     | WAE   | 88.73              | 0.35            | 132,914,278
Natural scene | VGG     | LPAE  | 89.97              | 0.38            | 150,220,870
(ImageNet, Natural Scene). We use our codes for WAE. Although we are not able to reproduce the results exactly, we observe the acceleration tendency presented in [6]. Table 2 shows a comparison of performances between networks. In all cases, classification networks connected with LPAE (LPVGG, LPResNet) achieve better accuracy than those connected with WAE (WVGG, WResNet). For Natural Scene, LPVGG reaches 89.97%, the same accuracy as the original VGG. We speculate that the additional information of the original image kept after LPAE's encoding helps classification. Table 2 also reports the number of trainable parameters for each case for reference. The training times of LPVGG/LPResNet are reduced to a level similar to those of WVGG/WResNet. For ImageNet, WVGG saves about 2.63 hr compared with the basic network, and LPVGG saves about 2.99 hr, which is about 10% of the whole training time of the basic network. For Natural Scene, LPVGG reduces the training time by about 31% compared with the basic network. Table 3 shows each network's FLOPs, complexity (Compl.), and test times for different batch sizes. Our complexity here is calculated similarly to $N$ in (1) for convolution layers and fully connected layers. For instance, with VGG on ImageNet, we get an acceleration rate of $15.47/5.48 \approx 2.82$ for LPAE and $15.47/4.46 \approx 3.47$ for WAE. These numbers are similar to the rough computation of the acceleration rate in Sect. 4, i.e., 3.2 for LPAE and 3.76 for WAE. For the FLOPs of VGG, we get 31.02 billion (B) for the basic network, 11.01 B for LPVGG, and 8.95 B for WVGG. Thus the FLOPs of LPVGG and WVGG are about 0.35 and 0.29 of that of VGG. Hence, the computational cost of LPVGG, similar to WVGG, is vastly reduced. We think that this leads to the sufficient acceleration of LPVGG in the test. LPVGG takes 12.67 ms in the test with a batch of 1, which is 1.78 ms less than the basic VGG. If we raise the batch size to 20 or 50 (cf. [39]), the reduction becomes larger because LPVGG takes 80.77 ms instead of 127.07
Table 3. Comparison of FLOPs and test times between VGG or ResNet connected with WAE and LPAE. B, W, and L denote basic, WAE, and LPAE, respectively, as in Table 2. The unit of FLOPs and complexity (Compl.) is a billion (B).

Dataset       | Network | Model | FLOPs (B) | Compl. (B) | Test time (ms), batch 1 | batch 20 | batch 50
ImageNet      | VGG     | B     | 31.02     | 15.47      | 14.45                   | 127.07   | 274.55
ImageNet      | VGG     | W     | 8.95      | 4.46       | 10.84                   | 57.17    | 117.25
ImageNet      | VGG     | L     | 11.01     | 5.48       | 12.67                   | 80.77    | 171.98
ImageNet      | ResNet  | B     | 7.77      | 4.06       | 7.40                    | 61.09    | 140.10
ImageNet      | ResNet  | W     | 2.74      | 1.40       | 8.75                    | 40.86    | 88.27
ImageNet      | ResNet  | L     | 3.68      | 1.88       | 9.26                    | 56.90    | 127.46
Natural scene | VGG     | B     | 10.15     | 5.06       | 6.96                    | 41.78    | 85.88
Natural scene | VGG     | W     | 2.95      | 1.47       | 7.03                    | 21.03    | 39.45
Natural scene | VGG     | L     | 3.62      | 1.80       | 7.56                    | 27.80    | 57.68
ms for size 20, and 171.98 ms instead of 274.55 ms for size 50. Although the connection with LPAE accelerates VGG less than the connection with WAE, there is a meaningful difference in test time between LPVGG and VGG. For the other cases with a batch of 1, the results are unexpected: the basic network takes the shortest time in the test. We think these unexpected results are due to the small size of the input. For larger batches, the test time of networks connected with LPAE again becomes smaller than that of the original networks.
5.3 Super-Resolution
We evaluate super-resolution networks (WaveSR, WSR, LPSR) on two datasets (CelebA, Set5) using the metrics PSNR and SSIM. For WaveSR, we used our codes and could not reproduce the results reported in [14] despite using the same options. The values in Table 4 are obtained by applying our codes in the same environment to the three WaveletSRNet-based networks, namely, WaveSR, WSR, and LPSR. For CelebA, LPSR has the top values among the three networks for all scales. In particular, the gap from changing WPT to LPAE is largest for ×2 scale, where PSNR is 36.04 dB for LPSR and 30.87 dB for WaveSR. This shows the power of LPAE in reconstructing the image with a learning method that fits the model's parameters to the data distribution. Comparing WSR and LPSR, the PSNR value of WSR drops sharply as the scale increases. WSR for ×2 scale gets 31.93 dB, which is even
Table 4. Results for the super-resolution network based on WaveletSRNet.

Dataset | Scale | Metric    | WaveSR | WSR   | LPSR
CelebA  | ×2    | PSNR (dB) | 30.87  | 31.93 | 36.04
CelebA  | ×2    | SSIM      | 0.920  | 0.940 | 0.967
CelebA  | ×4    | PSNR (dB) | 27.71  | 17.96 | 29.15
CelebA  | ×4    | SSIM      | 0.840  | 0.513 | 0.871
CelebA  | ×8    | PSNR (dB) | 24.34  | 15.05 | 24.86
CelebA  | ×8    | SSIM      | 0.707  | 0.431 | 0.735
Set5    | ×2    | PSNR (dB) | 32.03  | 25.70 | 37.62
Set5    | ×2    | SSIM      | 0.914  | 0.781 | 0.955
Set5    | ×4    | PSNR (dB) | 28.60  | 15.28 | 32.25
Set5    | ×4    | SSIM      | 0.852  | 0.381 | 0.899
Set5    | ×8    | PSNR (dB) | 24.01  | 13.85 | 27.07
Set5    | ×8    | SSIM      | 0.650  | 0.336 | 0.782
Table 5. Comparison of PSNR values (dB) of LPSR with the state-of-the-art (SOTA) on Set5.

Method                | ×2    | ×4    | ×8
CAR [34]              | 38.94 | 33.88 | –
DRLN+ [2]             | 38.34 | 32.74 | 27.46
ABPN [29]             | –     | 32.69 | 27.25
HBPN [28]             | 38.13 | 32.55 | 27.17
DBPN-RES-MR64-3 [12]  | 38.08 | 32.65 | 27.51
MWCNN [24]            | 37.91 | 32.12 | –
CARN [1]              | 37.76 | 32.13 | –
LFFN-S [38]           | 37.66 | 31.79 | –
CSRCNN [40]           | 37.45 | 31.01 | 25.74
IKC [11]              | 36.62 | 31.52 | –
DeepRED [32]          | –     | 30.72 | 26.04
LPSR                  | 37.62 | 32.25 | 27.07
higher than the PSNR of WaveSR, but for ×4 scale, WSR drops to 17.96 dB. This result shows the limit of WAE, which does not consider multi-scale analysis and reconstruction of the image (cf. Sect. 1). The same tendency appears once more in the results for the Set5 dataset. LPSR obtains the highest PSNR and SSIM values among the three networks. Its PSNR values are 37.62 dB for ×2 scale, 32.25 dB for ×4 scale, and 27.07 dB for ×8 scale. For WSR, since Set5 is dissimilar to CelebA, which is the center-aligned data of
Fig. 3. Results of the super-resolution networks based on WaveletSRNet for Set5 dataset. Panels show LR, HR, WaveSR, WSR, and LPSR at 2×, 4×, and 8× magnification.
constant size, reconstruction by WAE falls down to 25.70 dB in ×2 scale. For ×4 or ×8 scale, it is hard to do reconstruction using WSR. Table 5 shows the comparison of our LPSR on Set5 with SOTA networks. Our PSNR values rank about the middle on average, which is good, considering the fact that LPSR is obtained by a simple change using LPAE from WaveSR. Figure 3 shows the reconstruction of WaveSR, WSR, LPSR for two images of Set5. WaveSR works well for ×2 scale, but as the scale is getting bigger, the reconstruction images of WaveSR are blurred and have a checkers pattern. However, LPSR gets highquality reconstruction images for all scales.
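To make the PSNR figures above easier to interpret, the following is a minimal sketch of the metric, assuming 8-bit images (peak value 255) flattened into plain lists of pixel values. The function names are illustrative, and the exact evaluation protocol behind Tables 4 and 5 (color space, border cropping) is not stated here, so this is not the authors' evaluation code.

```python
import math


def mse(reference, reconstruction):
    """Mean squared error between two equally sized lists of pixel values."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((a - b) ** 2 for a, b in zip(reference, reconstruction))
    return total / len(reference)


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value=255 assumes 8-bit images."""
    error = mse(reference, reconstruction)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


def mse_ratio(psnr_gap_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (psnr_gap_db / 10.0)


# CelebA, x2 scale: LPSR (36.04 dB) versus WaveSR (30.87 dB), from Table 4.
print(round(mse_ratio(36.04 - 30.87), 2))
```

Because PSNR is logarithmic, the 5.17 dB gap between LPSR and WaveSR at ×2 scale on CelebA corresponds to a mean squared error that is roughly 3.3 times smaller.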
6 Conclusion and Future Works

We construct LPAE, which helps with a range of problems from network acceleration to super-resolution. It reflects the structure of LP and consists of an encoder and a decoder. Using the encoder, we decompose an image into an approximation image with a low-resolution/low-frequency channel and a detail image with a high-frequency channel. The decoder accurately reconstructs the original image from the approximation and the detail. Three types of loss (approximation loss, sparsity loss, and reconstruction loss) enable us to obtain clear decomposed images and to achieve better reconstruction. The experiments in this paper show that LPAE makes an existing classification network lighter while preserving its original classification accuracy. For super-resolution, it achieves better performance than the established wavelet-based model. In the future, we plan to explore applications to other problems such as generative models, image compression, and character recognition.

Acknowledgments. T. Hur: This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00023, Developing a lightweight Korean text detection and recognition technology for complex disaster situations). Y. Hur: This work was supported in part by National Research Foundation of Korea (NRF) [Grant Numbers 2015R1A5A1009350 and 2021R1A2C1007598], and by the ‘Ministry of Science and ICT’ and NIPA via “HPC Support” Project.
References

1. Ahn, N., Kang, B., Sohn, K.-A.: Fast, accurate, and lightweight super-resolution with cascading residual network. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 252–268 (2018)
2. Anwar, S., Barnes, N.: Densely Residual Laplacian Super-Resolution. IEEE Trans. Pattern Anal. Mach. Intell. 44(3), 1192–1204 (2022)
3. Ardakani, A., Condo, C., Ahmadi, M., Gross, W.J.: An Architecture to Accelerate Convolution in Deep Neural Networks. IEEE Trans. Circuits Syst. I Regul. Pap. 65(4), 1349–1362 (2018)
4. Bevilacqua, M., Roumy, A., Guillemot, C., Alberi-Morel, M.-L.: Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In: Proceedings of the British Machine Vision Conference, pp. 135.1–135.10. BMVA Press (2012). ISBN 1-901725-46-4
5. Burt, P.J., Adelson, E.H.: The Laplacian Pyramid as a Compact Image Code. In: Readings in Computer Vision, pp. 671–679 (1987)
6. Chen, T., Lin, L., Zuo, W., Luo, X., Zhang, L.: Learning a wavelet-like autoencoder to accelerate deep neural networks. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI 2018, pp. 6722–6729 (2018)
7. Chu, X., Zhang, B., Ma, H., Xu, R., Li, Q.: Fast, accurate and lightweight super-resolution with neural architecture search. In: 2020 25th International Conference on Pattern Recognition, ICPR, pp. 59–64 (2021)
8. Dai, T., Cai, J., Zhang, Y., Xia, S.-T., Zhang, L.: Second-order attention network for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, pp. 11065–11074 (2019)
9. Do, M.N., Vetterli, M.: Framing Pyramids. IEEE Trans. Signal Process. 51(9), 2329–2342 (2003)
10. Doersch, C.: Tutorial on Variational Autoencoders (2021). arXiv:1606.05908
11. Gu, J., Lu, H., Zuo, W., Dong, C.: Blind super-resolution with iterative kernel correction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1604–1613 (2019)
12. Haris, M., Shakhnarovich, G., Ukita, N.: Deep back-projection networks for single image super-resolution. IEEE Trans. Pattern Anal. Mach. Intell. 43(12), 4323–4337 (2021)
13. Huang, F., Zhang, J., Zhou, C., Wang, Y., Huang, J., Zhu, L.: A deep learning algorithm using a fully connected sparse autoencoder neural network for landslide susceptibility prediction. Landslides 17(1), 217–229 (2020). https://doi.org/10.1007/s10346-019-01274-9
14. Huang, H., He, R., Sun, Z., Tan, T.: Wavelet-SRNet: a wavelet-based CNN for multi-scale face super resolution. In: Proceedings of the 2017 IEEE International Conference on Computer Vision, ICCV, pp. 1689–1697 (2017)
15. Huang, Y., Xu, Q.: Electricity theft detection based on stacked sparse denoising autoencoder. Int. J. Electr. Power Energy Syst. 125, 106448 (2021)
16. Imani, M., Garcia, R., Gupta, S., Rosing, T.: Hardware-software co-design to accelerate neural network applications. ACM J. Emer. Technol. Comput. Syst. 15(21), 1–18 (2019)
17. Islam, Z., Abdel-Aty, M., Cai, Q., Yuan, J.: Crash data augmentation using variational autoencoder. Accid. Anal. Prev. 151(1), 105950 (2021)
18. Kaggle (Photo by Jan Bottinger on Unsplash). Intel Image Classification (2018). https://www.kaggle.com/puneet6060/intel-image-classification
19. Kim, J., Lee, J.K., Lee, K.M.: Accurate image super-resolution using very deep convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1646–1654 (2016)
20. Kong, X., Zhao, H., Qiao, Y., Dong, C.: ClassSR: a general framework to accelerate super-resolution networks by data characteristic. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, pp. 12016–12025 (2021)
21. Liang, J., Zeng, H., Zhang, L.: High-resolution photorealistic image translation in real-time: a Laplacian pyramid translation network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, pp. 9392–9400 (2021)
22. Lim, B., Son, S., Kim, H., Nah, S., Lee, K.M.: Enhanced deep residual networks for single image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops, pp. 136–144 (2017)
23. Lin, T., et al.: Drafting and revision: Laplacian pyramid network for fast high-quality artistic style transfer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, pp. 5141–5150 (2021)
24. Liu, P., Zhang, H., Zhang, K., Lin, L., Zuo, W.: Multi-level wavelet-CNN for image restoration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops, pp. 773–782 (2018)
25. Liu, Y., Agarwal, S., Venkataraman, S.: AutoFreeze: automatically freezing model blocks to accelerate fine-tuning (2021). arXiv:2102.01386
26. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision, ICCV (2015)
27. Liu, Z.-S., Siu, W.-C., Wang, L.-W.: Variational autoencoder for reference based image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops, pp. 516–525 (2021)
28. Liu, Z.-S., Wang, L.-W., Li, C.-T., Siu, W.-C.: Hierarchical back projection network for image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR Workshops (2019a)
29. Liu, Z.-S., Wang, L.-W., Li, C.-T., Siu, W.-C., Chan, Y.-L.: Image super-resolution via attention based back projection networks. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop, ICCVW, pp. 3517–3525 (2019b)
30. Mahmoud, M., et al.: TensorDash: exploiting sparsity to accelerate deep neural network training. In: 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO, pp. 781–795 (2020)
31. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders (2016). arXiv:1511.05644
32. Mataev, G., Milanfar, P., Elad, M.: DeepRED: deep image prior powered by RED. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, ICCV (2019)
33. Russakovsky, O., et al.: ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vision 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
34. Sun, W., Chen, Z.: Learned image downscaling for upscaling using content adaptive resampler. IEEE Trans. Image Process. 29, 4027–4040 (2020)
35. Timofte, R., Agustsson, E., Gool, L.V., Yang, M.-H., Zhang, L., Lim, B., et al.: NTIRE 2017 challenge on single image super-resolution: methods and results. In: The IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops (2017)
36. Vahdat, A., Kautz, J.: NVAE: a deep hierarchical variational autoencoder (2021). arXiv:2007.03898
37. Wang, J., Duan, Y., Tao, X., Xu, M., Lu, J.: Semantic perceptual image compression with a Laplacian pyramid of convolutional networks. IEEE Trans. Image Process. 30, 4225–4237 (2021)
38. Yang, W., Wang, W., Zhang, X., Sun, S., Liao, Q.: Lightweight Feature Fusion Network for Single Image Super-Resolution. IEEE Signal Process. Lett. 26(4), 538–542 (2019)
39. Yapıcı, M.M., Tekerek, A., Topaloglu, N.: Performance comparison of convolutional neural network models on GPU. In: 2019 IEEE 13th International Conference on Application of Information and Communication Technologies, AICT, pp. 1–4 (2019)
40. Zhang, J., Wang, Z., Zheng, Y., Zhang, G.: Cascaded convolutional neural network for image super-resolution. In: Sun, X., Zhang, X., Xia, Z., Bertino, E. (eds.) ICAIS 2021. CCIS, vol. 1422, pp. 361–373. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-78615-1_32
41. Zhang, J., Yu, H.-F., Dhillon, I.S.: AutoAssist: a framework to accelerate training of deep neural networks. In: NIPS 2019: Proceedings of the 33rd International Conference on Neural Information Processing Systems, vol. 539, pp. 5998–6008 (2019)
42. Zhang, W., Jiao, L., Li, Y., Huang, Z., Wang, H.: Laplacian feature pyramid network for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 60, 1–14 (2021)
43. Zhang, X., Song, H., Zhang, K., Qiao, J., Liu, Q.: Single image super-resolution with enhanced Laplacian pyramid network via conditional generative adversarial learning. Neurocomputing 398, 531–538 (2020)
44. Zhao, H., Gallo, O., Frosio, I., Kautz, J.: Loss functions for image restoration with neural networks. IEEE Trans. Comput. Imaging 3(1), 47–57 (2016)
Autonomous Vision-Based UAV Landing with Collision Avoidance Using Deep Learning

Tianpei Liao1(B), Amal Haridevan2, Yibo Liu2, and Jinjun Shan2

1 Department of Electrical Engineering and Computer Science, York University, Toronto, Canada [email protected]
2 Department of Earth and Space Science and Engineering, York University, Toronto, Canada [email protected], {yorklyb,jjshan}@yorku.ca

Abstract. Autonomous vision-based landing of Unmanned Aerial Vehicles (UAVs) is an adaptive way to land in special environments, such as those where the global positioning system is denied. There is a risk of collision when multiple UAVs land simultaneously on the same platform without communication. This work accomplishes vision-based autonomous landing and uses a deep-learning-based method to realize collision avoidance during the landing process. Specifically, the landing UAVs are categorized into Level I and Level II. The YoloV4 deep learning method is implemented on the Level II UAV to detect the Level I UAV. Once the onboard camera of the Level II UAV has detected the Level I UAV's landing, the Level II UAV moves to and lands on a relative landing zone beside the Level I UAV. The experimental results show the validity and practicality of our approach.

Keywords: Vision-based · Collision avoidance · Deep learning

1 Introduction
Unmanned Aerial Vehicle (UAV) development has shown valuable potential in the delivery market. Companies such as Amazon and Alibaba have developed considerable interest in drone delivery and have started competing in testing drones to deliver packages [3]. Different UAVs serve different purposes, including the transportation of food, medical supplies, and packages. A key aspect of drone delivery is autonomous landing. Recently, researchers have shown increased interest in autonomous UAV landing using fiducial markers. The purpose of fiducial markers is to estimate the pose of the vehicle in six degrees of freedom [8]. A number of techniques, such as Apriltag [8] and Artag [5], have been developed to support autonomous landing.

The original version of this chapter was revised: Author provided belated correction has been incorporated. The correction to this chapter is available at https://doi.org/10.1007/978-3-031-10464-0_64
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022
K. Arai (Ed.): SAI 2022, LNNS 507, pp. 79–87, 2022. https://doi.org/10.1007/978-3-031-10464-0_6
When two UAVs cannot communicate and share information, that is, when there is no vehicle safety communication (VSC) [11], there is an urgent need to address the safety problems caused by collisions during autonomous landing.
Fig. 1. Diagram of proposed framework
Given the complexity of commercial UAV operations, “Free Flight” conflicts [9] can occur and lead to dangerous maneuvers. For a large-scale UAV, one of the greatest challenges is to estimate its spatial relation to another nearby UAV so that it can achieve a safe landing path. Depending on the functionality and diversity of the UAVs, different priorities need to be assigned to vehicles that land close together in space and time. In this research, we propose a strategy to resolve the risk of collision when two UAVs of different levels land on nearby paths. Our goal is for one UAV to recognize the landing path of the other so that it can wait for the other to finish landing. Figure 1 shows the diagram of the proposed framework. Apriltags are used for navigation and position estimation because of their efficiency, low false-positive rate, and robustness [8]. The two UAVs are categorized as Level I and Level II, and they land on nearby platforms. To address the problem of having no VSC [11], a vision-based collision avoidance method is introduced. An inexpensive detection algorithm is implemented to achieve real-time decision-making. Moreover, the YoloV4 [4] deep learning approach is adopted on the Level II UAV for object detection. Non-Maximum Suppression (NMS) [6,7] is used to suppress duplicate bounding boxes, which raises detection accuracy. The Level I UAV is labeled by a bounding box in the view of the Level II UAV. The Level II UAV interprets the path of the bounding box to determine whether the Level I UAV has finished landing, and it moves safely to its own landing zone once it determines that the Level I UAV has landed.
2 Proposed Method

2.1 Estimation of Position
Recent Apriltag detection has improved in detection speed and localization accuracy by implementing a continuous boundary segmentation algorithm [8]. To estimate the vehicle's 3D coordinates in the world coordinate system, the tag positions must be obtained first. Assuming the Level II UAV always sees the Apriltags, the 3D reference points of the four corners of each Apriltag must be related to their 2D projections, which is known as the Perspective-n-Point (PnP) problem. Solving the PnP problem requires a calibrated camera, so the camera intrinsic matrix A must be obtained through camera calibration. The closed-form solution is used to obtain the intrinsic parameters of A in camera calibration [10]. A primary concern in solving the PnP problem is coordinate transformation. The system involves the pixel, image, camera, and world coordinate systems. To solve the PnP problem and relate the different coordinate systems, the following relation [2] is used

$$ s\,p = A\,[R \mid t]\,P_w \tag{1} $$

where A is the 3 × 3 camera intrinsic matrix. The joint matrix [R | t] consists of the rotation R and the translation t, which are obtained from the Apriltag's coordinates and orientation. P_w represents the 3D point with respect to the world coordinate system, p denotes the corresponding 2D pixel with respect to the image coordinate system, and s is the scaling factor [2].

2.2 YoloV4 Dataset Cloud Training
Fig. 2. Example of labeling
One concern with real-time detectors is the trade-off between Graphics Processing Unit (GPU) usage and detection accuracy. The improved pipeline computation implemented in YoloV4 produces high-quality object detection in real time [4]. Therefore, YoloV4 is applied in our collision avoidance strategy to compensate for the massive memory usage during real flight. YoloV4 offers higher FPS and a more accurate Average Precision detector [4]. Figure 2 shows an example of labeling the Level I UAV in the training dataset. The output images are cropped to the custom size that matches the detector algorithm. Cloud dataset training is an innovative and convenient training method. It does not require sophisticated hardware configuration, since the environment has already been set up on the cloud server. Google Colab Pro [1], recently introduced by Google, connects to powerful hardware on the Google cloud back end and greatly speeds up the training process.

2.3 Collision Avoidance
The Level I UAV has no awareness of the Level II UAV. Because it has higher priority, it lands first, automatically, on the Apriltag placed on the ground. To obtain the Level I UAV's landing path from the Level II UAV's onboard camera, we calculate the displacement between the previous and current bounding boxes with respect to the image coordinate system. The collision avoidance procedure requires detecting that the Level I UAV has completed landing.
Fig. 3. Illustration of image coordinate system
Figure 3 shows the 2D image coordinate system from the view of the Level II UAV, where the top-left corner is the origin. After detection, each image frame is processed into a 4-dimensional blob, denoted [x, y, w, h], where x and y are the coordinates of the center of a bounding box and w and h are its width and height. One blob is considered a collection of images with the same width, height, and depth. To increase detection accuracy during real flight, we use Non-Maximum Suppression to filter out bounding boxes with poor accuracy. The filter consists of two parts. First, to reduce the number of candidates, we select only the predictions whose confidence score is higher than 0.5, which yields a new set. Let P represent the set of all predictions and P' the filtered set: $P' = \{\, p \in P \mid \mathrm{c\_score}(p) > 0.5 \,\}$.
Fig. 4. Intersection area between Bounding Box 1 and 2
Second, the prediction with the highest confidence score in P' is selected, and its Intersection over Union (IoU) with every other element of P' is computed. The IoU is the area of intersection between the highest-scoring bounding box and another bounding box in P', divided by the area of their union. Figure 4 shows the intersection area of two bounding boxes in grey. If the IoU exceeds the IoU threshold that we define, the other bounding box is removed from P'. The process is repeated until no prediction is left in P'. The IoU threshold is 0.4 in our experiment.
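The two-part filter described above can be sketched as follows. This is a minimal pure-Python illustration, assuming each prediction is a pair of a center-format box [x, y, w, h] and a confidence score, and using the thresholds stated in the text (0.5 and 0.4). The function names are illustrative and this is not the authors' implementation.

```python
def to_corners(box):
    """Convert a center-format box [x, y, w, h] to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2


def iou(box_a, box_b):
    """Intersection over Union of two center-format boxes."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(predictions, score_threshold=0.5, iou_threshold=0.4):
    """Keep the best box of each overlapping group.

    predictions: list of (box, score) pairs. Returns the kept pairs,
    highest score first.
    """
    # Part 1: discard low-confidence predictions (the filtered set P').
    candidates = sorted(
        (p for p in predictions if p[1] > score_threshold),
        key=lambda p: p[1],
        reverse=True,
    )
    kept = []
    # Part 2: repeatedly keep the best box and drop boxes overlapping it.
    while candidates:
        best = candidates.pop(0)
        kept.append(best)
        candidates = [p for p in candidates
                      if iou(best[0], p[0]) <= iou_threshold]
    return kept


detections = [
    ([100, 100, 50, 50], 0.9),
    ([105, 100, 50, 50], 0.8),   # overlaps the first box heavily
    ([300, 300, 40, 40], 0.7),   # separate object
    ([100, 100, 50, 50], 0.3),   # below the confidence threshold
]
print([score for _, score in non_max_suppression(detections)])
```

Sorting once by confidence means the first remaining candidate is always the highest-scoring one, so the loop ends exactly when no prediction is left in the candidate set.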
(a) Bounding boxes without NMS (b) Bounding box with NMS

Fig. 5. Difference between the absence and presence of NMS
Figure 5(a) shows the multiple bounding boxes with various confidence scores before NMS filtering. Figure 5(b) shows the result after NMS filtering. The bounding box coordinate in real time is represented by (x_t, y_t). The function f(t) is the distance difference between the current and previous bounding boxes

$$ f(t) = (x_{t+\Delta t} - x_t)^2 + (y_{t+\Delta t} - y_t)^2 \tag{2} $$

where (x_t, y_t) is the bounding box coordinate at time t and (x_{t+Δt}, y_{t+Δt}) is the bounding box coordinate one sampling interval later. The change along the image x-axis is typically slight, since landing mostly moves the bounding box along the y-axis. The sampling time of the vehicle is denoted Δt. The landing threshold is denoted σ and represents the minimum displacement of the landing Level I UAV on the Level II UAV image plane. If f(t) > σ, the Level II UAV determines that the Level I UAV has completed landing.
Fig. 6. Proposed procedure of autonomous landing collision avoidance
Figure 6 shows the procedure of Level II UAV autonomous landing collision avoidance. If Level II UAV detects the existence of Level I UAV, it will hover until the Level I UAV concludes landing. Otherwise, it keeps tracking the iterative images and moves to the landing zone.
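The decision logic of this procedure can be sketched as a small state tracker. The sketch assumes one bounding box center per sampled frame (None when the Level I UAV is not detected) and evaluates f(t) as the sum of squared coordinate differences, as Eq. 2 is printed; if f(t) is meant as the Euclidean distance, its square root would be compared with σ instead. The state names and function names are illustrative, not taken from the authors' code.

```python
def displacement(previous, current):
    """f(t) as printed in Eq. 2: sum of squared coordinate differences."""
    dx = current[0] - previous[0]
    dy = current[1] - previous[1]
    return dx * dx + dy * dy


def track_states(centers, sigma=150.0):
    """Return the Level II UAV state after each sampled frame.

    centers: per-frame bounding box center (x, y) of the Level I UAV,
             or None when nothing is detected.
    States:  NORMAL  - no Level I UAV seen yet, keep flying
             WAITING - State 1, hover until Level I UAV has landed
             MOVING  - State 2, move to the landing zone and land
    """
    state = "NORMAL"
    previous = None
    states = []
    for center in centers:
        if center is not None:
            if state == "NORMAL":
                state = "WAITING"
            elif (state == "WAITING" and previous is not None
                  and displacement(previous, center) > sigma):
                state = "MOVING"
            previous = center
        states.append(state)
    return states


frames = [None, (320, 100), (322, 110), (321, 300), (321, 305)]
print(track_states(frames))
```

Once the tracker reaches MOVING it stays there, which matches the one-way switch from State 1 to State 2 described for the experiment.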
3 Experiment
The Level I UAV is represented by a Quanser QDrone, which has an Omnivision OV7251 as its downward-facing camera. It is considered a heavy-duty UAV with higher priority because of its payload and weight. The Level II UAV is represented by a DJI Tello, which has a front camera with 720p HD transmission. It is considered an agile UAV with lower priority and needs to wait for the Level I UAV to land. Since the landing zones of the Level I and Level II UAVs are close, the autonomous system relies on the pre-trained detection model and machine learning algorithms. Pose estimation and automatic landing are based on fiducial markers. In the experiment, six Apriltags from the 36h11 family with ids 0–5 are used for Level II UAV pose estimation, giving a total of n = 24 corner points. The 3D points of the Apriltags with respect to the world coordinate origin need to be assigned first. For autonomous landing, the Level I UAV captures the Apriltag (id:6) with its downward-facing camera within a limited range and lands on it.
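As a rough illustration of how the 24 assigned corner points relate to pixels through Eq. (1), the sketch below projects tag corners into the image for a known pose. The intrinsic matrix, tag size, tag layout, and camera pose are all hypothetical values chosen for the example, since the experiment's actual values are not given. Pose estimation solves the inverse problem, recovering R and t from such 3D-2D correspondences, for example with OpenCV's solvePnP.

```python
def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def project_point(A, R, t, point_world):
    """Eq. (1): s p = A [R | t] P_w. Returns the pixel (u, v)."""
    rotated = mat_vec(R, point_world)
    camera = [c + ti for c, ti in zip(rotated, t)]   # R P_w + t
    su, sv, s = mat_vec(A, camera)                   # homogeneous pixel
    if s <= 0:
        raise ValueError("point is not in front of the camera")
    return su / s, sv / s


def tag_corners(center, size):
    """Four corners of a square tag lying in the plane z = center[2]."""
    cx, cy, cz = center
    half = size / 2
    return [(cx - half, cy - half, cz), (cx + half, cy - half, cz),
            (cx + half, cy + half, cz), (cx - half, cy + half, cz)]


# Hypothetical intrinsics: focal length 800 px, principal point (320, 240).
A = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
# Hypothetical pose: camera axes aligned with the world, tags 2 m ahead.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 2.0]

# Hypothetical layout: six 0.2 m tags (ids 0-5) in a 2 x 3 grid, 0.3 m apart.
centers = [(col * 0.3 - 0.3, row * 0.3 - 0.15, 0.0)
           for row in range(2) for col in range(3)]
world_points = [corner for c in centers for corner in tag_corners(c, 0.2)]

pixels = [project_point(A, R, t, p) for p in world_points]
print(len(pixels), project_point(A, R, t, (0.0, 0.0, 0.0)))
```

With the identity rotation used here, the world origin projects onto the principal point, which is a quick sanity check for the matrix layout.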
(a) State 1: Waiting
(b) State 2: Moving
Fig. 7. 2 States of Level II UAV
Figure 7 shows two different states of the Level II UAV as seen from its front camera, where the bounding box highlights the Level I UAV. The Level II UAV switches from normal flight to State 1 when it sees the Level I UAV. In State 1 it sends 0 cm/s velocity commands to the motors via remote control, which means hovering and waiting. If f(t) in Eq. 2 exceeds the landing threshold σ, it considers the Level I UAV to have landed and switches to State 2. The landing threshold σ in the experiment is 150 in pixels. State 2 is shown in Fig. 7(b).
Fig. 8. Level II UAV x,y-velocity
Figure 8 shows the x- and y-velocity of the Level II UAV with respect to the world coordinate system in cm/s. It indicates the velocities while the Level II UAV is moving to the expected location where the Apriltag is. The positive y-direction is toward the Apriltags and the positive x-direction is to the right when the Level II UAV front camera is facing the Apriltags. The velocity is proportional to the difference between the current and expected positions, so the UAV slows down as it approaches the expected landing zone and the difference decreases. After takeoff, the Level II UAV moves toward the expected landing zone by sending x- and y-velocity commands via remote control. After 11 s, object detection generates the bounding box of the Level I UAV. The Level II UAV then sends 0 cm/s velocity to all motors while it waits for the Level I UAV, which means it is in State 1. After the 37th second, it detects that the Level I UAV has completed landing and switches from State 1 to State 2, meaning there is a safe path for landing, so it moves to the landing zone and lands. To compensate for the latency of deep learning object detection, there is a time delay of one second after each velocity command is sent. The video of our experiment is available at https://youtu.be/AtY4MuwV8tM.
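The velocity behaviour described above can be sketched as a proportional controller with a waiting override. The gain, the speed limit, and the time step below are hypothetical values chosen for the example, as the experiment's actual values are not stated. Positions are in cm and velocities in cm/s.

```python
import math


def clamp(value, limit):
    """Restrict value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def velocity_command(current, target, waiting=False, gain=0.5, max_speed=30.0):
    """Proportional x,y-velocity toward the target position.

    While waiting (State 1) the command is zero, i.e. the UAV hovers.
    gain and max_speed are illustrative assumptions.
    """
    if waiting:
        return 0.0, 0.0
    vx = clamp(gain * (target[0] - current[0]), max_speed)
    vy = clamp(gain * (target[1] - current[1]), max_speed)
    return vx, vy


def simulate(start, target, steps, dt=0.1):
    """Integrate the commanded velocity and record the speed at each step."""
    position = list(start)
    speeds = []
    for _ in range(steps):
        vx, vy = velocity_command(position, target)
        position[0] += vx * dt
        position[1] += vy * dt
        speeds.append(math.hypot(vx, vy))
    return position, speeds


final_position, speeds = simulate((0.0, 0.0), (100.0, 40.0), steps=200)
print(velocity_command((0.0, 0.0), (100.0, 40.0)))
```

Because the command scales with the remaining position error, the simulated speed falls toward zero as the vehicle nears the target, which is the slowing-down behaviour reported for Fig. 8.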
4 Conclusion

This project was undertaken to evaluate the risk of collision among multiple UAVs that have no vehicle safety communication (VSC) [11] and to design a vision-based collision avoidance method for the autonomous landing of two different levels of UAV. The present study lays the groundwork for future research into real-time collision avoidance decision-making using object detection bounding boxes.

5 Future Study

Further studies need to be carried out to validate the prediction of object movement. Future work will handle more complex trajectories and compensate for bounding boxes with low confidence scores.
References

1. Google Colab. https://colab.research.google.com (accessed 9 August 2021)
2. OpenCV: Camera calibration and 3D reconstruction. https://docs.opencv.org/3.4/d9/d0c/group__calib3d.html (accessed 9 August 2021)
3. Kharpal, A.: Alibaba tests drone deliveries after Amazon push (May 2015)
4. Bochkovskiy, A., Wang, C.-Y., Mark Liao, H.-Y.: Yolov4: optimal speed and accuracy of object detection (2020). arXiv preprint, arXiv:2004.10934
5. Fiala, M.: Artag, a fiducial marker system using digital techniques. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. 2, pp. 590–596 (2005)
6. Mao, Q.-C., Sun, H.-M., Zuo, L.-Q., Jia, R.-S.: Finding every car: a traffic surveillance multi-scale vehicle object detection method. Appl. Intell. 50(10), 3125–3136 (2020). https://doi.org/10.1007/s10489-020-01704-5
7. Tripathi, R., Singla, V., Najibi, M., Singh, B., Sharma, A., Davis, L.: Asap-nms: accelerating non-maximum suppression using spatially aware priors (2020). arXiv preprint, arXiv:2007.09785
8. Wang, J., Olson, E.: Apriltag 2: efficient and robust fiducial detection. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS, pp. 4193–4198 (2016)
9. Yang, L.C., Kuchar, J.K.: Prototype conflict alerting system for free flight. J. Guidance Control Dynam. 20, 768–773 (1997)
10. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22, 1330–1334 (2000)
11. Zimmer, M.: Surveillance, privacy and the ethics of vehicle safety communication technologies. Ethics Inf. Technol. 7(4), 201–210 (2005)
The Current State of the Art in Deep Learning for Image Classification: A Review

Adam Byerly1,2(B), Tatiana Kalganova1, and Richard Ott3

1 Department of Electronic and Electrical Engineering, Brunel University London, Uxbridge UB8 3PH, UK [email protected]
2 Department of Computer Science and Information Systems, Bradley University, Peoria, IL 61615, USA
3 Air Force Research Laboratory, Sensors Directorate, 2241 Avionics Cir, WPAFB, Dayton, OH 45433, USA

Abstract. We present a review of the methods behind the top 40 highest accuracies achieved on the ILSVRC 2012 Imagenet validation set as ranked on Papers with Code. A significant proportion of these methods involve using transformer-based architectures, but it should be noted that none of the methods are naïve self-attention transformers, which would be unmanageably large if the tokens were derived on a per-pixel basis. Rather, the works we review here all grapple with different methods of combining the global nature of self-attention with the local nature of fine-grained image features, which have historically been the strength of convolutional neural networks. However, it should be noted that 9 out of 22 works reviewed did NOT use transformers.

Keywords: State of the Art · Imagenet · Papers with code · Transformers · Convolutional neural networks

1 Introduction
In [4], the authors pose the question: “Are we done with ImageNet?”. They ask whether the recent progress on the Imagenet-1K [11] evaluation benchmark reflects continuing improvement in generalization or is the result of us (the deep learning for image classification community) learning some latent properties of the labeling procedure. The latter possibility is interesting, and in their work they provide a careful analysis and a better set of labels, which we should all consider using going forward. For now, however, the original labels remain the standard benchmark and the means by which comparisons among the best models are made. Papers with Code [33] has become the best-known record of the state-of-the-art methods for all types of deep learning tasks, including image classification. On Papers with Code, in the case of Imagenet, performance is ranked by the top-1 accuracy achieved. In this review, we examine the technologies behind the top 40 best-ranked accuracies, which are reported in 22 papers (some papers present multiple models, which rank multiple times).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): SAI 2022, LNNS 507, pp. 88–105, 2022. https://doi.org/10.1007/978-3-031-10464-0_7

2 Transformer-Based Networks
Since [52] and later [12], transformer networks have been dominating NLP deep learning tasks. As such, computer vision researchers have been looking into ways to transfer that success to their domain. They have done so with a fair amount of success, with the caveat that such success has in most cases required unprecedentedly large networks with unprecedentedly large sets of additional training data. The fact that in this review we will encounter non-transformer-based networks trained without additional training data that are competitive with these networks suggests that it remains an open question whether transformer-based networks will entirely supplant convolutional neural networks in computer vision tasks. See Table 1 for a comparison of the transformer-based models reviewed here.

In [15], the authors introduce ViT. ViT is currently the vision transformer network that most recent transformer networks compare themselves to or use as a basis for their designs. Inspired by the success of transformers applied to the NLP domain, the authors endeavored to create a network for the vision domain out of transformers without any convolutions and, in their own words, “with the fewest possible modifications” to existing transformer designs. The authors note that applying self-attention naively to entire images means attending every pixel to every other pixel, which has quadratic complexity relative to the image's size and would not scale well to usable input sizes. The insight they leveraged was that 16 × 16 patches of an image could be treated much like words are treated in NLP applications. Prior attempts at fully transformer-based networks [39] failed to achieve competitive ImageNet-1k evaluation accuracies because they did not scale up the network's parameters and the additional training data.
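The scaling argument behind 16 × 16 patches can be made concrete with a short calculation. The sketch below assumes a 224 × 224 input, a resolution chosen for illustration and not stated in the text above, and counts query-key pairs in one self-attention layer. The helper names are illustrative.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch x patch tokens in an image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def attention_pairs(tokens):
    """Self-attention compares every token with every token."""
    return tokens * tokens


def split_into_patches(image, patch):
    """Split an image (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


per_pixel = num_tokens(224, 224, 1)    # one token per pixel
per_patch = num_tokens(224, 224, 16)   # one token per 16 x 16 patch
print(per_pixel, attention_pairs(per_pixel))
print(per_patch, attention_pairs(per_patch))

# A tiny 4 x 4 "image" split into 2 x 2 patches, to show the token layout.
tiny = [[r * 4 + c for c in range(4)] for r in range(4)]
print(split_into_patches(tiny, 2))
```

Under these assumptions, moving from per-pixel tokens to 16 × 16 patches cuts the number of tokens by a factor of 256 and the number of attention pairs by a factor of 256² = 65,536.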
Again, in their own words, the authors discovered that “large scale training trumps inductive bias”, the inductive bias being that which is introduced by convolutions.
In [59], the authors conducted a systematic study of the relationships between data size, compute budget, and achieved accuracy across a spectrum of ViT models [15]. Unsurprisingly, they discovered that bigger models with larger compute budgets result in higher accuracies, with the caveat that there must be sufficient data to train the model. In the largest models they studied, even 300M samples were insufficient to saturate the models’ achievable accuracy. Additionally, they found that the larger models were more sample efficient, meaning they achieve the same accuracy as smaller models after training for fewer steps. Another important observation that the authors made was that for more than two orders of magnitude, compute budget and accuracy followed a power-law, and at the high end of the compute budget, the largest models were not tending toward perfect accuracy, suggesting that a model with infinite capacity would achieve less than perfect accuracy. The authors noted that similar effects have been observed in generative models, where the authors of that work referred to this phenomenon as the “irreducible entropy” of the task [20]. This further supports the hypothesis that there is a ceiling on the achievable accuracy for the ILSVRC 2012 Imagenet validation set [4]. They observed a similar saturation at the lower end of the compute budget scale, where smaller models achieved better accuracy than the power-law would predict.

Table 1. Transformer-based networks

| Rank | Top-1 Acc. | # of Params. | Addt’l. training samples | Paper title |
|---|---|---|---|---|
| 2 | 90.45% | 1843M | 3B | Scaling Vision Transformers [59] |
| 4 | 90.35% | 14700M | 3B | Scaling Vision with Sparse Mixture of Experts [40] |
| 8 | 88.87% | 460M | 300M | TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? [41] |
| 9 | 88.64% | 480M | 1.8B | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [29] |
| 11 | 88.55% | 632M | 300M | An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale [15] |
| 14 | 88.36% | 7200M | 300M | Scaling Vision with Sparse Mixture of Experts [40] |
| 15 | 88.23% | 2700M | 300M | Scaling Vision with Sparse Mixture of Experts [40] |
| 16 | 88.08% | 656M | 300M | Scaling Vision with Sparse Mixture of Experts [40] |
| 18 | 87.76% | 307M | 300M | An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale [15] |
| 20 | 87.54% | 928M | 14M | Big Transfer (BiT): General Visual Representation Learning [34] |
| 21 | 87.5% | 173M | 14M | CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows [13] |
| 22 | 87.41% | 3400M | 300M | Scaling Vision with Sparse Mixture of Experts [40] |
| 24 | 87.3% | 197M | 14M | Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [36] |
| 26 | 87.1% | 296M | 0 | VOLO: Vision Outlooker for Visual Recognition [58] |
| 27 | 86.8% | 193M | 0 | VOLO: Vision Outlooker for Visual Recognition [58] |
| 32 | 86.5% | 356M | 0 | Going deeper with Image Transformers [49] |
| 35 | 86.4% | 150M | 0 | All Tokens Matter: Token Labeling for Training Better Vision Transformers [30] |
| 37 | 86.3% | 271M | 0 | Going deeper with Image Transformers [49] |
| 38 | 86.3% | 307M | 0 | BEiT: BERT Pre-Training of Image Transformers [2] |
| 39 | 86.3% | 86M | 0 | VOLO: Vision Outlooker for Visual Recognition [58] |
| 40 | 86.1% | 271M | 0 | Going deeper with Image Transformers [49] |

Mixture of Experts (MOE) is a method of combining the outputs of multiple sub-models, called experts, using a router mechanism. These have been studied since the early 1990s [6,27,31]. More recently, they have been applied to computer vision tasks [1]. In [40], the authors endeavored to combine MOEs with transformers. They designed a network that contains a large number of parameters, not all of which are used during inference, and they demonstrate the network’s ability to achieve competitive results while using as little as
half of the computational power available in the network on any sample. Interestingly, the router mechanism they designed doesn’t route entire images, but rather individual patches of the images, so that different transformers in the network operate on different patches, possibly of a single image. Additionally, they created a fixed buffer size per expert in their mixture to encourage load balancing among the experts, which discourages the overall model from favoring only a small subset of the experts.
The network designed by the authors of [41] is another design based on transformers. Transformers for visual tasks work by splitting the input into patches. The authors noted that in most cases, of the 200–500 patches produced for images of typical training sizes, about 8 or 16 of them were the most informative. They propose a mechanism that they call “TokenLearner” which, prior to the transformer block, learns which patches are significant and passes only those to the transformer. In so doing, they were able to reduce the total number of FLOPs by half and maintain classification accuracy. In addition to the TokenLearner module that precedes the transformer block, they devised a “TokenFuser” module that follows the transformer block and maps the result of the transformer operation back to the input’s original spatial resolution. This allows the input and output of the set of operations to maintain the same tensor shape, making them easier to fit into a model’s overall architecture.
In [13], the authors grapple with the fact that in transformer-based architectures for vision tasks, global self-attention is an extremely expensive operation (quadratic in complexity) compared to local self-attention, which limits interactions between tokens. Their attempt to find a middle option is to introduce what they term a “Cross-Shaped Window” (CSWin), which is an attention mechanism that involves computing self-attention for vertical and horizontal stripes of the input image in parallel. In addition, they introduce a new positional encoding scheme they call “Locally-enhanced Positional Encoding” (LePE), which they claim, “handles the local positional information better than existing encoding schemes”, “naturally supports arbitrary input resolutions”, and is “especially effective and friendly for downstream tasks”. Absolute positional encoding (APE) [52] and conditional positional encoding (CPE) [8] are applied to the input before the transformer block. LePE instead moves the encoding inside the transformer block, as relative positional encoding (RPE) [36,43] does. But rather than happening inside the SoftMax operation that uses the queries, keys, and values, LePE is applied directly to the values only.
The work in [36] precedes and is cited by [13], and the two papers share an author. The approaches are also quite similar, though the leap from a network of transformers like those present in ViT to what the authors propose in this work is a little more apparent. In this work, the authors note that the spatial positions of the image patches (the tokens) used by all layers in ViT are the same. The authors argue that it is better to think of how the patches are divided up as being subject to a window that shifts across the image in subsequent layers. This allows for connections between overlapping regions in
the image to be learned by combinations of transformers. This network is trained entirely on publicly available data, using the 14M image ImageNet-22k dataset for additional training data.
The authors of [49] start with a network similar to ViT, consisting of a series of transformer blocks with residual connections between them. They then alter this design in two specific ways. They posit that a problem with the ViT architecture is that passing the class token along with the image patches through every transformer layer asks the optimizer to optimize two contradictory objectives: learning the self-attention for the patches and learning the information that leads to the correct classification. In order to combat this, they propose using two different processing stages, the first of which is not passed the class token so that the transformers in this stage can focus solely on learning the self-attention, and only in this stage does the self-attention get updated. In the second stage, the class token is added, and the transformers begin learning the classification. Additionally, they added a learnable diagonal matrix they call “LayerScale”, by which they multiply the output of a transformer block before adding it to the residual path that skipped over that transformer block. They refer to this architecture as CaiT (Class-Attention in Image Transformers). This network is trained without using any additional training data.
In [30], the authors propose a method they call “token labeling”. The idea behind it is to have each token coming out of a transformer block learn a K-dimensional vector for the classification of that specific patch, where K is the number of classes, and the vector components represent the probabilities of that patch belonging to each class. For the final classification, these are averaged together across the patches and then combined with the overall image class to form a final prediction.
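The aggregation step of token labeling can be sketched as follows. This is an illustrative reading of the description above, not the code from [30]. The function name `token_labeling_predict`, the convex combination, and its `patch_weight` of 0.5 are assumptions made only for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_labeling_predict(patch_logits, class_logits, patch_weight=0.5):
    """patch_logits: (N, K) scores per patch; class_logits: (K,) image-level scores."""
    patch_probs = softmax(patch_logits)          # one K-dimensional distribution per patch
    mean_patch_probs = patch_probs.mean(axis=0)  # average across the N patches
    class_probs = softmax(class_logits)          # overall image-level prediction
    # Hypothetical combination rule: a weighted average of the two distributions.
    return (1.0 - patch_weight) * class_probs + patch_weight * mean_patch_probs

patch_logits = np.array([[4.0, 0.0, 0.0], [3.0, 1.0, 0.0]])  # N = 2 patches, K = 3 classes
class_logits = np.array([2.0, 1.0, 0.0])
probs = token_labeling_predict(patch_logits, class_logits)
print(probs, probs.argmax())
```

Because both inputs to the combination are probability distributions and the weights sum to one, the result is itself a distribution over the K classes.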
A drawback to this method is that the per-patch class probabilities for every image must be generated and stored beforehand. This network is trained without using any additional training data.
The authors of [2] attempt to take the methods of BERT [12], which are applied to the natural language processing (NLP) domain, and apply them to the vision domain. They call their attempt BEiT. Doing this requires a pre-pretraining step that creates discrete token values for each patch of each image via an autoencoder. Then, during pre-training, a transformer-based BEiTEncoder is trained to encode image patches into their corresponding tokens, with the caveat that some of the image patches fed into the network are masked out. Then for the final task of image classification, the pre-trained model has an additional classifier network appended. This network is trained without using any additional training data.
The authors of [58] took note of the fact that all of the best-performing transformer-based vision models were using large amounts of additional training data to achieve their results. This motivated them to study the use of transformers while training on only the actual Imagenet-1k training data. They found that a major factor in this is the large patch sizes (typically 16 × 16 or 14 × 14) that most transformer architectures use due to their quadratic complexity.
The authors posit that this fails to encode sufficiently fine-grained information. Their solution, which at first seems counter-intuitive, is to increase the patch size to 28 × 28, which for images of size 224 × 224 means an 8 × 8 embedding. Then, within each of those patches, they use a sliding window attention mechanism to relate the fine-grained information within those patches together. A series of these transformer blocks make up the first stage of their design. The second stage of their design is to split each of those embeddings into 2 × 2 embeddings of size 14 × 14 and again apply the sliding window attention mechanism. This network is trained without using any additional training data and is the highest-ranked network to do so.
In their own words, the authors of [34] “aim not to introduce a new component or complexity, but to provide a recipe that uses the minimal number of tricks yet attains excellent performance on many tasks”. They refer to this “recipe” as Big Transfer (BiT). In their work, they show that BiT can be pre-trained once and then fine-tuned quite cheaply on the task it is transferred to using a simple heuristic for choosing the hyperparameters for the fine-tuning training. They call this heuristic the “BiT-HyperRule”. In their study, they found that they could limit the hyperparameters that need to be fine-tuned to the learning rate schedule and whether or not to use MixUp [60] after transferring. The first step in their heuristic is to categorize the size of the dataset they are transferring to. They classify datasets with fewer than 20k labeled examples as small, datasets with more than that but fewer than 500k labeled examples as medium, and everything else as large. Then after transfer, they train for 500 steps on small datasets, 10,000 steps on medium datasets, and 20,000 steps on large datasets, decaying the learning rate by a factor of 10 after 30%, 60%, and 90% of the training steps. They use MixUp with α = 0.1 for medium and large datasets.
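The heuristic described above is small enough to write down directly. The following sketch encodes only the values stated in the text. The function names, the dictionary layout, the `base_lr` parameter, and the treatment of a dataset with exactly 20k examples as medium are assumptions of this example rather than details from [34].

```python
def bit_hyperrule(num_examples):
    """Return the fine-tuning schedule for a dataset with the given number of labeled examples."""
    if num_examples < 20_000:
        size, steps, mixup_alpha = "small", 500, None
    elif num_examples < 500_000:
        size, steps, mixup_alpha = "medium", 10_000, 0.1
    else:
        size, steps, mixup_alpha = "large", 20_000, 0.1
    # The learning rate is divided by 10 after 30%, 60%, and 90% of the steps.
    lr_decay_steps = [steps * pct // 100 for pct in (30, 60, 90)]
    return {"size": size, "steps": steps,
            "lr_decay_steps": lr_decay_steps, "mixup_alpha": mixup_alpha}

def learning_rate(step, base_lr, lr_decay_steps):
    """Piecewise-constant schedule: divide base_lr by 10 at each boundary already reached."""
    passed = sum(step >= boundary for boundary in lr_decay_steps)
    return base_lr / (10 ** passed)

rule = bit_hyperrule(10_000)  # a hypothetical dataset with 10k labeled examples
print(rule)
print(learning_rate(300, base_lr=0.01, lr_decay_steps=rule["lr_decay_steps"]))
```

The only input to the rule is the dataset size, which is what makes the transfer step cheap: no per-task hyperparameter search is implied by the description.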
The network they designed is based on ResNet-v2 [21], but instead of using Batch Normalization (BN), they use Group Normalization (GN) [54] and add Weight Standardization (WS) [38] to all of the convolutions. The authors argue that batch normalization is a poor choice for transfer learning due to the requirement to update running statistics and show that the combination of GN and WS has a significant positive impact on transfer learning tasks. This network is trained entirely on publicly available data, using the 14M image ImageNet-22k dataset for additional training data.
3 Transformer/Convolution Hybrid Networks
Two of the works we reviewed, including the top-ranking design, endeavored to use a combination of transformers and convolutions in their designs. See Table 2 for a comparison of the transformer/convolution hybrid networks reviewed here.

Table 2. Transformer/convolution hybrid networks

| Rank | Top-1 Acc. | # of Params. | Addt’l. training samples | Paper title |
|---|---|---|---|---|
| 1 | 90.88% | 2440M | 3B | CoAtNet: Marrying Convolution and Attention for All Data Sizes [10] |
| 3 | 90.45% | 1470M | 3B | CoAtNet: Marrying Convolution and Attention for All Data Sizes [10] |
| 19 | 87.7% | 277M | 14M | CvT: Introducing Convolutions to Vision Transformers [53] |

The authors of [10] note that convolutional neural networks perform well due to their natural locality bias and tend to generalize well and converge relatively quickly, whereas networks employing transformers perform well because of their ability to find global interactions more easily than CNNs but have been shown to require much more data and many more parameters. In their work, the authors endeavored to combine the benefits of both convolution and attention by summing a global static convolution kernel with the attention matrix prior to the Softmax normalization inside the transformer block’s attention heads. They refer to this as relative attention. Because the global context required for relative attention has a quadratic complexity with respect to the spatial size of the input, the direct application of relative attention to the raw image is not computationally tractable. Thus, their overall network architecture begins with an initial stem of traditional convolutional operations, which they refer to as stage 0, that down-samples the input image to feature maps half of the original image’s size. Then, stages 1 and 2 are Squeeze and Excitation [23] blocks that each further reduce the size of the feature maps by half. It is at this point that the feature maps have attained a size that relative attention is able to cope with. As such, stages 3 and 4 are made up of a series of relative attention transformer blocks before the network goes on to a final global pooling and fully connected layer that leads to the output classification probabilities. Residual connections are made between each stage and before the feed-forward network of each transformer block. The authors pre-trained their networks on Google’s internal JFT-3B dataset [59], which, as the name implies, consists of 3 billion images. It is worth noting that training their best performing network took 20.1K TPUv3-core days.
The authors of [53] start with ViT as a basis for their design and then introduce three changes. First, at the beginning of each transformer, they introduce what they call a convolutional token embedding, which involves reshaping the token sequence going into the transformer back into its 2D spatial positions and performing an overlapping, strided convolution.
Then, they replace the linear projection before each self-attention block with what they call “convolutional projection”, which uses depth-wise separable convolutions [32] on the 2D-reshaped token map. This replaces the linear projection used by ViT that is applied to the query, key, and value embeddings. Finally, they remove the positional encoding that is usually present in the first stage of a transformer block. The question regarding the necessity of positional encoding in transformers used for vision tasks had been previously raised and studied [8]. Notably, this is the
highest rank achieved using less than 300M additional training samples, as well as being the highest-ranking design to use a public dataset (Imagenet-22k) for its additional 14M samples of training data.
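Returning to the relative attention of [10] described earlier in this section, the idea of adding a static kernel to the attention logits before the Softmax can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors’ implementation. It uses a 1D token sequence, a single head, and no query, key, or value projections, and the function name `relative_attention` is hypothetical.

```python
import numpy as np

def relative_attention(x, w):
    """x: (n, d) tokens; w: (2n - 1,) static kernel indexed by the relative offset i - j."""
    n = x.shape[0]
    logits = x @ x.T                                          # input-dependent attention logits
    offsets = np.arange(n)[:, None] - np.arange(n)[None, :]   # i - j, from -(n - 1) to n - 1
    logits = logits + w[offsets + n - 1]                      # add the static kernel before the softmax
    logits = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(5, 4))        # 5 tokens with 4 features each (hypothetical sizes)
plain = relative_attention(x, np.zeros(9))     # a zero kernel reduces to ordinary self-attention

w = np.zeros(9)
w[4] = 50.0                                    # a kernel that strongly favors offset 0
local = relative_attention(x, w)               # each token now attends almost only to itself
```

The two calls show the two extremes the kernel interpolates between: with a zero kernel the weights depend only on the input, and with a sharply peaked kernel the operation behaves like a fixed, translation-invariant filter.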
4 EfficientNet Networks
EfficientNet [46] is a model family that consists of progressively larger models which have been optimized for computation and parameter efficiency using Neural Architecture Search [61], which is a reinforcement learning method that learns the best neural network architecture to use for a given task. See Table 3 for a comparison of the EfficientNet networks reviewed here.

Table 3. EfficientNet networks

| Rank | Top-1 Acc. | # of Params. | Addt’l. training samples | Paper title |
|---|---|---|---|---|
| 12 | 88.5% | 480M | 300M | Fixing the train-test resolution discrepancy: FixEfficientNet [51] |
| 23 | 87.3% | 208M | 14M | EfficientNetV2: Smaller Models and Faster Training [47] |
| 25 | 87.1% | 66M | 300M | Fixing the train-test resolution discrepancy: FixEfficientNet [51] |
| 28 | 86.8% | 120M | 14M | EfficientNetV2: Smaller Models and Faster Training [47] |
| 30 | 86.7% | 43M | 300M | Fixing the train-test resolution discrepancy: FixEfficientNet [51] |
| 34 | 86.4% | 30M | 300M | Fixing the train-test resolution discrepancy: FixEfficientNet [51] |

In [47], the authors of the original EfficientNet paper continue their work by introducing EfficientNetV2. In their study, they argue that the scale of regularization needs to be proportional to the original image size of the dataset’s images. This includes varying the regularization on a single network design based on the original image size of the dataset it is being trained with. Networks that work with smaller images should use less regularization, and networks that work with larger images should use more regularization. In their prior work, the authors scaled up the number of layers in every stage of their network by the same factor. In this study, they show that gradually adding additional layers in the later stages is superior. Their prior work achieved the then state-of-the-art top-1 accuracy of 84.4%. This extension to that work achieved 87.3% top-1 accuracy, an absolute improvement of 2.9 percentage points.
The work in [51] can appropriately be seen as an extension of their earlier work [50]. In both papers, the authors note that there exists a discrepancy between the prevalent data pre-processing operations during training vs. evaluation. It is common to extract random rectangles from training images and scale them to a certain size each epoch as a form of data augmentation, but during evaluation, the common practice is to choose a central crop of equivalent size. This differing approach during training and evaluation results in varying typical scales of the objects trained on compared to objects of the same class during evaluation, and crucially, unlike with the case of translation, CNNs do not respond to scale differences in a predictable manner. In both works, the authors combat the scale discrepancy by allowing the network to learn how to resize the images during both training and evaluation. The details of the method by which they accomplish this are quite involved and beyond the scope of this paper. The interested reader is referred to the original works. In the first paper, the authors applied their method to ResNet networks and trained only with the 1.2M training images that are a part of the standard Imagenet-1k training set. In the second paper, they applied their method to EfficientNet [46] networks and used the standard Imagenet-1k training set with an additional 300M images for training.
5 Using Neither Transformers nor Convolutions
A single network using neither transformers nor convolutions ranks among the top 40 state-of-the-art networks we reviewed (see Table 4).

Table 4. Networks using neither transformers nor convolutions

| Rank | Top-1 Acc. | # of Params. | Addt’l. training samples | Paper title |
|---|---|---|---|---|
| 17 | 87.94% | 431M | 300M | MLP-Mixer: An all-MLP Architecture for Vision [48] |

The authors of [48] begin their introduction with the observation that “As the history of computer vision demonstrates, the availability of larger datasets coupled with increased computational capacity often leads to a paradigm shift”. Ironically, their architecture avoids the canonical paradigm-shifting methods of convolutions and transformers and instead is made up entirely of simple multi-layer perceptrons (MLPs). Their architecture uses exclusively matrix multiplication, reshaping and transposition, and scalar nonlinearities. They use two different types of MLP layers: one that works independently on image patches and “mixes” the per-location features, and one that works across patches and “mixes” spatial information. They build their architecture from a series of “Mixer” layers, each of which is made up of one of each of the two types of “mixer” MLPs, each of which is two fully-connected layers and a GELU [19] nonlinearity. Mixer layers also include residual connections around the mixing sub-layers.
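A single Mixer layer as described above can be sketched with NumPy. This is a simplified illustration, not the implementation from [48]: layer normalization, which the original architecture also uses, is omitted for brevity, the GELU uses the common tanh approximation, and all names and sizes are assumptions of this example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp(x, w1, b1, w2, b2):
    """Two fully connected layers with a GELU in between, applied along the last axis."""
    return gelu(x @ w1 + b1) @ w2 + b2

def mixer_layer(x, token_params, channel_params):
    """x: (patches, channels). Each sub-layer is wrapped in a residual connection."""
    # Token mixing: transpose so the MLP acts across patches, mixing spatial information.
    x = x + mlp(x.T, *token_params).T
    # Channel mixing: the MLP acts on each patch independently, mixing per-location features.
    x = x + mlp(x, *channel_params)
    return x

def init_mlp(rng, dim, hidden):
    return (rng.normal(0.0, 0.02, (dim, hidden)), np.zeros(hidden),
            rng.normal(0.0, 0.02, (hidden, dim)), np.zeros(dim))

rng = np.random.default_rng(0)
patches, channels = 196, 64  # hypothetical sizes
x = rng.normal(size=(patches, channels))
y = mixer_layer(x, init_mlp(rng, patches, 128), init_mlp(rng, channels, 256))
print(y.shape)
```

The transpose is the only thing that distinguishes the two sub-layers, which reflects the claim that the architecture needs nothing beyond matrix multiplication, reshaping and transposition, and scalar nonlinearities.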
6 Teacher-Student Networks
Using teacher and student networks is arguably more of a training method than a network design. The overarching idea is that the two networks (a) have closely related but nonetheless different goals or information and (b) either feed from the teacher to the student in a directed manner or to each other in a cyclic manner. See Table 5 for a comparison of the teacher-student networks reviewed here.

Table 5. Teacher-student networks

| Rank | Top-1 Acc. | # of Params. | Addt’l. training samples | Paper title |
|---|---|---|---|---|
| 5 | 90.2% | 480M | 300M | Meta Pseudo Labels [37] |
| 6 | 90% | 390M | 300M | Meta Pseudo Labels [37] |
| 13 | 88.4% | 480M | 300M | Self-training with Noisy Student improves ImageNet classification [55] |

Pseudo-labeling [14] involves using a teacher network that generates pseudo-labels on unlabeled data that is fed into a student network in tandem with labeled data. Eventually, the student outperforms the teacher. In [37], the authors extended this idea by allowing the teacher to receive feedback from the student and then to adapt. Specifically, how well the student performs on the labeled data is fed back to the teacher as a reward signal for the quality of the pseudo-labels it generated. This surprisingly simple idea leads to the highest-ranked design that we reviewed that does not use transformers.
The work presented in [55] is clearly the prior stepping stone that led to [37] reviewed above, as the methods described are quite similar, and the papers share three authors. The first key difference in this paper is the attention they pay to the role of noise in the teacher-student training process, thus the name NoisyStudent. They never inject noise into the teacher model so that when it generates pseudo-labels, those labels are as accurate as possible. However, when training the student, they inject considerable noise using RandAugment [9], dropout [45], and stochastic depth [24]. The second key difference is that, rather than having the single student feed back to the single teacher, the authors follow a self-training framework [42] consisting of three steps. The first step is training the teacher with labeled data. The second step is to generate pseudo-labels for unlabeled data with the teacher. The third step is to train the student with a mixture of labeled and pseudo-labeled data. These steps are repeated several times, each time promoting the prior student to be the new teacher and creating a new student model. The authors compare their method to Knowledge Distillation [21] but note that in that work the student was often smaller so that it could infer faster, and noise was not injected so aggressively. They say that their method could be thought of as Knowledge Expansion in that the student is larger, with greater capacity, and taught in a difficult environment made up of more noise.
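The three-step loop above can be sketched with a deliberately tiny stand-in for the networks. Everything model-specific here is a toy assumption: the “model” is a threshold on a scalar input, the student’s noise is Gaussian jitter rather than RandAugment, dropout, or stochastic depth, and the growth in student capacity is not modeled. Only the control flow mirrors the description.

```python
import random

def train(examples, noisy, rng):
    """Toy 1-D 'model': a threshold halfway between the two class means.
    When noisy is True, inputs are jittered to stand in for the student's noise."""
    sums, counts = [0.0, 0.0], [0, 0]
    for x, y in examples:
        if noisy:
            x += rng.gauss(0.0, 0.5)
        sums[y] += x
        counts[y] += 1
    threshold = (sums[0] / counts[0] + sums[1] / counts[1]) / 2
    return lambda value: int(value > threshold)

def noisy_student(labeled, unlabeled, rounds, rng):
    teacher = train(labeled, noisy=False, rng=rng)        # step 1: teacher on labeled data, no noise
    for _ in range(rounds):
        pseudo = [(x, teacher(x)) for x in unlabeled]     # step 2: teacher generates pseudo-labels
        student = train(labeled + pseudo, noisy=True, rng=rng)  # step 3: noised student on both sets
        teacher = student                                 # promote the student to be the new teacher
    return teacher

rng = random.Random(0)
labeled = [(-2.0, 0), (-1.5, 0), (1.5, 1), (2.0, 1)]
unlabeled = ([rng.gauss(-2.0, 0.5) for _ in range(50)]
             + [rng.gauss(2.0, 0.5) for _ in range(50)])
model = noisy_student(labeled, unlabeled, rounds=3, rng=rng)
print(model(-3.0), model(3.0))
```

Note that the teacher is always applied without noise when it labels the unlabeled data, while every student is trained with noise, matching the asymmetry the authors emphasize.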
7 Innovations Related to Training Procedures
In the remaining works we review, the authors credit their achievement of state-of-the-art results not to the design of the network they used, but rather to other innovations related to the training of the networks (see Table 6).

Table 6. Innovations related to training procedures

| Rank | Top-1 Acc. | # of Params. | Addt’l. training samples | Paper title |
|---|---|---|---|---|
| 7 | 89.2% | 527M | 300M | High-Performance Large-Scale Image Recognition Without Normalization [5] |
| 9 | 88.64% | 480M | 1.8B | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [29] |
| 10 | 88.61% | 480M | 300M | Sharpness-Aware Minimization for Efficiently Improving Generalization [16] |
| 29 | 86.78% | 377M | 0 | Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error [17] |
| 31 | 86.5% | 438M | 0 | High-Performance Large-Scale Image Recognition Without Normalization [5] |
| 33 | 86.45% | 255M | 0 | Drawing Multiple Augmentation Samples Per Image During Training Efficiently Decreases Test Error [17] |
| 36 | 86.3% | 377M | 0 | High-Performance Large-Scale Image Recognition Without Normalization [5] |

In [29], the authors note that there are no publicly available labeled datasets of the order being used by many of the state-of-the-art network designs (e.g., JFT-300M and JFT-3B). This is due in large part to how costly and labor-intensive it is to curate such datasets. In their work, they describe a process of downloading 1.8B images accompanied by alt-text from the internet, and rather than doing labor-intensive curation, they opt to perform only a small amount of filtering on the alt-text. Although they don’t give a detailed explanation of their filtering process, it would stand to reason that they would filter out words that occurred very infrequently or extremely frequently. After the filtering process, they then have multiple noisy “labels”, one per word in the alt-text, per image. For the purposes of this review, we will focus on their efforts to use this data as supplementary data to train a network for validating on the Imagenet-1k validation data. As such, we only briefly mention that prior to training for the image classification task, they trained a different model to embed the image and alt-text pairs of their 1.8B image dataset into a shared embedding space where matched pairs were pushed together and unmatched pairs were pushed apart. They then used this embedding to give each of the images’ associated alt-text words different weights as labels.
The majority of networks, especially very deep networks like ResNets [18], employ Batch Normalization (BN) [25]. BN has the effect of smoothing the loss landscape, which allows for larger learning rates and larger batch sizes. However, BN is a costly operation, behaves differently during training than it does during evaluation, and breaks the independence among the training examples in each batch. Furthermore, BN results in a tight coupling of batch size to network performance such that when the batch size is too small, the network performs poorly.
The authors of [5] believe that in the long term, reliance on BN will impede progress in neural network research. They noted that by suppressing the scale of the activations on residual branches in ResNets, networks can be trained effectively without BN. Specifically, they propose Adaptive Gradient Clipping (AGC) which works by clipping the gradients based on the ratio of gradient norms to parameter norms. The authors note that their work is closely related to recent work studying “normalized optimizers” [3,56,57] which ignore gradient scale and instead choose adaptive learning rates inversely proportional to the gradient norms. They state that “AGC can be interpreted as a relaxation of normalized optimizers, which imposes a maximum update size based on the parameter norm but does not simultaneously impose a lower-bound on the update size or ignore the gradient magnitude”. The authors of [16] point out that with the heavily overparameterized models that are commonly in use, minimizing the training loss, which is the usual goal when training neural networks, can easily result in a suboptimal model. They propose a simple, yet effective approach of not only minimizing the training loss but while doing so, simultaneously minimizing the curvature of the loss landscape in the neighborhood of the loss. Among their other results, notably, they show that when using Sharpness-Aware Minimization (SAM), they achieve robustness to noisy labels “on par with that provided by state-of-the-art procedures
that specifically target learning with noisy labels”. In their related work section, they note that similar superior generalization had previously been observed by achieving wider minima, not by explicitly searching for them, but by arriving at them by evaluating on a moving average of the prior training weights [26].
The usual approach to online data augmentation is to draw n samples from the training data, augment each of them with whatever augmentation procedure is being followed, and then submit that batch of n augmented images to the training procedure. In [17], similar to earlier work in [22] and [7], the authors study the consequences of drawing n samples, augmenting each of them c times, and submitting a batch of size cn to be trained. One of their key findings is that for integer values of c greater than 1, higher accuracies were achieved, even with fixed batch sizes, which means the number of unique images in each batch was smaller. The authors noted that this was especially true in the cases of large batch sizes. The authors state of such models that “despite their superior performance on the test set, large augmentation multiplicities achieve slower convergence on the training set.” It is our opinion that it is not “despite” this but at least in part because of this. The authors also note that prior work has found that the regularizing effect of large learning rates was proportional to the batch size used [28,35,44]. An interesting hypothesis put forward by the authors is that this observation is a specific case, in which c is held at 1, of the more general principle that the regularizing effect of large learning rates is proportional to the number of unique training samples in the batch.
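The fixed-batch-size variant of this batch construction can be sketched as follows. The function names, the scalar stand-in for an image, and the jitter augmentation are hypothetical and only illustrate how the batch is assembled, not the augmentation pipeline used in [17].

```python
import random

def build_batch(dataset, batch_size, multiplicity, augment, rng):
    """Keep the total batch size fixed: draw batch_size // multiplicity unique samples
    and augment each of them `multiplicity` times."""
    assert batch_size % multiplicity == 0, "batch size must be divisible by the multiplicity"
    unique = rng.sample(dataset, batch_size // multiplicity)
    return [augment(x, rng) for x in unique for _ in range(multiplicity)]

def jitter(x, rng):
    # Stand-in augmentation for a scalar "image" (hypothetical).
    return x + rng.uniform(-0.1, 0.1)

rng = random.Random(0)
dataset = list(range(1000))  # toy dataset of scalar samples
batch = build_batch(dataset, batch_size=64, multiplicity=4, augment=jitter, rng=rng)
unique_sources = {round(v) for v in batch}
print(len(batch), len(unique_sources))
```

With a batch size of 64 and c = 4, the batch contains only 16 unique source samples, which is the trade-off the authors report as beneficial: fewer unique images per batch, but several independently augmented views of each.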
8 Conclusion
In this work, we reviewed the 22 papers that elucidate the methods that result in the top 40 accuracies on the ILSVRC 2012 Imagenet validation set as ranked on Papers with Code. An obvious trend is the interest in transformer-based architectures. Additionally, the general trend towards larger and larger model capacities as measured in the number of trainable parameters is readily apparent (see Fig. 1). One thing that could be overlooked, though, is that along with the trend towards increased model capacities there exists the trend toward using more and more additional training data (see Fig. 2), the two largest sets of which are not publicly available.
These trends present problems for independent researchers, researchers who are university faculty, and smaller labs. The first such problem is simply the availability of data. The creation of Imagenet represents a turning point in the history of computer vision. Up to that point, dataset sizes were most commonly measured in the tens of thousands of samples. Imagenet gave us a mega-scale dataset and has become the de facto measure of the state of the art as a result. The second problem is the compute power required to train giga-scale models on giga-scale data. For example, the highest-ranked model had to be trained for 20,100 TPUv3-core days. The published price for this much compute is over $300,000, and training would take 10 days using the largest TPUv3 pod that exists. On a consumer GPU like an NVIDIA GeForce RTX 3090, it would take approximately 18 years to train this model. As such, state-of-the-art research is now dominated by large corporations like Google, Microsoft, and Facebook.
What can we (the deep learning computer vision community) do to redemocratize the research of state-of-the-art methods? We are certainly not saying that the directions we have been going should be abandoned, nor that such research should be ceded to large corporations. Instead, we are saying that we should consider scaling up our efforts along an additional vector: the data we train these models with. We should consider prioritizing the collection of new standard benchmark datasets that fill the gaps between CIFAR-100 and Imagenet and between Imagenet and the internal giga-scale datasets of large corporations. Furthermore, we should start researching the quality of the data and develop analytical methods of measuring sufficiency in data quantity as opposed to simply assuming more will always be better.
Fig. 1. The number of model parameters used to achieve the top 40 best accuracies on the ILSVRC 2012 Imagenet validation set.
A. Byerly et al.
Fig. 2. The amount of extra training data used to achieve the top 40 best accuracies on the ILSVRC 2012 Imagenet validation set.
References 1. Ahmed, K., Baig, M.H., Torresani, L.: Network of experts for large-scale image categorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 516–532. Springer, Cham (2016). https://doi.org/10.1007/ 978-3-319-46478-7 32 2. Bao, H., Dong, L., Furu, W.: BEiT: BERT pre-training of image transformers (2021) 3. Bernstein, J., Vahdat, A., Yue, Y., Liu, M.-Y.: On the distance between two neural networks and the stability of learning. In: Advances in Neural Information Processing Systems, vol. 33, pp. 21370–21381 (2020) 4. Beyer, L., H´enaff, O.J., Kolesnikov, A., Zhai, X., van den Oord, A.: Are we done with ImageNet? (2020) 5. Brock, A., De, S., Smith, S.L., Simonyan, K.: High-performance large-scale image recognition without normalization (2021) 6. Chen, K., Xu, L., Chi, H.: Improved learning algorithms for mixture of experts in multiclass classification. Neural Netw. Off. J. Int. Neural Netw. Soc. 12(9), 1229–1252 (1999) 7. Choi, D., Passos, A., Shallue, C.J., Dahl, G.E.: Faster neural network training with data echoing (2019) 8. Chu, X., et al.: Conditional positional encodings for vision transformers (2021) 9. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: RandAugment: practical automated data augmentation with a reduced search space. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 3008–3017 (2020) 10. Dai, Z., Liu, H., Le, Q.V., Tan, M.: CoAtNet: marrying convolution and attention for all data sizes (2021)
SOTA in Image Classification: A Review
103
11. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2010) 12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019) 13. Dong, X., et al.: CSWin transformer: a general vision transformer backbone with cross-shaped windows (2021) 14. Lee, D.H.: Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In: ICML 2013 Workshop: Challenges in Representation Learning, July 2013 15. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. In: Ninth International Conference on Learning Representations (ICLR) (2020) 16. Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B.: Sharpness-aware minimization for efficiently improving generalization. In: Ninth International Conference on Learning Representations (ICLR) (2020) 17. Fort, S., Brock, A., Pascanu, R., De, S., Smith, S.L.: Drawing multiple augmentation samples per image during training efficiently decreases test error (2021) 18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016) 19. Hendrycks, D., Gimpel, K.: Gaussian Error Linear Units (GELUs) (2016) 20. Henighan, T., et al.: Scaling laws for autoregressive generative modeling (2020) 21. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015) 22. Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T., Soudry, D.: Augment your batch: better training with larger batches (2019) 23. 
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018) 24. Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10. 1007/978-3-319-46493-0 39 25. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning, pp. 448–456 (2015) 26. Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., Wilson, A.G.: Averaging weights leads to wider optima and better generalization (2018) 27. Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural Comput. 3(1), 79–87 (1991) 28. Jastrz¸ebski, S., et al.: Three factors influencing minima in SGD. In: International Conference on Artificial Neural Networks and Machine Learning (ICANN) (2018) 29. Jia, C., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Proceedings of the 38th International Conference on Machine Learning (PMLR) (2021) 30. Jiang, Z., et al.: All tokens matter: token labeling for training better vision transformers (2021)
104
A. Byerly et al.
31. Jordan, M.I., Jacobs, R.A.: Hierarchical mixtures of experts and the EM algorithm. In: Proceedings of 1993 International Conference on Neural Networks (IJCNN-93Nagoya, Japan), vol. 2, pp. 1339–1344 (1993) 32. Kaiser, L., Gomez, A.N., Chollet, F.: Depthwise separable convolutions for neural machine translation. In: ICLR 2018 - International Conference on Learning Representations (2018) 33. Kardas, M., et al.: AxCell: automatic extraction of results from machine learning papers. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 8580–8594 (2020) 34. Kolesnikov, A., et al.: Big Transfer (BiT): general visual representation learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 491–507. Springer, Cham (2020). https://doi.org/10.1007/978-3-03058558-7 29 35. Li, Y., Wei, C., Ma, T.: Towards explaining the regularization effect of initial large learning rate in training neural networks. In: Advances in Neural Information Processing Systems (2019) 36. Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: The International Conference on Computer Vision (ICCV) (2021) 37. Pham, H., Dai, Z., Xie, Q., Luong, M.T., Le, Q.V.: Meta pseudo labels (2020) 38. Qiao, S., Wang, H., Liu, C., Shen, W., Yuille, A.: Micro-batch training with batchchannel normalization and weight standardization (2019) 39. Ramachandran, P., Bello, I., Parmar, N., Levskaya, A., Vaswani, A., Shlens, J.: Stand-alone self-attention in vision models. In: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) (2019) 40. Riquelme, C., et al.: Scaling vision with sparse mixture of experts (2021) 41. Ryoo, M.S., Piergiovanni, A.J., Arnab, A., Dehghani, M., Angelova, A.: TokenLearner: what can 8 learned tokens do for images and videos? (2021) 42. Scudder, H.J.: Probability of error of some adaptive pattern-recognition machines. IEEE Trans. Inf. 
Theory 11(3), 363–371 (1965) 43. Shaw, P., Uszkoreit, J., Vaswani, A.: Self-attention with relative position representations. In: NAACL HLT 2018 - 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, vol. 2, pp. 464–468 (2018) 44. Smith, S.L., Dherin, B., Barrett, D.G.T., De, S.: On the origin of implicit regularization in stochastic gradient descent. In: International Conference on Learning Representations (ICLR) (2021) 45. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014) 46. Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks (2019) 47. Tan, M., Le, Q.V.: EfficientNetV2: smaller models and faster training (2021) 48. Tolstikhin, I., et al.: MLP-mixer: an all-MLP architecture for vision (2021) 49. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., J´egou, H.: Going deeper with image transformers. In: The International Conference on Computer Vision (ICCV) (2021) 50. Touvron, H., Vedaldi, A., Douze, M., J´egou, H.: Fixing the train-test resolution discrepancy. In: Advances in Neural Information Processing Systems (2019) 51. Touvron, H., Vedaldi, A., Douze, M., J´egou, H.: Fixing the train-test resolution discrepancy: FixEfficientNet. In: Advances in Neural Information Processing Systems, vol. 32 (2019)
SOTA in Image Classification: A Review
105
52. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5999–6009 (2017) 53. Wu, H., et al.: CvT: introducing convolutions to vision transformers. In: The International Conference on Computer Vision (ICCV) (2021) 54. Wu, Y., He, K.: Group normalization. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01261-8 1 55. Xie, Q., Luong, M.T., Hovy, E., Le, Q.V.: Self-training with noisy student improves ImageNet classification. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 10684–10695 (2020) 56. You, Y., Gitman, I., Ginsburg, B.: Large batch training of convolutional networks (2017) 57. You, Y., et al.: Large batch optimization for deep learning: training BERT in 76 minutes. In: Eighth International Conference on Learning Representations (ICLR) (2019) 58. Yuan, L., Hou, Q., Jiang, Z., Feng, J., Yan, S.: VOLO: Vision outlooker for visual recognition (2021) 59. Zhai, X., Kolesnikov, A., Houlsby, N., Beyer, L.: Scaling vision transformers (2021) 60. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup. In: International Conference on Learning Representations (ICLR) (2018) 61. Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings (2017)
Deep Convolutional Neural Networks for COVID-19 Detection from Chest X-Ray Images Using ResNetV2
Tomiris Rakhymzhan, Javad Zarrin(B), Mahdi Maktab-Dar-Oghaz, and Lakshmi Babu Saheer
Anglia Ruskin University, Cambridge CB1 1PT, UK
[email protected], {javad.zarrin,mahdi.maktabdar,lakshmi.babu-saheer}@aru.ac.uk
Abstract. COVID-19 has been identified as a highly contagious disease that is spreading rapidly around the world. Its high infection and mortality rates make it a very dangerous disease, and the World Health Organization has declared it a global pandemic. Existing COVID-19 testing methods, such as RT-PCR, are not completely reliable or convenient. Since the virus affects the respiratory tract, manual analysis of chest X-rays could be a more reliable testing technique, but it is neither convenient nor scalable. Hence, there is an urgent need for a faster, cheaper, and automated way of detecting the presence of the virus by analyzing chest X-ray images with deep learning algorithms. ResNetV2 is one of the pre-trained deep convolutional neural network models that could be explored for this task. This paper utilizes the ResNetV2 model for the detection of COVID-19 from chest X-ray images and aims to maximize performance on this task. The study fine-tunes a ResNetV2 network (specifically, ResNet101V2) in two main steps: first, training the model with the ResNetV2 base layers frozen, and second, unfreezing some layers of the ResNetV2 base and retraining with a lower learning rate. The model fine-tuned on ResNet101V2 shows competitive and promising results, with 98.40% accuracy and 97.24% sensitivity.
Keywords: Convolutional Neural Networks · Covid19 · ResNet · Fine-tuning
1 Introduction
COVID-19 (coronavirus disease) is a deadly respiratory infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus is highly contagious and is spreading around the world very quickly. At the time of writing, around 135 million people have been infected with COVID-19, with 2.9 million deaths [2]. One of the most prevalent methods to
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 106–116, 2022. https://doi.org/10.1007/978-3-031-10464-0_8
D-CNN for COVID-19 Detection from X-Ray Images
detect COVID-19 is the reverse transcriptase polymerase chain reaction (RT-PCR) test, which detects the presence of SARS-CoV-2. However, this type of testing gives rise to several problems: a shortage of testing kits, a significant delay of 48–72 h between testing and delivering the results to patients [1], the cost of testing kits (USD 120–130), and the cost of maintaining laboratories and PCR machines (USD 15,000 to USD 90,000) [20]. Moreover, the reliability of the tests is questionable, since the overall sensitivity of RT-PCR tests is estimated to be 87.8% and the specificity 98% [12]. Radiological image screening is another promising method for detecting COVID-19, since it is widely available in hospital settings. Several studies have shown that chest X-ray (CXR) images and computed tomography (CT) images of patients infected with COVID-19 tend to share features such as bilateral abnormalities [24], ground-glass opacification with occasional consolidation in the peripheries [15], and interstitial abnormalities [10]. Detecting such features manually requires considerable time and labour from professionals. Therefore, a more automated approach to detecting COVID-19 from CXR images is urgently needed. Deep learning methods are already widely used in the biomedical field. The convolutional neural network (CNN) is a deep learning method applied to images that has shown impressive results in feature extraction and supervised learning, producing low error rates. Pereira et al. [18] used CNNs for brain tumor segmentation, Kooi et al. [14] used CNNs for detecting mammographic lesions, and Alakwaa et al. [4] used CNNs on CT scans to detect cancer. Therefore, using CNNs to detect COVID-19 in CXR images has great potential. The main aim of the study is to fine-tune ResNet101V2 to maximise performance on the target task: classification of CXR images and accurate detection of COVID-19 positive cases.
The main contributions of the paper are as follows:
– A transfer learning approach was used to examine the performance of the existing ResNet101V2 network architecture on the COVID-19 classification task on CXR images.
– The proposed model achieved an accuracy of 98.40% and a sensitivity of 97.24%.
– The model was trained on a large dataset compared to those used in the literature.
The rest of the paper is organized as follows: Sect. 2 covers related literature. Section 3 presents the dataset and its pre-processing. Section 4 discusses the fine-tuning strategy and covers implementation details of the proposed model. Section 5 presents the evaluation results, followed by a discussion in Sect. 6. Finally, Sect. 7 presents the conclusion.
2 Related Work
There have been significant efforts in the research community to build deep learning solutions for the detection of COVID-19 from CXR images, which include both transfer learning techniques and using existing CNN architectures
T. Rakhymzhan et al.
for inspiration. In [22], the COVID-Net network leverages a lightweight residual projection-expansion-projection-extension (PEPX) design pattern and selective long-range connectivity. The main accomplishments of the network are its low computational complexity (11.75 million trainable parameters) and its architectural diversity. COVID-Net is reported to classify three classes (“COVID-19”, “Non-COVID-19”, “Normal”) with a test accuracy of 93.3% and a COVID-19 sensitivity of 91.0%. Farooq and Hafeez [7] presented a ResNet50-based model that achieved 96.23% accuracy and 100% COVID-19 sensitivity with 25.6 million trainable parameters. In the study by Ozturk et al. [17], the DarkCovidNet model builds on the existing Darknet-19 network architecture. The model classifies three classes (“COVID-19”, “No-findings”, “Pneumonia”), trains 1,164,434 parameters and achieves 87.02% accuracy and 85.35% sensitivity. Even though the model does not have high computational complexity, its accuracy and sensitivity are significantly lower than those of most of the other reviewed studies.
Narin et al. [16] compared five pretrained models: ResNet50, ResNet101, ResNet152, InceptionV3 and Inception-ResNetV2. The ResNet50-based model shows the best accuracy (99.7%) and sensitivity (98.8%) for binary classification between COVID-19 and bacterial pneumonia images. However, due to a lack of COVID-19 images, the dataset used in that study was highly imbalanced, which makes the reliability of the evaluation metrics questionable. The CoroNet network [13] takes a transfer learning approach with Xception as the base model, followed by a dropout layer and two fully-connected layers. The model achieves 89.6% accuracy and 98.2% COVID-19 sensitivity. Its downside is its high computational cost, since 33 million parameters are trained. The COVID-CAPS model [3] uses a Capsule Network pretrained on a dataset of X-ray images of various medical conditions. The model achieves 95.7% accuracy and 90% sensitivity, and trains a significantly lower number of parameters (295,488) than all the other reviewed studies. In [6], a model based on the Xception network showed 97.4% accuracy and 97.0% sensitivity. Rahman et al. [19] fine-tuned a ResNet101V2-based model from scratch with the maximum number of trainable parameters (44.7 million). However, the resulting accuracy of 92.1% is lower than the accuracy achieved in this study. In three-class classification, DenseNet201 [5] returns the best results among the other explored networks, such as SqueezeNet, MobileNetV2, ResNet18, InceptionV3, ResNet101, CheXNet and VGG19: 97.94% accuracy and 97.97% sensitivity.
In our study, a ResNet101V2 network pre-trained on the ImageNet dataset is used as the base model in the transfer learning approach. The fine-tuning process is done in multiple stages. In each stage, an increasing number of layers is unfrozen, starting from the top, and the model is retrained with a lower learning rate. The main difference between the selected ResNetV2 network and the original ResNet network is the use of pre-activation (batch normalization and ReLU are applied before each weight layer), which makes generalization easier [9].
3 Dataset
This section describes the dataset used in this study. The “COVID-19 Radiography Database” [5,19], hosted on Kaggle, was used; it was selected because it won the COVID-19 Dataset Award from the Kaggle community. The database combines images from the most trusted open-source repositories and contains COVID-19 (3616), viral pneumonia (1345), other lung infection (6012) and normal (10192) CXR images. Figure 1 shows a sample of each class from the database.
Fig. 1. Samples of CXR images of each class from the selected database.
For this study, the viral pneumonia and other lung infection CXR images were merged into one class. Therefore, this study classifies images into three classes: “covid”, “normal” and “other”. To ensure an equal distribution of CXR images across all classes, the over-represented classes were undersampled. 80% of the CXR images were allocated for training and the remaining 20% for testing and evaluation. During the training process, 20% of the training data is held out for validation. The distribution of images across the classes is shown in Fig. 2.
Fig. 2. Distribution of data across classes in training and test subsets.
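The class merging, undersampling and splitting described above can be sketched as follows. This is a minimal illustration working on image indices only; the dictionary keys, the rounding of split sizes and the random seed are assumptions, since the text does not specify them.

```python
import random

# Image counts per class as reported for the database.
raw_counts = {
    "covid": 3616,
    "viral_pneumonia": 1345,
    "other_lung_infections": 6012,
    "normal": 10192,
}

# Merge viral pneumonia and other lung infections into a single "other" class.
merged = {
    "covid": raw_counts["covid"],
    "normal": raw_counts["normal"],
    "other": raw_counts["viral_pneumonia"] + raw_counts["other_lung_infections"],
}


def balanced_split(counts, test_frac=0.2, val_frac=0.2, seed=0):
    """Undersample every class to the smallest class, then split the indices.

    Returns {label: {"train": [...], "val": [...], "test": [...]}}.
    The rounding rule and the seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    target = min(counts.values())                     # size of the smallest class
    n_test = round(test_frac * target)                # 20% of each class for testing
    n_val = round(val_frac * (target - n_test))       # 20% of the training part for validation
    splits = {}
    for label, n_images in counts.items():
        kept = rng.sample(range(n_images), target)    # random undersampling without replacement
        splits[label] = {
            "test": kept[:n_test],
            "val": kept[n_test:n_test + n_val],
            "train": kept[n_test + n_val:],
        }
    return splits


splits = balanced_split(merged)
for label, parts in splits.items():
    print(label, {name: len(indices) for name, indices in parts.items()})
```

With the reported counts, every class is reduced to the 3616 images of the smallest class before splitting.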
Data augmentation was applied using horizontal flips and zooming in or out of the image with a range of 0.2. Figure 3 shows a few examples of augmented training data.
Fig. 3. Augmented training data examples.
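The two augmentations mentioned above can be illustrated without any image library. The sketch below represents an image as a nested list of pixel values and interprets the 0.2 zoom range as a scale factor drawn uniformly from [0.8, 1.2]; that interpretation, the nearest-neighbour resampling and the 50% flip probability are assumptions, not details given in the text.

```python
import random


def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def zoom(image, factor):
    """Nearest-neighbour zoom about the image centre, keeping the output size.

    factor > 1 zooms in, factor < 1 zooms out; coordinates that fall outside
    the image are clamped to the nearest edge pixel.
    """
    height, width = len(image), len(image[0])
    centre_y, centre_x = (height - 1) / 2, (width - 1) / 2
    zoomed = []
    for y in range(height):
        src_y = min(max(round(centre_y + (y - centre_y) / factor), 0), height - 1)
        row = []
        for x in range(width):
            src_x = min(max(round(centre_x + (x - centre_x) / factor), 0), width - 1)
            row.append(image[src_y][src_x])
        zoomed.append(row)
    return zoomed


def augment(image, rng, zoom_range=0.2, flip_probability=0.5):
    """Apply a random horizontal flip followed by a random zoom."""
    if rng.random() < flip_probability:
        image = horizontal_flip(image)
    factor = rng.uniform(1 - zoom_range, 1 + zoom_range)
    return zoom(image, factor)


sample = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
augmented = augment(sample, random.Random(0))
print(augmented)
```

In practice a deep learning framework's own augmentation utilities would be used; the point here is only to make the two transformations concrete.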
Fig. 4. The pipeline for the proposed work.
4 Architecture and Methods
This section presents the fine-tuning strategy employed in this study. A diagrammatic representation of the training process is shown in Fig. 4. After the input dataset is preprocessed, the base model is frozen and a classification layer is added on top. The model is then trained and evaluated. Next, a few layers from the top of the base model are unfrozen, and the entire model is retrained with a lower learning rate and evaluated. If the performance metrics have improved, another round of fine-tuning is conducted. The process stops once the model stops improving. The main idea of the proposed fine-tuning framework is to gradually increase the number of layers that are unfrozen and tuned.
4.1 Proposed Model
By leveraging the existing architecture of ResNet101V2, this study applied the fine-tuning method of transfer learning to maximize the performance of the proposed model. Transfer learning is a technique in which a model is trained on a base dataset for a source task, and the learnt features are then repurposed for training on a different dataset for a different target task; this approach is especially advantageous when the target dataset is small [23]. Shin et al. [21] have shown that transfer learning from large-scale annotated natural image datasets such as ImageNet to computer-aided diagnosis problems has been consistently beneficial. In this study we performed fine-tuning, which is the process of backpropagating errors from the target task into the base (copied) features [23], i.e., adjusting the weights for the target task.
ResNet101V2 networks implement pre-activation (batch normalization and ReLU are applied before each weight layer), which makes generalization easier compared to the original ResNet network [8]. The main benefit of the pre-activation approach comes from the regularization and normalization of the resulting output signal, which in turn reduces overfitting. Training of the proposed model was executed in the following three stages:
Stage I: All layers of the base model are frozen and the pre-trained ImageNet weights are loaded. A top classification layer with a softmax activation function is added. The model is trained with the following hyperparameters: 40 epochs, batch size 128, Adam optimizer with a learning rate of 1e-3, and categorical cross-entropy loss. If learning stagnates, i.e. the validation loss does not decrease for 3 epochs, training is stopped. To further optimize training, if the validation accuracy does not improve for 3 epochs, the learning rate is multiplied by a factor of 0.1. The number of trained parameters at this stage is 6147.
Stage II: Part of the top layers are unfrozen (starting from layer 277). BatchNormalization layers were left frozen, since each such layer contains two non-trainable weights that keep track of the mean and variance of its inputs during training. If these layers were unfrozen, the updates to those statistics would destroy what the model has already learnt. Training proceeds with a lower learning rate of 1e-4. The rest of the hyperparameters remain unchanged. The number of trained parameters at this stage is 21,385,219.
Stage III: More layers are unfrozen (starting from layer 237). Training proceeds with a lower learning rate of 1e-5. The rest of the hyperparameters remain unchanged. The number of trained parameters is 25,255,939. Further fine-tuning of the proposed model provided no significant change in the evaluation metrics.
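The two monitoring rules described for Stage I (stop when the validation loss has not decreased for 3 epochs, scale the learning rate by 0.1 when the validation accuracy has not improved for 3 epochs) can be sketched in framework-independent Python. This is a simplified illustration of the logic only; the function name and the example histories are hypothetical, and the behaviour of a real framework's callbacks may differ in detail.

```python
def simulate_monitoring(val_losses, val_accuracies, learning_rate,
                        patience=3, factor=0.1):
    """Replay per-epoch validation metrics and apply both rules.

    Returns (last_epoch_run, final_learning_rate); epochs are counted from 1.
    """
    best_loss, best_accuracy = float("inf"), float("-inf")
    loss_wait = accuracy_wait = 0
    for epoch, (loss, accuracy) in enumerate(zip(val_losses, val_accuracies), start=1):
        # Rule 1: reduce the learning rate when accuracy plateaus.
        if accuracy > best_accuracy:
            best_accuracy, accuracy_wait = accuracy, 0
        else:
            accuracy_wait += 1
            if accuracy_wait >= patience:
                learning_rate *= factor
                accuracy_wait = 0
        # Rule 2: stop training when the loss stagnates.
        if loss < best_loss:
            best_loss, loss_wait = loss, 0
        else:
            loss_wait += 1
            if loss_wait >= patience:
                return epoch, learning_rate
    return len(val_losses), learning_rate


# Hypothetical validation history: both metrics stall after epoch 3.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
accuracies = [0.70, 0.80, 0.85, 0.85, 0.84, 0.85, 0.90]
stopped_at, final_lr = simulate_monitoring(losses, accuracies, learning_rate=1e-3)
print(stopped_at, final_lr)
```

In this made-up history both counters reach the patience limit in the same epoch, so the learning rate is lowered once and training stops before the last recorded epoch is reached.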
This process of unfreezing layers and retraining the model with a lower learning rate in a multi-stage manner proved to be an effective solution when computational capacity is limited. Training and evaluation of the model were performed in Google Colaboratory using T4 and P100 GPUs. The TensorFlow Keras framework was used to conduct the experiments.
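The three-stage schedule can be summarised as a small configuration plus a rule that decides which layers are trainable. The sketch below is framework-independent and uses a hypothetical toy list of layer types instead of the real network. The head-size check assumes a single dense softmax layer on a 2048-dimensional pooled feature vector; that assumption is consistent with the 6147 trainable parameters reported for Stage I, but the exact head used by the authors is not stated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    unfreeze_from: Optional[int]  # index of the first unfrozen base layer; None = base fully frozen
    learning_rate: float


# Layer indices and learning rates as described for the three stages.
STAGES = [
    Stage("I", None, 1e-3),
    Stage("II", 277, 1e-4),
    Stage("III", 237, 1e-5),
]


def trainable_mask(layer_kinds: List[str], unfreeze_from: Optional[int]) -> List[bool]:
    """Mark layers at or above `unfreeze_from` as trainable, keeping BatchNormalization frozen."""
    if unfreeze_from is None:
        return [False] * len(layer_kinds)
    return [
        index >= unfreeze_from and kind != "BatchNormalization"
        for index, kind in enumerate(layer_kinds)
    ]


def dense_head_params(num_features: int, num_classes: int) -> int:
    """Weights plus biases of one fully-connected classification layer."""
    return num_features * num_classes + num_classes


# Hypothetical stand-in for the base model: 300 layers, every third one a BatchNormalization layer.
toy_layers = ["BatchNormalization" if i % 3 == 0 else "Conv2D" for i in range(300)]

for stage in STAGES:
    mask = trainable_mask(toy_layers, stage.unfreeze_from)
    print(f"Stage {stage.name}: lr={stage.learning_rate}, trainable base layers={sum(mask)}")

print("Classification head parameters:", dense_head_params(2048, 3))
```

Lowering the starting index from 277 to 237 enlarges the trainable part of the network, while the BatchNormalization layers stay frozen in every stage.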
5 Results
The proposed model is evaluated on the unseen test dataset using Accuracy, Precision, Sensitivity, Specificity and F1-score metrics. These metrics are calculated using True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) base metrics of a confusion matrix.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{4}$$
$$\text{F1-Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$
Figure 5 demonstrates confusion matrices of each of the stages of the proposed model training. Figure 6 demonstrates training and validation accuracy graphs of each of the training stages. Figure 7 demonstrates training and validation loss graphs of each of the training stages. Table 1 provides metrics data gathered after each of the stages from confusion matrix classification reports.
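Equations (1) to (5) translate directly into code. The sketch below derives the one-vs-rest counts for a single class from a multi-class confusion matrix and then evaluates the five metrics; the confusion matrix is a made-up example, not the one obtained in this study, and the rows-are-true-labels convention is an assumption.

```python
def one_vs_rest_counts(confusion, cls):
    """Return (TP, TN, FP, FN) for class `cls`.

    Assumes rows are true labels and columns are predicted labels.
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                    # true cls, predicted as something else
    fp = sum(row[cls] for row in confusion) - tp     # other classes predicted as cls
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def f1_from(precision, recall):
    """Eq. (5): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def metrics(tp, tn, fp, fn):
    """Evaluate Eqs. (1) to (5) from the four base counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                          # recall is the sensitivity
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": f1_from(precision, recall),
    }


# Hypothetical 3-class confusion matrix (labels 0 = covid, 1 = normal, 2 = other).
example = [
    [90, 5, 5],
    [3, 95, 2],
    [4, 6, 90],
]
covid_counts = one_vs_rest_counts(example, cls=0)
covid_metrics = metrics(*covid_counts)
print(covid_counts)
print({name: round(value, 4) for name, value in covid_metrics.items()})

# Eq. (5) applied to the Stage III precision and sensitivity reported in Table 1.
print(round(f1_from(97.91, 97.24), 2))
```

Applying Eq. (5) to the precision and sensitivity values listed in Table 1 reproduces the F1 scores given there, which is a quick consistency check on the reported numbers.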
Fig. 5. Confusion matrices of each of the three training stages. Label “0” refers to “covid”, “1” refers to “normal”, “2” refers to “other”.
Fig. 6. Training and validation accuracy graphs of each of the training stages.
Fig. 7. Training and validation loss graphs of each of the training stages.
Table 1. Metrics recorded after each of the training stages.
Metric       Stage I   Stage II  Stage III
Accuracy     88.99%    98.17%    98.40%
Precision    88.15%    97.51%    97.91%
Sensitivity  77.07%    96.96%    97.24%
Specificity  94.88%    98.77%    98.98%
F1 score     82.24%    97.23%    97.57%
6 Discussion
In this study, fine-tuning was performed through multiple rounds of unfreezing layers and retraining. During each stage, training was stopped early once the validation and training metrics (accuracy and loss) converged. This multi-stage fine-tuning approach made it possible to maximize the final performance of the model. The number of layers to unfreeze at each stage was determined empirically. One of the important findings is that unfreezing a larger number of layers does not necessarily result in better performance, but rather in a tendency to overfit. Figure 7 shows that the loss curves flatten rapidly with every consecutive stage, which underlines the diminishing performance gain from each retraining round. The confusion matrices in Fig. 5 show that correct classification of COVID-19 positive cases is prevalent in all three stages of training. Since minimizing missed COVID-19 positive cases (i.e., FN) is the priority of the target task, sensitivity is one of the most important metrics to consider. Table 2 compares the performance of the models from the related studies reviewed in this paper. Among the models performing three-class classification, our proposed model returns the best accuracy and a competitive sensitivity. Comparing the proposed model with the related studies suggests that a higher number of trainable parameters can have a positive impact on the metrics. However, as the number of parameters to be learned increases, the improvement in the metrics becomes less significant. Another observation is that achieving high metric values becomes more challenging as the number of classes increases. The training of the models in the existing studies was limited by the small number of available COVID-19 CXR images. The significantly larger amount of data collected for this study noticeably increased the performance of the proposed model.
Table 2. Comparison with related work.
Study          Classes  Accuracy  Sensitivity  Parameters
Proposed work  3        98.40%    97.24%       25,255,939
[22]           3        93.30%    91.98%       11.75 M
[17]           3        87.02%    85.35%       1,164,434
[6]            3        97.40%    97.00%       22.9 M
[5]            3        97.94%    97.94%       20 M
[11]           3        94.20%    92.76%       2,873,106
[7]            4        96.23%    100%         25.6 M
[13]           4        89.60%    98.20%       33,969,964
[16]           2        99.70%    98.80%       25.6 M
7 Conclusion
This study conducted a multi-stage fine-tuning of ResNet101V2 for the task of classifying CXR images into three classes and detecting COVID-19 positives. The results show that gradual, partial unfreezing of layers over multiple rounds of retraining makes it possible to get the most out of the network. After the last stage of fine-tuning, the proposed model returned the following results: accuracy 98.40%, sensitivity 97.24%, precision 97.91%, specificity 98.98% and F1 score 97.57%. Comparison with the results of other related studies shows that the proposed model is highly competitive. Future work includes expanding the range of classes so that the model is able to detect specific COVID-19 variants.
References
1. How long do COVID-19 test results take? (2021)
2. Worldometer: COVID-19 coronavirus pandemic (2021)
3. Afshar, P., Heidarian, S., Naderkhani, F., Oikonomou, A., Plataniotis, K.N., Mohammadi, A.: COVID-CAPS: a capsule network-based framework for identification of COVID-19 cases from X-ray images. Pattern Recogn. Lett. 138, 638–643 (2020)
4. Alakwaa, W., Nassef, M., Badr, A.: Lung cancer detection and classification with 3D convolutional neural network (3D-CNN). Lung Cancer 8(8), 409 (2017)
5. Chowdhury, M.E.H., et al.: Can AI help in screening viral and COVID-19 pneumonia? IEEE Access 8, 132665–132676 (2020)
6. Das, N.N., Kumar, N., Kaur, M., Kumar, V., Singh, D.: Automated deep transfer learning-based approach for detection of COVID-19 infection in chest X-rays. IRBM 43, 114–119 (2020)
7. Farooq, M., Hafeez, A.: COVID-ResNet: a deep learning framework for screening of covid19 from radiographs. arXiv preprint arXiv:2003.14395 (2020)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
9. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 630–645. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_38
10. Huang, C., et al.: Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet 395(10223), 497–506 (2020)
11. Hussain, E., Hasan, M., Rahman, M.A., Lee, I., Tamanna, T., Parvez, M.Z.: CoroDet: a deep learning based classification for COVID-19 detection using chest X-ray images. Chaos Solitons Fractals 142, 110495 (2021)
12. Jarrom, D., et al.: Effectiveness of tests to detect the presence of SARS-COV-2 virus, and antibodies to SARS-COV-2, to inform COVID-19 diagnosis: a rapid systematic review. BMJ Evid.-Based Med. 27, 33–45 (2020)
13. Khan, A.I., Shah, J.L., Bhat, M.M.: CoroNet: a deep neural network for detection and diagnosis of COVID-19 from chest X-ray images. Comput. Meth. Prog. Biomed. 196, 105581 (2020)
14. Kooi, T., et al.: Large scale deep learning for computer aided detection of mammographic lesions. Med. Image Anal. 35, 303–312 (2017)
15. Li, Q., et al.: Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. N. Engl. J. Med. (2020)
16. Narin, A., Kaya, C., Pamuk, Z.: Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks. Pattern Anal. Appl. 24(3), 1207–1220 (2021). https://doi.org/10.1007/s10044-021-00984-y
17. Ozturk, T., Talo, M., Yildirim, E.A., Baloglu, U.B., Yildirim, O., Acharya, U.R.: Automated detection of COVID-19 cases using deep neural networks with X-ray images. Comput. Biol. Med. 121, 103792 (2020)
18. Pereira, S., Pinto, A., Alves, V., Silva, C.A.: Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans. Med. Imaging 35(5), 1240–1251 (2016)
19. Rahman, T., et al.: Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Comput. Biol. Med. 132, 104319 (2021)
20. Ramdas, K., Darzi, A., Jain, S.: ‘Test, re-test, re-test’: using inaccurate tests to greatly increase the accuracy of COVID-19 testing. Nat. Med. 26(6), 810–811 (2020)
21. Shin, H.C., et al.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35(5), 1285–1298 (2016)
22. Wang, L., Lin, Z.Q., Wong, A.: COVID-net: a tailored deep convolutional neural network design for detection of COVID-19 cases from chest X-ray images. Sci. Rep. 10(1), 1–12 (2020)
23. Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792 (2014)
24. Young, B.E., et al.: Epidemiologic features and clinical course of patients infected with SARS-COV-2 in Singapore. JAMA 323(15), 1488–1494 (2020)
Deep Neural Networks for Remote Sensing Image Classification
Giorgia Miniello1(B), Marco La Salandra2, and Gioacchino Vino1
1 Department of Physics, INFN Bari, Bari, Italy [email protected]
2 Department of Earth and Geoenvironmental Sciences, University of Bari, Bari, Italy
Abstract. Nowadays, Artificial Neural Networks are used along with remote sensing technology to perform land cover classification tasks, which are very useful for change detection and monitoring studies of hydro-geomorphological high-risk areas on the Earth surface. Within this framework, several Convolutional Neural Networks (CNNs) have been trained and tested on an original dataset of images acquired by UAV missions along the Basento River (in the Basilicata region, southern Italy). The dataset is made of more than 3000 aerial images which have been divided into classes and downgraded first to 80 cm/pixel and then to 20 cm/pixel, in order to be comparable with the spatial resolution of the images expected from a Very Low Earth Orbit satellite designed in the context of the CLOSE - Close to the Earth project. The data used have been gathered in the context of the RPASInAir project, which aims to enable innovative land-monitoring services through the use of data collected by Remotely Piloted Aircraft Systems (RPAS). A comparison of the performance of different CNNs is shown, and results are given in terms of model accuracy and loss. All the results have been obtained using the resources of the ReCaS-Bari data center HPC cluster.
Keywords: Convolutional Neural Networks · Remote sensing · ReCaS-Bari · Hydro-geomorphological area monitoring · Land cover classification · VLEO · CLOSE · RPASInAir
1 Introduction
Scientific and technological innovation plays a major role in the development of virtually every field. Environmental issues may be due to different geomorphological hazards, and a quick or real-time response in data processing can be crucial in emergency scenarios [1]. Moreover, high-resolution aerial image acquisition is fundamental for territorial mapping and change detection analysis of hydro-geomorphological high-risk areas [2]. In this context, analyzing huge areas of the Earth surface and processing the data as fast as possible turns out to be quite challenging for this kind of study.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 117–128, 2022. https://doi.org/10.1007/978-3-031-10464-0_9
The use of Unmanned Aerial Vehicles (UAVs) along with low-orbit satellites is therefore quite useful for investigating the least accessible areas. These two different but complementary technological scenarios are the subject of two projects: CLOSE [3] and RPASInAir [4]. Both projects are co-funded by European Union - SIF, Ricerca e Innovazione 2014–2020. The CLOSE (Close To The Earth) project aims to build a technological prototype opening access to missions at Very Low Earth Orbit (VLEO, i.e. lower than 250 km from the Earth surface). Its main goal is the design of a low-mass vehicle (i.e. below 500 kg, including the propulsion system and payload) with an operating life of at least three years. The RPASInAir project, instead, aims to enable innovative land-monitoring services through the use of data collected by Remotely Piloted Aircraft Systems (RPAS). It has two main goals: managing new categories of critical events directly linked to the flight of these systems (loss of datalink between air platform and pilot station, loss of ATM-pilot station connection, loss of vehicle cognitive capacities), and realizing a Synthetic Environment (SE) that simulates the behaviour of RPAS in order to plan operations with new types of RPAS in complex scenarios with a level of risk that can be considered under control. The Aerospace Technology District (DTA Scarl) is the leading proponent of both projects.
In this paper, a first study of the territorial classification of a restricted area over the Basento River in the Basilicata region (southern Italy) is presented. The analysis is based on a supervised learning method using Deep Neural Networks (DNNs). As already mentioned, the CLOSE project aims to design a VLEO satellite prototype, and the aerial images that the system would acquire are expected to have a very high spatial resolution.
Since no Earth images at low or very low orbits were available, our work aimed to simulate the spatial resolution expected in this case by downgrading very-high-resolution images acquired during UAV missions. The UAV data acquisition was performed with a “DJI Inspire 2”, a quadcopter with an aluminum-magnesium composite body and carbon fiber arms, equipped with an on-board computer system and a “Zenmuse X5S” optical sensor. The flight altitude was on average 50 m above the ground level of the take-off location, in order to obtain the desired Ground Sampling Distance (GSD) of ∼1 cm/pixel. In Sect. 2 the computing infrastructure used for this study is presented; in Sect. 3 the method and the neural networks used are described in detail; in Sect. 4 the results of two different model implementations are shown along with the metrics adopted; lastly, in Sect. 5 observations about our work are briefly discussed. The aerial images used were taken in order to perform photogrammetry study activities which are part of the objectives of a PhD project in Geosciences, “Application of UAV system and SfM techniques to assess the hydrogeological hazard of a fluvial system”, at the Department of Earth and Environmental Sciences of Bari and of the “RPASinAir - Integrazione dei Sistemi Aeromobili a Pilotaggio
Remoto nello spazio aereo non segregato per servizi”, PON ricerca e innovazione 2014–2020. All the results of these activities are collected in the paper “Generating UAV high-resolution topographic data within a FOSS photogrammetric workflow using high-performance computing clusters” [5].
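The resolution downgrade described in the Introduction (from a GSD of about 1 cm/pixel to 80 cm/pixel or 20 cm/pixel) can be illustrated with a minimal sketch. The paper does not state which resampling method was used, so the block averaging below is only an assumption, and the function names are hypothetical. It works for integer scale factors on a single-channel image stored as a list of rows; a non-integer factor would need a proper resampling routine.

```python
def linear_factor(native_gsd_cm, target_gsd_cm):
    """Linear downscale factor between two ground sampling distances."""
    return target_gsd_cm / native_gsd_cm


def pixel_reduction(native_gsd_cm, target_gsd_cm):
    """How many native pixels are merged into one downgraded pixel."""
    return linear_factor(native_gsd_cm, target_gsd_cm) ** 2


def block_average(image, factor):
    """Downgrade a 2-D image by averaging factor x factor blocks.

    Assumption: block averaging stands in for the (unspecified)
    resampling used in the paper. Rows/columns that do not fill a
    complete block are dropped.
    """
    out_rows = len(image) // factor
    out_cols = len(image[0]) // factor
    result = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            block = [
                image[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


# Toy 4 x 4 image (hypothetical values), downgraded by a factor of 2.
toy = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, 6, 6],
    [2, 2, 6, 6],
]
print(block_average(toy, 2))
print(linear_factor(1.0, 80.0), pixel_reduction(1.0, 80.0))
```

With a nominal GSD of 1 cm/pixel, going to 80 cm/pixel merges 80 × 80 native pixels into one, and going to 20 cm/pixel merges 20 × 20.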
2 ReCaS-Bari Cloud Infrastructure: The HPC Cluster
All the computational tasks have been executed on the IT resources of the ReCaS-Bari data center [6]. In order to build and run the NNs, the resources belonging to the new HPC ReCaS-Bari cluster were used. In particular, we exploited the new ReCaS-Bari Cloud Infrastructure at the HPC cluster. This new cluster of machines has been configured to allow users to take advantage of the high-performing hardware and speed up their complex algorithms. Currently, the cluster is composed of five machines, each with:
– 4 NVIDIA V100 32 GB GPUs
– 96 CPUs
– 753.5 GB RAM
– 6 TB SSD disk
Thanks to their high number of cores, the GPUs allow high-performance parallel algorithms to run with a reduced overall execution time. The configured cluster is able to run batch jobs or to open interactive environments where users can write code and test it in real time. For our task, we ran the neural networks interactively using Jupyter notebooks.
3 Method
Before implementing the deep neural networks, a review of the most used models in land cover classification studies was performed. One of the most used and studied satellite image datasets for land cover classification is the EuroSAT [7] dataset (available at https://github.com/phelber/eurosat), which we explored as a first approach before focusing on our original UAV dataset. The EuroSAT dataset consists of a collection of 27000 Sentinel-2 satellite images with 13 spectral bands and 10 prelabeled classes: “Annual Crop”, “Forest”, “Herbaceous Vegetation”, “Highway”, “Industrial”, “Pasture”, “Permanent Crop”, “Residential”, “River” and “Sea Lake”. The results of this first exploration were presented at the International Symposium on Grids & Clouds 2021 [8]. During that study, which involved a series of different ML models, two networks turned out to perform particularly well: the Max Pooling Model and the VGG16 Keras Model. Taking advantage of this experience, we chose the same models to train the networks for our new dataset.
Max Pooling [9] is a sample-based discretization process. It aims to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensionality. It extracts the largest value in each patch of each feature map. The results are down-sampled (pooled) feature maps which highlight the most prominent feature in the patch rather than the average one. In practice it often works better than average pooling for computer vision tasks (e.g. image classification). Fig. 1 illustrates Max Pooling with a 2 × 2 pixel filter applied to a 4 × 4 pixel input: at each position the filter takes the maximum value of its patch, producing a new output with a size of 2 × 2 pixels.
Fig. 1. A scheme of max pooling with a 2 × 2 pixel filter size from 4 × 4 pixel input
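The 2 × 2 max pooling of a 4 × 4 input described above can be reproduced with a few lines of plain Python. This is only a sketch of the operation itself, not of the Keras layer cited in [9]; the function name and the input values are hypothetical and are not taken from Fig. 1.

```python
def max_pool_2d(feature_map, pool=2, stride=2):
    """Max pooling over a 2-D feature map stored as a list of rows.

    Each pool x pool patch is replaced by its largest value; patches
    are taken every `stride` pixels (non-overlapping when stride == pool).
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - pool + 1, stride):
        pooled_row = []
        for c in range(0, cols - pool + 1, stride):
            patch = [
                feature_map[r + i][c + j]
                for i in range(pool)
                for j in range(pool)
            ]
            pooled_row.append(max(patch))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4 x 4 input, pooled down to 2 x 2.
example = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
print(max_pool_2d(example))  # [[6, 5], [7, 9]]
```

Replacing `max(patch)` with `sum(patch) / len(patch)` would give average pooling, which makes the difference between the two operations easy to compare on the same input.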
The second model adopted, the VGG16 Keras Model [10], consists of 13 convolutional layers, 5 Max Pooling layers, and 3 Dense layers. The number 16 in VGG16 refers to the 16 layers that have weights (i.e. learnable parameters): the 13 convolutional layers plus the 3 Dense layers. Convolution and Max Pooling layers are arranged throughout the whole architecture. Conv-1 has 64 filters, Conv-2 has 128 filters, Conv-3 has 256 filters, and Conv-4 and Conv-5 have 512 filters. In Fig. 2 an illustration of the VGG16 Keras Model architecture is shown. The VGG network was first introduced by researchers of the Visual Geometry Group (VGG) at Oxford. This network is characterized by a pyramidal shape, where the bottom layers, which are closer to the image, are wide, whereas the top layers are deep. The main advantages of this model are that it is a very good architecture for benchmarking on particular tasks and that pre-trained VGG networks are freely available on the internet, so it is commonly used for various applications. On the other hand, training this model from scratch, without pre-training, can be very slow.
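The layer counts given above can be checked with a short calculation. The sketch below is not the network used in this study: it assumes the standard VGG16 configuration (3 × 3 kernels, blocks of 2, 2, 3, 3 and 3 convolutions, a 224 × 224 × 3 input and the original dense head of 4096, 4096 and 1000 units), whereas the classification head actually attached to the convolutional base in this work is not specified in the text.

```python
# (number of 3x3 convolutions, filters) for Conv-1 ... Conv-5.
# The per-block split is the standard VGG16 layout (assumption).
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
# Dense head of the original ImageNet model (assumption).
DENSE_UNITS = [4096, 4096, 1000]


def count_vgg16(input_size=224, channels=3):
    """Return (conv layers, dense layers, conv params, dense params)."""
    conv_layers = 0
    conv_params = 0
    in_channels = channels
    size = input_size
    for n_layers, filters in VGG16_BLOCKS:
        for _ in range(n_layers):
            # 3x3 kernel weights plus one bias per filter.
            conv_params += 3 * 3 * in_channels * filters + filters
            in_channels = filters
            conv_layers += 1
        size //= 2  # every block ends with a 2x2 max pooling layer

    dense_params = 0
    in_units = size * size * in_channels
    for units in DENSE_UNITS:
        dense_params += in_units * units + units
        in_units = units
    return conv_layers, len(DENSE_UNITS), conv_params, dense_params


convs, denses, conv_params, dense_params = count_vgg16()
print(convs + denses, conv_params, conv_params + dense_params)
```

Under these assumptions the count gives 13 + 3 = 16 weight layers, with most of the parameters sitting in the dense head rather than in the convolutional base, which is one reason why reusing only the pre-trained convolutional base is attractive.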
Fig. 2. A scheme of the architecture of the VGG16 Keras model
4 Land Cover Classification Results Using Deep Neural Networks
After the acquisition of the images, a first dataset of 3554 images was chosen and divided into classes in order to perform a land cover classification [11] of the selected area of the Basento River (in the Basilicata region, southern Italy) using Machine Learning (ML) techniques [12]. For change detection studies, three main classes were considered in order to satisfy a “minimal classification” task: Ground, Vegetation, and Water. For a very first exploration of these classes and a first study of possible ML models for land cover classification, the images were first downgraded to 80 cm/pixel (size: 64 pixel × 48 pixel) and then to 20 cm/pixel (size: 198 pixel × 264 pixel). These spatial resolutions were chosen to scan the range expected to be of interest for the CLOSE project. The population of each class is shown in Table 1 and in Fig. 3. Several models were tested for territorial classification of the monitored area, and only the best ones for each dataset are presented here. It is worth underlining that this is a preliminary study aimed at defining the best configuration for aerial image dataset production and the best spatial resolution for land-cover classification and change detection monitoring. The different population of each class may affect the NN performance. Some image samples for each class can be found in Fig. 4.
Table 1. Population for each class: Ground, Vegetation, and Water.
Class label   Class name   Population
0             Ground       817
1             Vegetation   1539
2             Water        1198
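To make the class imbalance in Table 1 concrete, the following sketch computes the share of each class and two simple derived figures. The counts are those of Table 1; the variable names and the choice of derived quantities are only illustrative and are not part of the original analysis.

```python
# Class populations from Table 1.
CLASS_COUNTS = {"Ground": 817, "Vegetation": 1539, "Water": 1198}


def class_shares(counts):
    """Fraction of the dataset belonging to each class."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(CLASS_COUNTS.values())
shares = class_shares(CLASS_COUNTS)

# Accuracy of a trivial classifier that always predicts the largest class.
majority_baseline = max(shares.values())

# Ratio between the most and the least populated class.
imbalance_ratio = max(CLASS_COUNTS.values()) / min(CLASS_COUNTS.values())

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"majority baseline: {majority_baseline:.1%}")
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

The three populations add up to the 3554 images of the dataset, and Vegetation is almost twice as populated as Ground, which is worth keeping in mind when reading the per-class accuracies reported below.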
As mentioned above, the first dataset was downgraded from 1.09 cm/pixel to 80 cm/pixel. In this case, two models were considered: a max pooling one and a VGG16 Keras Model. The first one consists of a sequence of pairs of max pooling and convolution layers ending with a dropout layer (30%) and a dense layer. From Fig. 5 it can be observed that, after 150 epochs, a model accuracy (a) of ∼92.2% was reached, but a great instability in the model loss (b) was also noticed. Figure 6 shows a summary of the accuracy for each class compared with the overall one. With this model a non-negligible overfitting (the vertical gap between the train and the test curves) can be observed starting from epoch 40; we therefore decided to explore a different model in order to exploit data augmentation techniques.
Fig. 3. Population for each class: Ground, Vegetation and Water.
Fig. 4. Aerial image samples for each class. From left to right: Water, Vegetation and Ground.
Fig. 5. Score for max pooling DNN model for the 80 cm/pixel dataset: (a) model accuracy; (b) model loss
Fig. 6. Summary of the accuracy for each class for the max pooling DNN model for the 80 cm/pixel dataset
Hence, for the second model a different approach was used: a convolutional base of the VGG16 Keras Model (with pre-loaded weights) was added in order to exploit data augmentation techniques (i.e. horizontal flip, vertical flip and rotation), as mentioned above, and to improve the results. Figure 7 shows a summary of the parameters used for the network. In addition to ModelCheckpoint, the EarlyStopping and ReduceLROnPlateau callback functions were added to limit the overfitting and to reduce the learning rate if no improvements are seen after a fixed number of epochs. The training, set to 150 epochs, was stopped early after 55 epochs. From Fig. 8 it can be observed that at the end of the training a model accuracy (a) of ∼94% was reached, and the model loss (b) was also reduced with respect to the previous max pooling model. Figure 9 shows a summary of the accuracy for each class compared with the overall one. It can be observed that with this model the overfitting is highly reduced, making this second model the best among those tested.
The second dataset was downgraded to 20 cm/pixel, and in this case only the Max Pooling model was used, since the VGG16 Keras one was found to be inefficient in terms of processing time, probably because the matrices involved grew considerably with the increased size of the images. Figure 10 shows a summary of the parameters used for the network. For this training two GPUs were used, instead of the single one used with the previous dataset. In this case, the dropout layer was set at 90%. From Fig. 11 it can be observed that, after 100 epochs, a model accuracy of ∼93% was reached with a model loss of about 19%. Figure 12 shows a summary of the accuracy for each class compared with the overall one. The overall training process took about 1 h, versus 5 min for the previous method, and no significant improvement in the accuracy or in the minimization of the loss function was observed.
Nevertheless, it can be observed that, even though the accuracy is slightly lower than the one obtained with the previous dataset downgraded to 80 cm/pixel using the VGG16 Keras Model, the test curve follows the training curve considerably better, indicating fewer overfitting issues. It is also useful to make some observations about the misclassification of the classes and the percentage of wrong predictions for each one. The limited number of classes prevents us from making the same observation made using different datasets (e.g. EuroSAT from the Sentinel-2 mission), where a reciprocity in misclassification was found (e.g. the class Vegetation is mostly misclassified as Water and vice versa). Obviously, the more populated the dataset, the more evident this behaviour becomes. Figure 13 shows the number of wrong predictions for each class. As expected, the Ground class is the worst classified, being the least populated, and the Vegetation class is the best, being the most populated.
Fig. 7. Summary of the parameters for the max pooling model for the 80 cm/pixel dataset
Fig. 8. Score for the VGG16 Keras model for the 80 cm/pixel dataset: (a) model accuracy; (b) model loss
Fig. 9. Summary of the accuracy for each class for the VGG16 Keras model for the 80 cm/pixel dataset
Fig. 10. Summary of the parameters for the max pooling model for the 20 cm/pixel dataset
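The data augmentation used with the VGG16-based model (horizontal flip, vertical flip and rotation) can be sketched on a tiny image stored as a list of rows. This is an illustration in plain Python, not the Keras pipeline used in the study; the rotation angle is not stated in the text, so a 90° clockwise rotation is assumed here, and all function names are hypothetical.

```python
def horizontal_flip(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def vertical_flip(image):
    """Mirror the image top to bottom."""
    return [list(row) for row in reversed(image)]


def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise (assumed angle)."""
    return [list(row) for row in zip(*reversed(image))]


def augment(image):
    """Return the original image together with its three variants."""
    return [
        [list(row) for row in image],
        horizontal_flip(image),
        vertical_flip(image),
        rotate_90_clockwise(image),
    ]


# Hypothetical 2 x 2 single-channel image.
tile = [
    [1, 2],
    [3, 4],
]
for variant in augment(tile):
    print(variant)
```

Each transformed copy keeps the label of the original image, so the training set grows without any new acquisition, which is what helps to limit overfitting on a small dataset.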
Fig. 11. Score for the max pooling DNN for 20 cm/pixel dataset: (a) model accuracy; (b) model loss
Fig. 12. Summary of the accuracy for each class for the max pooling DNN for 20 cm/pixel dataset
Fig. 13. Wrong prediction for each class for the max pooling model with the 20 cm/pixel dataset
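A count of wrong predictions per class, of the kind summarized in Fig. 13, can be obtained from the true and predicted labels as sketched below. The class labels 0, 1 and 2 follow Table 1, but the label lists are hypothetical toy data and do not reproduce the results of the paper; the function name is likewise only illustrative.

```python
from collections import Counter

# Class labels as in Table 1.
CLASS_NAMES = {0: "Ground", 1: "Vegetation", 2: "Water"}


def wrong_predictions_per_class(y_true, y_pred):
    """Map each class name to (wrong predictions, total samples)."""
    wrong = Counter()
    totals = Counter()
    for truth, predicted in zip(y_true, y_pred):
        totals[truth] += 1
        if truth != predicted:
            wrong[truth] += 1
    return {CLASS_NAMES[c]: (wrong[c], totals[c]) for c in sorted(totals)}


# Hypothetical toy labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 1, 1, 1, 1, 2, 2, 0]

for name, (n_wrong, n_total) in wrong_predictions_per_class(y_true, y_pred).items():
    print(f"{name}: {n_wrong} wrong out of {n_total}")
```

Dividing the number of wrong predictions by the class population gives the per-class error rate, which is easier to compare across classes of different size than the raw counts.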
5 Conclusions
A preliminary study of land cover classification of one reach of the Basento River in Basilicata (southern Italy) was performed. An original dataset of 3554 aerial images acquired during a UAV mission was downgraded first to a spatial resolution of 80 cm/pixel and then to 20 cm/pixel. Two models, a Max Pooling one and a VGG16 Keras Model, were chosen. The first model was adopted for both datasets, while the second one was used only for the 80 cm/pixel dataset. In terms of model accuracy, the best model was the VGG16 Keras one. The dataset was part of a larger aerial image sample originally acquired for a specific photogrammetric study of the explored area, for which very high spatial resolution images are mandatory. Nonetheless, the use of a single GPU, the general efficiency of processing and the best performance of the model suggest that, for the purpose of a territorial classification, a spatial resolution between 20 cm/pixel and 80 cm/pixel can be considered a good compromise between model training efficiency and processing time. The ultimate goal would be to get real-time results for monitoring studies, for which high-performing networks and technologies are mandatory. Further immediate improvements can be achieved using a more populated dataset and more homogeneously populated classes.
References
1. Sun, A.Y., Scanlon, B.R.: How can Big Data and machine learning benefit environment and water management: a survey of methods, applications, and future directions. Environ. Res. Lett. 14, 073001 (2019)
2. Fichera, C.R., Modica, G., Pollino, M.: Land Cover classification and change-detection analysis using multi-temporal remote sensed imagery and landscape metrics. Eur. J. Remote Sens. 45(1), 1–8 (2012). https://doi.org/10.5721/EuJRS20124501
3. https://www.dtascarl.org/progettidta/close-to-the-earth/
4. https://www.dtascarl.org/progettidta/rpasinair/
5. La Salandra, M., et al.: Generating UAV high-resolution topographic data within a FOSS photogrammetric workflow using high-performance computing clusters. Int. J. Appl. Earth Observ. Geoinf. 105, 102600 (2021). https://doi.org/10.1016/j.jag.2021.102600
6. https://www.recas-bari.it
7. Helber, P., Bischke, B., Dengel, A., Borth, D.: Eurosat: a novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 12, 2217–2226 (2019)
8. Miniello, G., La Salandra, M.: A new method for geomorphological studies and land cover classification using Machine Learning techniques. PoS(ISGC2021)031 (2021)
9. https://keras.io/api/layers/pooling_layers/max_pooling2d/
10. https://keras.io/api/applications/vgg/
11. Vandana, S.: Land Cover classification using machine learning techniques - a survey. Int. J. Eng. Tech. Res. 9(06) (2020). https://doi.org/10.17577/IJERTV9IS060881
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations. arXiv:1409.1556 (2015)
Linear Block and Convolutional MDS Codes to Required Rate, Distance and Type
Ted Hurley(B)
National University of Ireland Galway, Galway, Ireland [email protected]
Abstract. Algebraic methods for the design of series of maximum distance separable (MDS) linear block and convolutional codes to required specifications and types are presented. Algorithms are given to design codes to required rate and required error-correcting capability and required types. Infinite series of block codes with rate approaching a given rational R with 0 < R < 1 and relative distance over length approaching (1 − R) are designed. These can be designed over fields of given characteristic p or over fields of prime order and can be specified to be of a particular type such as (i) dual-containing under Euclidean inner product, (ii) dual-containing under Hermitian inner product, (iii) quantum error-correcting, (iv) linear complementary dual (LCD). Convolutional codes to required rate and distance and infinite series of convolutional codes with rate approaching a given rational R and distance over length approaching 2(1 − R) are designed. The designs are algebraic and properties, including distances, are shown algebraically. Algebraic explicit efficient decoding methods are referenced.
Keywords: Code · MDS · Dual-containing · QECC · LCD · Convolutional
MSC Classification: 94B05 · 11T71 · 16S99
1 Introduction
1.1 Motivation, Summary
Linear block and convolutional codes are error-correcting codes which are used extensively in many applications including digital video, radio, mobile communication, and satellite/space communications. Maximum distance separable (MDS) codes are of particular interest. Dual-containing codes, which lead to the design of quantum error-correcting codes (QECCs), and linear complementary dual (LCD) codes are also of great interest, with many applications. Codes for which there exist efficient decoding methods are necessary for applications.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 129–157, 2022. https://doi.org/10.1007/978-3-031-10464-0_10
This paper gives design methods for both linear block codes and convolutional codes of the highest distance possible for a particular length and rate. The design methods are then extended to particular types of codes. Types here include DC (dual-containing), QECC (quantum error-correcting codes) and LCD (linear complementary dual). MDS convolutional codes are designed, where MDS here means the codes attain the GSB (generalised Singleton bound, see Sect. 1.2 below for definition) for convolutional codes. The methods allow the design of codes to given specifications and the design of infinite series of such codes. The block linear MDS DC codes are designed under both Euclidean and Hermitian inner products and lead to MDS QECCs under Euclidean or Hermitian inner products. For given (allowable) rate R and given distance (2t + 1) (that is, for specified error-correcting capability), design methods for MDS block linear codes with efficient decoding algorithms are given. Design methods for types of such codes are then derived, where ‘types’ can be DC or LCD; QECCs are obtainable from DC codes. These are further specified to be over fields of characteristic a fixed prime p or over a field of prime order. In fields of prime order the arithmetic is modular arithmetic, which is particularly nice and efficient. Infinite series of codes, which can be required to be DC or LCD, in which the rate approaches a given R and the relative distance (ratio of distance over length) approaches (1 − R) are designed. These infinite series can be specified to be codes with characteristic a given prime p or to be codes over a field of prime order. Again note that QECCs are obtainable from DC codes, and if the DC code is MDS then the QECC obtained has the maximum distance attainable for a QECC with those parameters. In general the convolutional codes designed offer better distances than the equivalent block linear code of the same length and rate. Note that the convolutional codes are designed algebraically.¹
1.2 Background, Notation
Background on coding theory may be found in [1–5] and many others. The notation for linear block codes is fairly standard. Here [n, r, d] denotes a linear block code of length n, dimension r and distance d. The maximum distance attainable by an [n, r] linear block code is (n − r + 1) and this is known as the Singleton bound, see [1] or [4]. A linear code [n, r] attaining the maximum distance possible is called an MDS (maximum distance separable) code. The ratio of distance over length features here and we refer to this as the relative distance, rdist for short, of the code. The MDS linear block codes are those with maximum error-correcting capability for a given length and dimension. MacWilliams and Sloane refer to MDS codes in their book [5] as “one of the most fascinating chapters in all of coding theory”; MDS codes are equivalent to geometric objects called n-arcs and combinatorial objects called orthogonal arrays, [5], and are, quote, “at the heart of combinatorics and finite geometries”.
¹ There exist very few algebraic constructions for designing convolutional codes and search methods limit their size and availability, see McEliece [4] for discussion and also [10–13].
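The Singleton bound recalled above can be checked by brute force for very small codes. The sketch below is not one of the design methods of this paper: it simply enumerates all codewords of a code over a prime field GF(p), given by a generator matrix with linearly independent rows, and compares the minimum distance with n − r + 1. The function names and both example matrices are chosen only for illustration.

```python
from itertools import product


def min_distance(generator, p):
    """Minimum Hamming weight of a nonzero codeword over GF(p).

    Assumes p is prime and the rows of `generator` are linearly
    independent, so every nonzero message gives a nonzero codeword.
    """
    r = len(generator)
    n = len(generator[0])
    best = n
    for message in product(range(p), repeat=r):
        if not any(message):
            continue
        codeword = [
            sum(m * g for m, g in zip(message, column)) % p
            for column in zip(*generator)
        ]
        best = min(best, sum(1 for x in codeword if x))
    return best


def is_mds(generator, p):
    """True when the [n, r] code meets the Singleton bound n - r + 1."""
    r, n = len(generator), len(generator[0])
    return min_distance(generator, p) == n - r + 1


# [4, 2] code over GF(5): evaluations of 1 and x at x = 1, 2, 3, 4.
G_MDS = [
    [1, 1, 1, 1],
    [1, 2, 3, 4],
]
# [4, 2] code over GF(5) that does not meet the bound.
G_NOT_MDS = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
print(min_distance(G_MDS, 5), is_mds(G_MDS, 5))
print(min_distance(G_NOT_MDS, 5), is_mds(G_NOT_MDS, 5))
```

The first code has distance 3 = 4 − 2 + 1, since a nonzero polynomial of degree at most 1 vanishes at no more than one of the four evaluation points. Enumeration grows as p^r, so this check is practical only for toy parameters.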
A dual-containing, DC, code $C$ is a code which contains its dual $C^{\perp}$; thus a DC code is a code $C$ such that $C \cap C^{\perp} = C^{\perp}$. A linear complementary dual, LCD, code is one whose intersection with its dual is zero, that is, a code $C$ such that $C \cap C^{\perp} = 0$. LCD codes and DC codes are ‘supplemental’ to one another in the sense that $C$ is DC if $C \cap C^{\perp} = C^{\perp}$ and $C$ is LCD if $C \cap C^{\perp} = 0$. We shall see this further in action when MDS DC block linear codes are extended to MDS LCD convolutional codes and LCD MDS codes are extended to MDS DC convolutional codes.
Why DC? DC codes have been studied extensively, in particular since they lead by the CSS construction to the design of quantum error-correcting codes, QECCs, see [22] and also [23]. The CSS constructions are specified as follows:
– Let $C$ be a linear block code $[n, k, d]$ over $GF(q)$ containing its dual $C^{\perp}$. The CSS construction derives a quantum (stabilizer) $[[n, 2k - n, \geq d]]$ code over $GF(q)$.
– Let $D$ be a linear block code over $GF(q^2)$ containing its Hermitian dual $D^{\perp_H}$. The CSS construction derives a quantum (stabilizer) $[[n, 2k - n, \geq d]]$ code over $GF(q^2)$.
For more details on CSS constructions of QECCs see [39,40]; proofs of the above may also be found therein. The work of [39] follows from Rains’ work on non-binary codes [38]. As noted in, for example, [39], if the DC code used for the CSS construction is an MDS linear code then the quantum code obtained is a quantum MDS code, which means it has the best possible distance attainable for such a quantum code.
Why LCD? LCD codes have been studied extensively in the literature. For background, history and general theory on LCD codes, consult the nice articles [14–16,21] by Carlet, Mesnager, Tang, Qi and Pellikaan. LCD codes were originally introduced by Massey in [19,20].
These codes have been studied, amongst other things, for improving the security of information on sensitive devices against side-channel attacks (SCA) and fault non-invasive attacks, see [17], and have found use in data storage and communications systems.
Notation for Convolutional Codes. Notation(s) for convolutional codes can be confusing. Different equivalent definitions are given in the literature and these are analysed nicely in [31]. The following definition is followed here. A rate $k/n$ convolutional code with parameters $(n, k, \delta)$ over a field $\mathbb{F}$ is a submodule of $\mathbb{F}[z]^n$ generated by a reduced basic matrix $G[z] = (g_{ij}) \in \mathbb{F}[z]^{r \times n}$ of rank $r$, where $n$ is the length and $\delta = \sum_{i=1}^{k} \delta_i$ is the degree, with $\delta_i = \max_{1 \le j \le k} \deg g_{ij}$. Also $\mu = \max_{1 \le i \le r} \delta_i$ is known as the memory of the code, and the code may then be given with parameters $(n, k, \delta; \mu)$. The parameters $(n, r, \delta; \mu, d_f)$ are used for such a code with free (minimum) distance $d_f$. Suppose $\mathcal{C}$ is a convolutional code in $\mathbb{F}[z]^n$ of rank $k$. A generating matrix $G[z] \in \mathbb{F}[z]^{k \times n}$ of $\mathcal{C}$ having rank $k$ is called a generator or encoder matrix of $\mathcal{C}$. A matrix $H \in \mathbb{F}[z]^{n \times (n-k)}$ satisfying $\mathcal{C} = \ker H = \{v \in \mathbb{F}[z]^n : vH = 0\}$ is said to be a control matrix or check matrix of the code $\mathcal{C}$.
Convolutional codes can be catastrophic or non-catastrophic; see for example [3] for the basic definitions. A catastrophic convolutional code is prone to catastrophic error propagation and is not much use. A convolutional code described by a generator matrix with a right polynomial inverse is a non-catastrophic code; this is sufficient for our purposes. The designs given here for the generator matrices allow for specifying directly the control matrices and right polynomial inverses of the generator matrices.
By Rosenthal and Smarandache, [30], the maximum free distance attainable by an $(n, r, \delta)$ convolutional code is $(n - r)(\lfloor \delta/r \rfloor + 1) + \delta + 1$. The case $\delta = 0$, which is the case of zero memory, corresponds to the linear Singleton bound $(n - r + 1)$. The bound $(n - r)(\lfloor \delta/r \rfloor + 1) + \delta + 1$ is then called the generalised Singleton bound, [30], GSB, and a convolutional code attaining this bound is known as an MDS convolutional code. The papers [30] and [42] are major contributions to the area of convolutional codes.
In convolutional coding theory, the idea of dual code has two meanings. The one used here is what is referred to as the convolutional dual code, see [25] and [7], and is known also as the module-theoretic dual code. The other dual is called the sequence space dual code. The two generator matrices for these ‘duals’ are related by a specific formula. If $G[z]H^{T}[z] = 0$ for a generator matrix $G[z]$, then $H[z^{-1}]z^{m}$ for memory $m$ generates the convolutional (module-theoretic) dual code. The code is then dual-containing provided the code generated by $H[z^{-1}]z^{m}$ is contained in the code generated by $G[z]$. The papers [24,32] introduce certain algebraic decoding techniques for convolutional codes. Viterbi or sequential decoding are available for convolutional codes, see [1,2] or [3] and references therein.
The form of the control matrix derived here leads to algebraic implementable error-correcting algorithms. The MDS block linear codes to required rate R are extended to MDS convolutional codes with rate R and with distance of the order of twice the distance of the linear block MDS codes of the same length. These may also be specified to be (i) of characteristic a fixed prime p, (ii) of prime order, (iii) DC, or (iv) LCD. Noteworthy here is how MDS DC block linear codes lead to memory 1 convolutional MDS LCD codes and, in characteristic 2, how LCD MDS block linear codes lead to the design of DC memory 1 convolutional MDS codes. The DC codes may be designed with Euclidean inner product or with Hermitian inner product.
Previously. In [37] a general method for deriving MDS codes to specified rate and specified error-correcting capability is established; [9] gives a general method for designing DC MDS codes of arbitrary rate and error-correcting capability, from which MDS QECCs can be specified; [8] specifies a general method for designing LCD MDS codes to arbitrary requirements; and [36] gives general methods for designing convolutional codes. The unit-derived methods devised and amplified in [26–29,33–35] are often in the background.
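The DC and LCD conditions of Sect. 1.2 (for block codes, under the Euclidean inner product) can be verified directly on toy examples by enumerating the code and its dual. The sketch below assumes a prime field GF(p) and very small parameters; the function names and the two binary example codes are illustrative only and are not among the codes designed in this paper.

```python
from itertools import product


def span(generator, p):
    """All codewords of the code generated by `generator` over GF(p)."""
    r = len(generator)
    return {
        tuple(
            sum(m * g for m, g in zip(message, column)) % p
            for column in zip(*generator)
        )
        for message in product(range(p), repeat=r)
    }


def euclidean_dual(generator, p):
    """All vectors orthogonal to every generator row (Euclidean dual)."""
    n = len(generator[0])
    return {
        v
        for v in product(range(p), repeat=n)
        if all(sum(a * b for a, b in zip(v, row)) % p == 0 for row in generator)
    }


def is_dual_containing(generator, p):
    """DC: the dual is contained in the code."""
    return euclidean_dual(generator, p) <= span(generator, p)


def is_lcd(generator, p):
    """LCD: the code and its dual meet only in the zero vector."""
    zero = tuple([0] * len(generator[0]))
    return span(generator, p) & euclidean_dual(generator, p) == {zero}


# Binary even-weight code of length 4: its dual {0000, 1111} lies inside it.
EVEN_4 = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
# Binary even-weight code of length 3: its dual {000, 111} meets it only in 000.
EVEN_3 = [[1, 1, 0], [0, 1, 1]]

print(is_dual_containing(EVEN_4, 2), is_lcd(EVEN_4, 2))
print(is_dual_containing(EVEN_3, 2), is_lcd(EVEN_3, 2))
```

For the first code the CSS parameters quoted in Sect. 1.2 would read [[4, 2·3 − 4, ≥ 2]] = [[4, 2, ≥ 2]]. The enumeration costs p^n operations for the dual, so it serves only as a sanity check for small n.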
1.3 Abbreviations
DC: Dual-containing
LCD: Linear complementary dual
QECC: Quantum error-correcting code
MDS: Maximum distance separable (see footnote 2)
GSB: Generalised Singleton bound
rdist: Relative distance, which is the ratio of distance over length.
2 Summary of Design Methods

Section 5 describes the design algorithms in detail. Here is a summary of the main design methods.

2.1 Linear Block MDS
Design methods are given initially for:
(i) MDS linear block codes to rate $\geq R$ and distance $\geq (2t + 1)$ with efficient decoding algorithms.
(ii) MDS linear block codes to rate $\geq R$ and distance $\geq (2t + 1)$ with efficient decoding algorithms over fields of (fixed) characteristic $p$.
(iii) MDS linear block codes to rate $\geq R$ and distance $\geq (2t + 1)$ with efficient decoding algorithms over prime order fields.

Then particular types of codes are required. Thus design methods are obtained, as in (i)–(iii) above, where "MDS linear block" is replaced by "MDS linear block of type X", where type X is (a) DC or (b) LCD. These may be designed with Hermitian inner product when working over fields of type $GF(q^2)$. For Hermitian inner products, in item (iii), 'prime order fields' needs to be replaced by 'fields $GF(p^2)$, where $p$ is prime'. The DC codes designed are then used to design MDS QECCs.

Then infinite series of such block codes are designed so that the rate approaches $R$ and rdist approaches $(1 - R)$ for given $R$, $0 < R < 1$; for DC codes it is required that $1/2 < R < 1$. The infinite series of MDS QECCs designed from the DC codes have rate approaching $(2R - 1)$ and rdist approaching $(1 - R)$ for given $R$, $1 > R > 1/2$. Specifically:
– Design of infinite series of linear block codes $[n_i, r_i, d_i]$ such that $\lim_{i\to\infty} r_i/n_i = R$ and $\lim_{i\to\infty} d_i/n_i = 1 - R$.
– Design of infinite series of MDS block linear codes $[n_i, r_i, d_i]$ such that $\lim_{i\to\infty} r_i/n_i = R$ and $\lim_{i\to\infty} d_i/n_i = 1 - R$ in fields of characteristic $p$.
– Design of infinite series of MDS block linear codes $[n_i, r_i, d_i]$ such that $\lim_{i\to\infty} r_i/n_i = R$ and $\lim_{i\to\infty} d_i/n_i = 1 - R$ in prime order fields, or in fields $GF(p^2)$ for Hermitian inner product.

Further, a 'type' may be included in the infinite series designed above, where 'type' could be 'DC' or 'LCD'. For DC, it is necessary that $R > 1/2$, and then infinite series of QECCs $[[n_i, 2r_i - n_i, d_i]]$ are designed where $\lim_{i\to\infty} (2r_i - n_i)/n_i = 2R - 1$ (limit of rates) but still $\lim_{i\to\infty} d_i/n_i = 1 - R$, for given $R$, $R > 1/2$.

Footnote 2: This has different parameter requirements for linear block codes, convolutional codes and QECCs.

2.2 Convolutional MDS
Convolutional MDS codes and infinite series of convolutional MDS codes are designed. These are designed to specified rate and distance and in general have better relative distances. In memory 1 the distances obtained are of the order of twice that of the corresponding MDS linear block codes with the same length and rate. Higher memory convolutional codes are briefly discussed. Thus memory 1 MDS convolutional codes are designed as follows.
– Design of MDS convolutional codes of memory 1 with the same length and rate as the corresponding MDS linear block code but with twice the distance less 1.
– Design of MDS convolutional codes in characteristic $p$ of the same length and rate as the corresponding MDS block linear codes but with twice the distance less 1.
– Design of MDS convolutional codes over a prime field of the same length and rate as the corresponding MDS block linear codes but with twice the distance less 1.

These may be extended to higher memory MDS convolutional codes. These are then specified for particular types such as DC or LCD. The LCD linear block codes when 'extended' to convolutional codes give rise to DC codes, and the DC linear block codes when 'extended' to convolutional codes give rise to LCD codes in characteristic 2. Infinite series of convolutional codes are designed as follows.
– Design of infinite series of MDS convolutional codes $(n_i, r_i, n_i - r_i; 1, 2(n_i - r_i) + 1)$ such that $\lim_{i\to\infty} r_i/n_i = R$ and $\lim_{i\to\infty} d_i/n_i = 2(1 - R)$.
– Design of such infinite series over fields of (fixed) characteristic $p$.
– Design of such infinite series over fields of prime order.

Then such infinite series designs of convolutional codes are described for 'types' of codes such as DC or LCD, or for QECCs, which may be designed from DC codes.

2.3 Characteristic 2 and Prime Fields
The designs over the fields $GF(2^i)$ and over prime fields $GF(p) = \mathbb{Z}_p$ are of particular interest and have nice properties. Designs to specific requirements and of specific type, both linear block and convolutional, can be constructed over such fields. $GF(p)$, $p$ a prime, has an element of order $(p - 1)$ which is easily found, and arithmetic within $GF(p)$ is modular arithmetic.
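Because arithmetic in a prime field is ordinary modular arithmetic, a Fourier matrix over $GF(p)$ can be written down with a few lines of integer code. The sketch below is an illustration only: the helper names `element_of_order`, `fourier_matrix` and `mat_mul` are hypothetical, the element of order $n$ is found by brute-force search (adequate for small $p$), and `pow(w, -1, p)` assumes Python 3.8 or later. It builds the Fourier $10 \times 10$ matrix over $GF(11)$ and multiplies it by $n$ times its inverse.

```python
def element_of_order(n, p):
    """Return the smallest w in GF(p), p prime, whose multiplicative order is exactly n."""
    if (p - 1) % n != 0:
        raise ValueError("GF(p) contains an element of order n only if n divides p - 1")
    for w in range(1, p):
        if pow(w, n, p) == 1 and all(pow(w, k, p) != 1 for k in range(1, n)):
            return w
    raise ValueError("no element of order n found; p may not be prime")


def fourier_matrix(n, p, w):
    """Rows e_i = (1, w^i, w^(2i), ..., w^((n-1)i)) with arithmetic modulo p."""
    return [[pow(w, i * j, p) for j in range(n)] for i in range(n)]


def mat_mul(a, b, p):
    """Matrix product modulo p."""
    return [[sum(x * y for x, y in zip(row, col)) % p for col in zip(*b)] for row in a]


n, p = 10, 11
w = element_of_order(n, p)
F = fourier_matrix(n, p, w)
# n times the inverse of F is the Fourier matrix built from w^(-1).
n_times_inverse = fourier_matrix(n, p, pow(w, -1, p))
product = mat_mul(F, n_times_inverse, p)
print(w, product[0][0], product[0][1])
```

The product is $n$ times the identity modulo $p$, and column $j$ of `n_times_inverse` equals row $n - j$ of `F`, which is the relation between the rows of the Fourier matrix and the columns of its inverse used throughout the paper.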
2.4 Examples
Examples are given throughout. Although there is no restriction on length in general, examples explicitly written out here are limited in their size. Example 1 below is a prototype example of small order which has some hallmarks of the general designs; it is instructional and may be read now with little preparation. Example 4 is an instructional example on the design methods for linear block (MDS) codes to meet specific rates and distances. The general designs are more powerful and include designs for convolutional codes.
3 Constructions
The following constructions follow from [37]; see also [26,28].

Construction 31. Design MDS linear block codes.
– Let $F_n$ be a Fourier $n \times n$ matrix over a finite field.
– Taking $r$ rows of $F_n$ generates an $[n, r]$ code. A check matrix for the code is obtained by eliminating the corresponding columns of the inverse of $F_n$.
– Let $r$ rows of $F_n$ be chosen in arithmetic sequence such that the arithmetic difference $k$ satisfies $\gcd(k, n) = 1$. The code generated by these rows is an MDS $[n, r, n - r + 1]$ code. There exists an explicit efficient decoding algorithm of $O(\max\{n \log n, t^2\})$, where $t = \lfloor (n-r)/2 \rfloor$ is the error-correcting capability of the code.

In particular this is true when $k = 1$, that is, when the rows are taken in sequence.

General Vandermonde matrices may be used instead of Fourier matrices but are not necessary. For a given Fourier $n \times n$ matrix $F_n$ under consideration the rows of $F_n$ in order are denoted by $\{e_0, e_1, \ldots, e_{n-1}\}$ and $n$ times the columns of the inverse in order are denoted by $\{f_0, f_1, \ldots, f_{n-1}\}$. $F_n$ is generated by a primitive $n$th root of unity $\omega$; thus $\omega^n = 1$ and $\omega^r \neq 1$ for $0 < r < n$. Hence $e_i = (1, \omega^i, \omega^{2i}, \ldots, \omega^{i(n-1)})$. Indices are taken modulo $n$ so that $e_{i+n} = e_i$. The arithmetic sequences in Construction 31 may wrap around; for example when $n = 10$, such arithmetic sequences include $\{e_8, e_9, e_0, e_1\}$ and $\{e_3, e_6, e_9, e_2, e_5\}$. Note also that if $B$ is a check matrix then so also is $nB$ for any $n \neq 0$. Thus in Construction 31 the check matrix may be obtained from $n$ times the columns of the inverse.

Construction 32. Design DC MDS linear block codes. This construction follows from [9].
– Let $F_n$ be a Fourier $n \times n$ matrix with rows $\{e_0, \ldots, e_{n-1}\}$ in order, and let $n$ times the columns of the inverse in order be denoted by $\{f_0, \ldots, f_{n-1}\}$.
– Let $r > n/2$ and define $A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r-1} \end{pmatrix}$.
– A check matrix for the code generated by $A$ is $B = (f_r, f_{r+1}, \ldots, f_{n-1})$.
– Then $B^T = \begin{pmatrix} f_r^T \\ f_{r+1}^T \\ \vdots \\ f_{n-1}^T \end{pmatrix} = \begin{pmatrix} e_{n-r} \\ e_{n-r-1} \\ \vdots \\ e_1 \end{pmatrix}$. Hence the code generated by $A$ is a DC $[n, r, n - r + 1]$ MDS code.
– This works for any $r$ such that $n > r > n/2$.

Construction 33. Design LCD MDS linear block codes. The design technique is taken from [8].
– Construct a Fourier $n \times n$ matrix. Denote its rows in order by $\{e_0, \ldots, e_{n-1}\}$ and $n$ times the columns of the inverse in order by $\{f_0, \ldots, f_{n-1}\}$.
– Design $A$ as follows. $A$ consists of first row $e_0$ and rows $\{e_1, e_{n-1}, e_2, e_{n-2}, \ldots, e_r, e_{n-r}\}$ for $r \leq n/2$. ($A$ consists of $e_0$ and pairs $\{e_i, e_{n-i}\}$ starting with $\{e_1, e_{n-1}\}$.)
– Set $B^T = (f_{r+1}, f_{n-r-1}, \ldots, f_{(n-1)/2}, f_{(n+1)/2})$ when $n$ is odd and $B^T = (f_{r+1}, f_{n-r-1}, \ldots, f_{n/2-1}, f_{n/2+1}, f_{n/2})$ when $n$ is even.
– Then $AB^T = 0$ and $B$ generates the dual code $C^\perp$ of the code $C$ generated by $A$.
– Using $f_i^T = e_{n-i}$ it is easy to check that $C \cap C^\perp = 0$.
– The rows of $A$ are in sequence $\{n - r, n - r + 1, \ldots, n - 1, 0, 1, \ldots, r\}$ and so $A$ generates an MDS LCD linear block $[n, 2r + 1, n - 2r]$ code.

The general method of constructing MDS codes by choosing rows from Fourier matrices does not take into account the power of the other non-chosen rows. This can be remedied by going over to convolutional codes. The convolutional codes obtained carefully in this way are MDS convolutional codes with free distance of the order of twice the distance of the corresponding MDS linear block code with the same length and rate.

Lemma 1. Let $F$ be a Fourier $n \times n$ matrix with rows $\{e_0, e_1, \ldots, e_{n-1}\}$. Define
$A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r-1} \end{pmatrix}$, $B = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ e_r \\ \vdots \\ e_{n-1} \end{pmatrix}$,
where $r > n/2$ and the first $(2r - n)$ rows of $B$ consist of zeros. Let $P$ be a non-zero vector of length $r$. Then $\mathrm{wt}\, P(A + Bz) \geq 2(n - r) + 1$.

Proof. $PA$ has weight $\geq (n - r + 1)$ as $A$ generates an $[n, r, n - r + 1]$ code. $PB$ has weight $\geq r + 1$, except when the last $(n - r)$ entries of $P$ are zeros, as the non-zero rows of $B$ generate an $[n, n - r, r + 1]$ code. Now $(n - r + 1) + (r + 1) = n + 2 > 2(n - r) + 1$. When the last $(n - r)$ entries of $P$ are zeros, $PB = 0$ and $PA$ is a non-zero sum of $\{e_0, e_1, \ldots, e_{2r-n-1}\}$, which is a codeword of an $[n, 2r - n, n - (2r - n) + 1] = [n, 2r - n, 2(n - r) + 1]$ code, and so has weight $\geq 2(n - r) + 1$ as required.
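Lemma 1 can be checked by exhaustive search on a small case. The sketch below is an illustration under stated assumptions, not part of the paper: it takes $n = 6$, $r = 4$ over the prime field $GF(7)$ (so that arithmetic is modular), finds an element of order 6 by brute force, and computes the minimum of $\mathrm{wt}(PA) + \mathrm{wt}(PB)$ over all non-zero constant vectors $P$ of length $r$. The helper names are hypothetical.

```python
from itertools import product


def vec_mat(v, m, p):
    """Row vector times matrix over GF(p)."""
    return [sum(x * row[j] for x, row in zip(v, m)) % p for j in range(len(m[0]))]


def weight(v):
    """Hamming weight: number of non-zero entries."""
    return sum(1 for x in v if x != 0)


n, r, p = 6, 4, 7
# Brute-force search for an element of order n in GF(p) (needs n | p - 1).
w = next(x for x in range(2, p)
         if pow(x, n, p) == 1 and all(pow(x, k, p) != 1 for k in range(1, n)))
e = [[pow(w, i * j, p) for j in range(n)] for i in range(n)]  # rows of the Fourier matrix

A = e[:r]                                          # rows e_0, ..., e_{r-1}
B = [[0] * n for _ in range(2 * r - n)] + e[r:]    # (2r - n) zero rows, then e_r, ..., e_{n-1}

# For a constant vector P, the weight of P(A + Bz) is wt(PA) + wt(PB).
min_weight = min(
    weight(vec_mat(P, A, p)) + weight(vec_mat(P, B, p))
    for P in product(range(p), repeat=r)
    if any(P)
)
print(min_weight, 2 * (n - r) + 1)
```

Only constant $P$ are enumerated here; polynomial $P$ of positive degree are not covered by this search.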
In fact if $P$ is a polynomial of degree $t$ then $\mathrm{wt}\, P(A + Bz) \geq 2(n - r) + t + 1$, so weight increases with the degree of the multiplying polynomial vector. Lemma 1 may be generalised as follows.

Lemma 2. Let $F$ be a Fourier $n \times n$ matrix with rows $\{e_0, e_1, \ldots, e_{n-1}\}$. Let $A$ be chosen by taking $r$ rows, $r > n/2$, of the Fourier $n \times n$ matrix in arithmetic sequence with arithmetic difference $k$ satisfying $\gcd(k, n) = 1$. Let $B$ be the matrix with first $(2r - n)$ rows consisting of zeros and the last $(n - r)$ rows consisting of the rest of the rows of $F$ not in $A$; these last rows of $B$ are also in sequence with arithmetic difference $k$ satisfying $\gcd(k, n) = 1$. Let $P$ be any non-zero vector of length $r$. Then $\mathrm{wt}\, P(A + Bz) \geq 2(n - r) + 1$.

Before describing the general design methods, it is instructional to consider the following example. See also Example 4 below for a larger example demonstrating the design techniques for constructing a code to given rate and distance. When $\gcd(p, n) = 1$, OrderMod$(p, n)$ denotes the least positive power $s$ such that $p^s \equiv 1 \bmod n$.

Example 1. Consider $n = 7$. Now OrderMod$(2, 7) = 3$ so the Fourier $7 \times 7$ matrix may be constructed over $GF(2^3)$. The Fourier $7 \times 7$ matrix may also be constructed over $GF(3^6)$ as OrderMod$(3, 7) = 6$, and over many other fields whose characteristic does not divide 7. It may be formed over the prime field $GF(29)$ as OrderMod$(29, 7) = 1$; arithmetic in $GF(29)$ is then modular arithmetic. We'll stick to $GF(2^3)$ for the moment; when Hermitian inner product is required we'll move to $GF(2^6)$. The rows in order of a Fourier $7 \times 7$ matrix under consideration are denoted by $\{e_0, e_1, e_2, \ldots, e_6\}$ and (7 times) the columns of the inverse in order are denoted by $\{f_0, \ldots, f_6\}$; note $7 = 1$ in characteristic 2. Then $e_i f_j = \delta_{ij}$, $e_i^T = f_{7-i}$, $f_i^T = e_{7-i}$.

1. Construct $A = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \end{pmatrix}$. Then $A * (f_4, f_5, f_6) = 0$. The Euclidean dual matrix is $(f_4, f_5, f_6)^T = \begin{pmatrix} f_4^T \\ f_5^T \\ f_6^T \end{pmatrix} = \begin{pmatrix} e_3 \\ e_2 \\ e_1 \end{pmatrix}$. Thus the code generated by $A$ is a DC $[7, 4, 4]$ code.
2. To obtain a DC code relative to the Hermitian inner product work in $GF(2^6)$. Again the rows of the Fourier $7 \times 7$ matrix over $GF(2^6)$ are denoted by $\{e_0, \ldots, e_6\}$. Here $e_i^l = e_{il}$, as explained below, where $l = 2^3$, and since $2^3 \equiv 1 \bmod 7$ it follows that $e_i^l = e_{il} = e_i$. Thus the code generated by $A = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \end{pmatrix}$ is an Hermitian DC MDS $[7, 4, 4]$ code.
3. A DC MDS convolutional code over $GF(2^3)$ and a DC Hermitian MDS convolutional code over $GF(2^6)$ are obtained as follows. The distance obtained is of the order of twice the distance of the corresponding MDS linear block code. (In characteristic 2, $-1 = +1$.)
4. Now design $G[z] = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \end{pmatrix} + \begin{pmatrix} 0 \\ e_4 \\ e_5 \\ e_6 \end{pmatrix} z$. Then $G[z] * ((f_4, f_5, f_6) + (f_1, f_2, f_3)z) = 0$, so $H^T[z] = (f_4, f_5, f_6) + (f_1, f_2, f_3)z$. Then $H[z^{-1}] = \begin{pmatrix} e_3 \\ e_2 \\ e_1 \end{pmatrix} + \begin{pmatrix} e_6 \\ e_5 \\ e_4 \end{pmatrix} z^{-1}$. Thus a control matrix is $K[z] = \begin{pmatrix} e_3 \\ e_2 \\ e_1 \end{pmatrix} z + \begin{pmatrix} e_6 \\ e_5 \\ e_4 \end{pmatrix}$. It is easy to show that the convolutional code generated by $K[z]$ has trivial intersection with the convolutional code generated by $G[z]$. Thus the convolutional code generated by $G[z]$ is an LCD $(7, 4, 3; 1, d_f)$ code. The GSB for a code of this form is $3(\lfloor 3/4 \rfloor + 1) + 3 + 1 = 7$. The free distance of the one constructed may be shown to be 7 directly or from the general Lemma 1.
5. Starting with the MDS DC $[7, 4, 4]$ code, a corresponding convolutional code $(7, 4, 3; 1, 7)$ is designed of memory 1 which is LCD and has almost twice the distance of the DC linear block code.
6. Is it possible to go the other way? Methods for designing MDS LCD linear block codes are established in [8]. Following the method of [8], let $A = \begin{pmatrix} e_0 \\ e_1 \\ e_6 \\ e_2 \\ e_5 \end{pmatrix}$ and hence $A * (f_3, f_4) = 0$. Then $(f_3, f_4)^T = \begin{pmatrix} e_4 \\ e_3 \end{pmatrix}$, giving that the code generated by $A$ is an MDS $[7, 5, 3]$ LCD code. Then $G[z] = \begin{pmatrix} e_0 \\ e_1 \\ e_6 \\ e_2 \\ e_5 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 0 \\ e_3 \\ e_4 \end{pmatrix} z = A + Bz$, say, gives a convolutional $(7, 5, 2; 1, 5)$ MDS code. A control matrix is $H^T[z] = (f_3, f_4) - (f_2, f_5)z$, and $H[z^{-1}] = \begin{pmatrix} e_4 \\ e_3 \end{pmatrix} + \begin{pmatrix} e_5 \\ e_2 \end{pmatrix} z^{-1}$. Thus the dual code has generating matrix $\begin{pmatrix} e_4 \\ e_3 \end{pmatrix} z + \begin{pmatrix} e_5 \\ e_2 \end{pmatrix}$. Now it is necessary to show that the code generated by $G[z]$ is DC. Note
$\begin{pmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix} * \left\{ \begin{pmatrix} e_0 \\ e_1 \\ e_6 \\ e_2 \\ e_5 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ 0 \\ e_3 \\ e_4 \end{pmatrix} z \right\} = \begin{pmatrix} e_5 \\ e_2 \end{pmatrix} + \begin{pmatrix} e_4 \\ e_3 \end{pmatrix} z$.
Hence the code generated by $G[z]$ is DC over $GF(2^3)$ and is Hermitian DC over $GF(2^6)$.
7. Construct $G[z] = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \end{pmatrix} + \begin{pmatrix} 0 \\ e_3 \\ e_4 \end{pmatrix} z + \begin{pmatrix} e_5 \\ 0 \\ e_6 \end{pmatrix} z^2$. Then $G[z]$ generates a convolutional code of type $(7, 3, 5; 2)$; the degree is 5. The GSB for such a code is $(7 - 3)(\lfloor 5/3 \rfloor + 1) + 5 + 1 = 4 \cdot 2 + 5 + 1 = 14$. In fact the free distance of this code is actually 14. This may be shown in an analogous way to the proof of Lemma 1. Now $G[z] * ((f_3, f_4, f_5, f_6) - (f_1, f_2, 0, 0)z - (0, 0, f_0, f_2)z^2) = 0$ and $G[z] * (f_0, f_1, f_2) = 7I_3$. The result is that the code generated by $G[z]$ is a non-catastrophic convolutional MDS $(7, 3, 5; 2, 14)$ code. Note the free distance attained is $5 \cdot 3 - 1$ where 5 is the free distance of a $[7, 3, 5]$ MDS code; the distance is tripled less 1. This is a general principle for the more general cases: the free distance is of order three times the distance of the MDS code of the same length and dimension. It is neither a dual-containing code nor an LCD code. To get such codes requires a compromise on the distance. Take $G[z] = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \end{pmatrix} + \begin{pmatrix} 0 \\ e_3 \\ e_4 \end{pmatrix} z + \begin{pmatrix} 0 \\ e_5 \\ e_6 \end{pmatrix} z^2$. This gives a $(7, 3, 4; 2)$ convolutional code which turns out to be an LCD code, but the free distance is only 7. The GSB for such a code is 13.
8. Now for memory 3 define $G[z] = \begin{pmatrix} e_0 \\ e_1 \end{pmatrix} + \begin{pmatrix} e_2 \\ e_3 \end{pmatrix} z + \begin{pmatrix} e_4 \\ e_5 \end{pmatrix} z^2 + \begin{pmatrix} 0 \\ e_6 \end{pmatrix} z^3$. The GSB for such a code is $(7 - 2)(\lfloor 5/2 \rfloor + 1) + 5 + 1 = 21$. The free distance of the code is actually 21, so the code is a $(7, 2, 5; 3, 21)$ convolutional MDS code. This is $6 \cdot 4 - 3$ where 6 is the free distance of the corresponding block linear MDS code $[7, 2, 6]$.
9. Ultimately get a convolutional code $e_0 + e_1 z + e_2 z^2 + e_3 z^3 + e_4 z^4 + e_5 z^5 + e_6 z^6$, which is the convolutional MDS $(7, 1, 6; 6, 49)$ code and is a repetition convolutional code.
4 Specify the Codes

4.1 Matrices to Work and Control
Many of the designs hold using general Vandermonde matrices but the Fourier matrix case is considered for clarity. If the Fourier $n \times n$ matrix $F_n$ exists over a finite field then the characteristic $p$ of the field does not divide $n$, which happens if and only if $\gcd(p, n) = 1$. Let $F_n$ denote a Fourier matrix of size $n$. Over which finite fields precisely may this matrix be constructed? Suppose $\gcd(p, n) = 1$. Then $p^{\varphi(n)} \equiv 1 \bmod n$, where $\varphi$ is the Euler $\varphi$-function. Hence there exists a least positive power $s$ such that $p^s \equiv 1 \bmod n$; this $s$ is called the order of $p$ modulo $n$. Use OrderMod$(p, n)$ to denote the order of $p$ mod $n$.

Lemma 3. Let $p$ be any prime such that $p \nmid n$ and let $s = $ OrderMod$(p, n)$.
(i) There exists an element of order $n$ in $GF(p^s)$ from which the Fourier $n \times n$ matrix may be constructed over $GF(p^s)$.
(ii) The Fourier $n \times n$ matrix cannot exist over a finite field of characteristic $p$ of order smaller than $GF(p^s)$.
(iii) There exists a Fourier $n \times n$ matrix over any $GF(p^{rs})$, $r \geq 1$, and in particular over $GF(p^{2s})$.

Proof. (i) There exists an element $\omega$ of order $p^s - 1$ in $GF(p^s)$, so that $\omega^{p^s - 1} = 1$. Now $(p^s - 1) = qn$ for some $q$ and so $(\omega^q)^n = 1$ in $GF(p^s)$, giving an element $\omega^q$ of order $n$ in $GF(p^s)$. This element may then be used to construct the Fourier $n \times n$ matrix over $GF(p^s)$. Proofs of (ii) and (iii) are omitted.

For a vector $v = (v_1, v_2, \ldots, v_r)$ define $v^l = (v_1^l, v_2^l, \ldots, v_r^l)$. The following lists some properties of a Fourier matrix of size $n$ over a finite field. These are used throughout.
1. Let $F_n$ be a Fourier $n \times n$ matrix over a field, generated by $\omega$, where $\omega^n = 1$ and $\omega^r \neq 1$ for $0 < r < n$.
2. Denote the rows of $F_n$ by $\{e_0, e_1, \ldots, e_{n-1}\}$ in order and $n$ times the columns of the inverse of $F_n$ in order by $\{f_0, f_1, \ldots, f_{n-1}\}$.
3. Then $e_i f_j = \delta_{ij}$, $e_i^T = f_{n-i}$, $f_i^T = e_{n-i}$.
4. The rows of $F_n$ are given by $e_i = (1, \omega^i, \omega^{2i}, \ldots, \omega^{(n-1)i})$. Indices are to be taken modulo $n$.
5. $e_i^l = (1, \omega^{il}, \omega^{2il}, \ldots, \omega^{(n-1)il}) = e_{il}$.
6. Note that if $l \equiv 1 \bmod n$ then $e_i^l = e_{il} = e_i$.

Within $GF(p^{2s})$ the Hermitian inner product is defined by $\langle u, v \rangle_H = \langle u, v^l \rangle_E$ where $l = p^s$. In this setup $\langle e_i, e_j \rangle_H = \langle e_i, e_j^l \rangle_E = \langle e_i, e_{jl} \rangle_E$. This facilitates the construction of Hermitian inner product codes over $GF(2^{2s})$.

Example 2. Consider length $n = 10$. Now OrderMod$(3, 10) = 4$. Construct the Fourier $10 \times 10$ matrix $F_{10}$ over $GF(3^4)$. Denote the rows in order of $F_{10}$ by $\{e_0, e_1, \ldots, e_9\}$ and 10 times the columns of the inverse of $F_{10}$ in order by $\{f_0, f_1, \ldots, f_9\}$. Then $e_i f_j = \delta_{ij}$. Also $e_i^T = f_{10-i}$, $f_i^T = e_{10-i}$.

1. As in [28] construct $A = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \end{pmatrix}$. Then $A * (f_6, f_7, f_8, f_9) = 0$. Now by Construction 31, [37], the code generated by $A$ is an MDS $[10, 6, 5]$ code.
2. The Euclidean dual of $A$ is $(f_6, f_7, f_8, f_9)^T = \begin{pmatrix} f_6^T \\ f_7^T \\ f_8^T \\ f_9^T \end{pmatrix} = \begin{pmatrix} e_4 \\ e_3 \\ e_2 \\ e_1 \end{pmatrix}$. Thus $A$ generates a DC code.
3. Now $e_i^{3^2} = e_{9i} = e_{-i} = e_{10-i}$. Thus $A$ does not generate a DC code under the Hermitian inner product induced in $GF(3^4)$. Consider $GF(3^8)$. This has an element of order 10, as $3^8 - 1 = (3^4 - 1)(3^4 + 1)$, and so the Fourier $10 \times 10$ matrix may be constructed over $GF(3^8)$. Here the Hermitian inner product is $\langle u, v \rangle_H = \langle u, v^{3^4} \rangle_E$, where the suffix $E$ denotes the Euclidean inner product. Now $e_i^l = e_{il} = e_{i \cdot 3^4} = e_i$, as $3^4 \equiv 1 \bmod 10$. Thus $\langle e_i, e_j \rangle_H = \langle e_i, e_j \rangle_E$. Hence the code generated by $A$, constructed in $GF(3^8)$, is a DC code under the Hermitian inner product.
4. Also OrderMod$(7, 10) = 4$, so the above works over $GF(7^4)$, and over $GF(7^8)$ when seeking Hermitian DC codes.
5. Better though is the following. OrderMod$(11, 10) = 1$ and so the prime field $GF(11)$ may be considered. $A$ as above then generates a DC code over the prime field $GF(11)$ and a DC Hermitian code when considered over $GF(11^2)$.
6. What is an element of order 10 in $GF(11)$? In fact $\omega = (2 \bmod 11)$ works. In $GF(11)$ the arithmetic is modular arithmetic. An element of order 10 is required in $GF(11^2)$. Now $GF(11^2)$ is constructed by finding an irreducible polynomial of degree 2 over $\mathbb{Z}_{11}$. A primitive element is easily found.
7. Let $A$ be as above and $B = \begin{pmatrix} 0 \\ 0 \\ e_6 \\ e_7 \\ e_8 \\ e_9 \end{pmatrix}$. Define $G[z] = A + Bz$. The code $C$ generated by $G[z]$ is a $(10, 6, 4; 1, d_f)$ convolutional code.
8. Then $G[z] * \{(f_6, f_7, f_8, f_9) - (f_2, f_3, f_4, f_5)z\} = 0$ and $G[z] * (f_0, f_1, f_2, f_3, f_4, f_5) = 10 I_6$. Thus $G[z]$ is a non-catastrophic generator for the code.
9. The GSB of such a code is $(10 - 6)(\lfloor 4/6 \rfloor + 1) + 4 + 1 = 4 + 4 + 1 = 9$. The free distance of $C$ may be shown, using Lemma 1 essentially, to be 9, and so $C$ is an MDS convolutional $(10, 6, 4; 1, 9)$ code. Since OrderMod$(11, 10) = 1$ the calculations may be done over the prime field $GF(11)$, and over $GF(11^2)$ when Hermitian codes are required.

Example 3. Consider $n = 2^5 - 1 = 31$. Then OrderMod$(2, 31) = 5$ and so the Fourier $31 \times 31$ matrix may be constructed over $GF(2^5)$ but also over $GF(2^{10})$. Let $r = 17$, let $A$ consist of rows $e_0, e_1, \ldots, e_{16}$ and let $B = (f_{17}, f_{18}, \ldots, f_{30})$. Then $A$ generates a $[31, 17, 15]$ MDS code. Now $B^T$ consists of rows $\{e_{14}, e_{13}, \ldots, e_1\}$, using $f_i^T = e_{31-i}$. Thus the code generated by $A$ is a DC MDS $[31, 17, 15]$ code, Euclidean over $GF(2^5)$ and Hermitian over $GF(2^{10})$.

Example 4. Example of a general technique. It is required to construct a code of rate $\geq 7/8$ which can correct 25 errors; thus a distance $\geq 51$ is required.

LINEAR BLOCK:
1. An $[n, r, n - r + 1]$ type code with $r/n \geq 7/8 = R$ and $(n - r + 1) \geq 51$ is required. Thus $n - r \geq 50$, giving $n(1 - R) \geq 50$, and so it is required that $n \geq 400$.
2. Construct the Fourier matrix $F_{400}$ of size $400 \times 400$ over some suitable field, to be determined. The rows are denoted by $\{e_0, \ldots, e_{399}\}$ and the columns of 400 times the inverse by $\{f_0, \ldots, f_{399}\}$.
3. Define $A$ to be the matrix with rows $\{e_0, \ldots, e_{349}\}$. Then by [9] $A$ generates a DC $[400, 350, 51]$ MDS code. By the CSS construction a $[[400, 300, 51]]$ MDS QECC is designed.
– Over which fields can $F_{400}$ be defined? The characteristic must not divide 400, but otherwise the fields can be determined by finding OrderMod$(p, n)$ where $\gcd(p, n) = 1$.
– Now OrderMod$(3, 400) = 20$, OrderMod$(7, 400) = 4$, OrderMod$(401, 400) = 1$, so it may be constructed over $GF(3^{20})$, $GF(7^4)$, $GF(401)$ and others.
– Now $GF(401)$ is a prime field and arithmetic therein is modular arithmetic; it is in fact the smallest field over which the Fourier $400 \times 400$ matrix can be constructed.
– An Hermitian dual-containing code may be obtained by working over $GF(p^{2l})$ when there exists an element of the required order in $GF(p^l)$. Just define $A$ as above from a Fourier $400 \times 400$ matrix over, say, $GF(401^2)$ using a 400th root of unity in $GF(401^2)$.
– The $e_i$ in this case satisfy $e_i^{401} = e_{i \cdot 401} = e_i$ as $401 \equiv 1 \bmod 400$. Thus the code obtained is an Hermitian dual-containing $[400, 350, 51]$ code from which a QECC $[[400, 300, 51]]$ MDS code is designed.

By taking 'only' $\{e_0, \ldots, e_{349}\}$ the full power of all the rows of the Fourier matrix is not utilised.
1. Define $A$ as before and $B$ to be the matrix whose last 50 rows are $\{e_{350}, \ldots, e_{399}\}$ and whose first 300 rows are zero vectors.
2. Define $G[z] = A + Bz$. The convolutional code generated by $G[z]$ is a $(400, 350, 50; 1)$ code. It may be shown to be non-catastrophic by writing down the right inverse of $G[z]$.
3. The GSB for such a code is $(400 - 350)(\lfloor 50/350 \rfloor + 1) + 50 + 1 = 101$. Using Lemma 1 it may be shown that the free distance of the code generated by $G[z]$ is 101, so it is an MDS convolutional $(400, 350, 50; 1, 101)$ code. The distance is twice, less 1, the distance of an MDS $[400, 350, 51]$ linear block code.

To get an LCD linear block code of rate $\geq 7/8$ and $d \geq 51$ it is necessary to go to length 401 or higher. Use the methods of [8].
– For length 401, let $A$ be the matrix with rows $\{e_0, e_1, e_{400}, e_2, e_{399}, \ldots, e_{175}, e_{226}\}$. (The selection includes pairs $e_i, e_{401-i}$.)
– Then the code generated by $A$ is an LCD $[401, 351, 51]$ MDS code.
– The fields required for $n = 401$ are fairly large. Go to $511 = 2^9 - 1$, as here OrderMod$(2, 511) = 9$, so the field $GF(2^9)$ works and has characteristic 2.
– Require $r \geq 511 \cdot 7/8$ for a rate $\geq 7/8$. Thus require $r \geq 448$. Take $r = 449$ for reasons which will appear later.
– Let $F_{511}$ be the Fourier $511 \times 511$ matrix over $GF(2^9)$, or for the Hermitian case over $GF(2^{18})$.
– Let $A$ be the matrix with rows $\{e_0, e_1, \ldots, e_{448}\}$.
– Then $A$ generates a $[511, 449, 63]$ MDS DC code, and a DC Hermitian code over $GF(2^{18})$.
– From this QECC MDS codes $[[511, 387, 63]]$ are designed and are Hermitian over $GF(2^{18})$.

CONVOLUTIONAL:
1. Let $B$ be the matrix of size $449 \times 511$ with last 62 rows consisting of $\{e_{449}, \ldots, e_{510}\}$ and other rows consisting of zero vectors.
2. Define $G[z] = A + Bz$. Then the code generated by $G[z]$ is a non-catastrophic MDS $(511, 449, 62; 1, 125)$ convolutional code. The proof of the distance follows the lines of Lemma 1. It has twice the distance, less 1, of the corresponding MDS block linear $[511, 449, 63]$ code.

Now design LCD codes in $GF(2^9)$.
1. Let $A$ be the matrix with rows $\{e_0, e_1, e_{510}, e_2, e_{509}, \ldots, e_{224}, e_{287}\}$. Notice the rows are in sequence and so $A$ generates a $[511, 449, 63]$ linear block code.
2. The check matrix is $H^T = (f_{286}, f_{225}, f_{285}, f_{226}, \ldots, f_{256}, f_{255})$.
3. Then $H$ consists of rows $\{e_{225}, e_{286}, \ldots, e_{255}, e_{256}\}$. Thus the code generated by $A$ has trivial intersection with the code generated by $H$ and so the code is an MDS LCD block linear $[511, 449, 63]$ code. This is an Hermitian LCD MDS code over $GF(2^{18})$.
4. Let $B$ be the matrix whose last 62 rows are $\{e_{286}, e_{225}, e_{285}, e_{226}, \ldots, e_{256}, e_{255}\}$ and whose first 387 rows consist of zero vectors.
5. Define $G[z] = A + Bz$. A check matrix for the code generated by $G[z]$ is $(f_{286}, f_{225}, f_{285}, f_{226}, \ldots, f_{256}, f_{255}) - (0, 0, \ldots, 0, f_{254}, f_{257}, \ldots, f_{224}, f_{287})z = C + Dz$, say.
6. Recall we are in characteristic 2. Now $C^T + D^T z^{-1} = \begin{pmatrix} e_{225} \\ e_{286} \\ e_{226} \\ e_{285} \\ \vdots \\ e_{255} \\ e_{256} \end{pmatrix} + \begin{pmatrix} 0 \\ \vdots \\ 0 \\ e_{257} \\ e_{254} \\ \vdots \\ e_{287} \\ e_{224} \end{pmatrix} z^{-1}$.
7. Thus the dual matrix is $\begin{pmatrix} 0 \\ \vdots \\ 0 \\ e_{257} \\ e_{254} \\ \vdots \\ e_{287} \\ e_{224} \end{pmatrix} + \begin{pmatrix} e_{225} \\ e_{286} \\ e_{226} \\ e_{285} \\ \vdots \\ e_{255} \\ e_{256} \end{pmatrix} z = E + Fz$, say.
8. It is relatively easy to show that there is a matrix K such that K(A + Bz) = E + F z and so the code generated by A + Bz is a DC convolutional (511, 449, 62; 1) code. 9. The GSB for such a code is 125 and this is the distance attained, so the code is a DC MDS convolutional (511, 449, 62; 1, 125) code. From this a quantum convolutional code may be designed. To obtain Hermitian DC, work in GF (218 ). This can be extended to higher degrees. Thus in a sense: DC block linear −→ LCD convolutional degree 1 at twice the distance. LCD block linear −→ DC convolutional degree 1 at twice the distance. The LCD block linear to give DC convolutional requires characteristic 2.
5 Algorithms
Algorithm 51. Construct block linear codes of rate $\geq R$ and distance $\geq (2t + 1)$ for $0 < R < 1$ and positive integer $t$, with efficient decoding algorithm.
An $[n, r, n - r + 1]$ linear block code will be designed. Thus $r/n \geq R$ and $(n - r + 1) \geq 2t + 1$. This requires $n - r \geq 2t$, $n(1 - R) \geq 2t$, and so $n \geq \frac{2t}{1 - R}$ is required.
1. Choose $n \geq \frac{2t}{1 - R}$ and construct the Fourier $n \times n$ matrix $F_n$ over a suitable field. Now choose $r \geq nR$.
2. Select any $r$ rows of $F_n$ in arithmetic sequence with arithmetic difference $k$ satisfying $\gcd(k, n) = 1$ and form the matrix $A$ consisting of these rows.
3. The block linear code with generator matrix $A$ is an MDS $[n, r, n - r + 1]$ code.
4. The rate is $r/n \geq R$, and the distance $d = n - r + 1 = n(1 - R) + 1 \geq 2t + 1$ as required.
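The parameter selection in step 1 of Algorithm 51 can be sketched with exact rational arithmetic. This is an illustration only: the function name `block_mds_parameters` is hypothetical, and picking the smallest admissible $n$ and then the smallest $r \geq nR$ is one possible choice among the many the algorithm allows. With this choice $n - r \geq 2t$ holds automatically.

```python
from fractions import Fraction
from math import ceil


def block_mds_parameters(rate, t):
    """Smallest n with n(1 - R) >= 2t, then smallest r with r >= nR; returns (n, r, d)."""
    R = Fraction(rate)
    if not (0 < R < 1) or t < 1:
        raise ValueError("need 0 < R < 1 and t >= 1")
    n = ceil(Fraction(2 * t) / (1 - R))
    r = ceil(n * R)
    return n, r, n - r + 1


# Rate >= 7/8 and 25 correctable errors, as in Example 4.
print(block_mds_parameters(Fraction(7, 8), 25))  # prints (400, 350, 51)
```

Using `Fraction` rather than floating point avoids rounding errors in the ceilings; the field over which $F_n$ is built still has to be chosen separately.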
Algorithm 52. Construct block linear codes of rate ≥ R and distance ≥ (2t+1) for given rational R, 0 < R < 1, and positive integer t, with efficient decoding algorithms, over fields of characteristic p.
An $[n, r, n - r + 1]$ code is designed in characteristic $p$. As in Algorithm 51 it is required that $n \geq \frac{2t}{1 - R}$. Require in addition that $\gcd(p, n) = 1$.
1. Require $n \geq \frac{2t}{1 - R}$ and $\gcd(p, n) = 1$.
2. Construct the Fourier $n \times n$ matrix over a field of characteristic $p$.
3. Proceed as in items 1–4 of Algorithm 51.
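For a fixed characteristic $p$, the two extra ingredients are a length $n$ coprime to $p$ and the exponent $s = $ OrderMod$(p, n)$ giving the smallest field $GF(p^s)$ over which the Fourier $n \times n$ matrix exists. A minimal sketch follows; the function names `order_mod` and `length_for_characteristic` are hypothetical, and taking the first admissible length is only one possible choice (Example 4 deliberately moves from 401 to 511 to get a smaller field).

```python
from math import gcd


def order_mod(p, n):
    """Least s >= 1 with p^s congruent to 1 modulo n (requires n >= 2 and gcd(p, n) = 1)."""
    if n < 2 or gcd(p, n) != 1:
        raise ValueError("need n >= 2 and gcd(p, n) = 1")
    s, power = 1, p % n
    while power != 1:
        power = (power * p) % n
        s += 1
    return s


def length_for_characteristic(p, minimum_n):
    """Smallest n >= minimum_n with gcd(p, n) = 1."""
    n = minimum_n
    while gcd(p, n) != 1:
        n += 1
    return n


print(order_mod(2, 511), order_mod(401, 400))  # prints 9 1
print(length_for_characteristic(2, 400))       # prints 401
```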
Algorithm 53. Construct block linear DC codes of rate $\geq R$ with $1/2 < R < 1$ and distance $\geq (2t + 1)$ for positive integer $t$, with efficient decoding algorithm.
An $[n, r, n - r + 1]$ block dual-containing code is required. As before require $n \geq \frac{2t}{1 - R}$.
1. Choose $n \geq \frac{2t}{1 - R}$ and construct the Fourier $n \times n$ matrix $F_n$ over a suitable field.
2. Choose $r \geq nR$. As $R > 1/2$ then $r > n/2$.
3. Select $A$ to be the first $r$ rows of $F_n$, that is, $A$ consists of rows $\{e_0, e_1, \ldots, e_{r-1}\}$. Then $B^T = (f_r, \ldots, f_{n-1})$ satisfies $AB^T = 0$. Now $B = \begin{pmatrix} e_{n-r} \\ e_{n-r-1} \\ \vdots \\ e_1 \end{pmatrix}$. Thus the code generated by $A$ is a dual-containing MDS $[n, r, n - r + 1]$ code.
4. The rate is $r/n \geq R$, and the distance $d = n - r + 1 = n(1 - R) + 1 \geq (2t + 1)$ as required.

Algorithm 54. Design block linear DC codes of rate $\geq R > 1/2$ and distance $\geq (2t + 1)$ for given $R$, $t$, with efficient decoding algorithm, over fields of characteristic $p$.
An $[n, r, n - r + 1]$ block dual-containing code is required. As before require $n \geq \frac{2t}{1 - R}$.
1. Choose $n \geq \frac{2t}{1 - R}$ and also such that $\gcd(p, n) = 1$.
2. Construct the Fourier $n \times n$ matrix $F_n$ over a field of characteristic $p$.
3. Now proceed as in items 1–4 of Algorithm 53.
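The dual-containing property in step 3 of Algorithm 53 can be checked numerically over a prime field. The sketch below is illustrative: it uses $n = 10$, $r = 6$ over $GF(11)$ with $\omega = 2$ (the element of order 10 quoted in Example 2), and the variable names are hypothetical. It confirms that the rows $e_{n-r}, \ldots, e_1$ are orthogonal to every row of $A$ and are themselves rows of $A$.

```python
n, p, w, r = 10, 11, 2, 6  # w = 2 has order 10 in GF(11), as stated in Example 2

# Rows e_i = (1, w^i, ..., w^((n-1)i)) of the Fourier matrix over GF(p).
e = [[pow(w, i * j, p) for j in range(n)] for i in range(n)]

A = e[:r]  # generator matrix: rows e_0, ..., e_{r-1}
# f_j^T = e_{n-j}, so the transposes of f_r, ..., f_{n-1} are e_{n-r}, ..., e_1.
dual_rows = [e[n - j] for j in range(r, n)]

# Every dual row has zero Euclidean inner product with every row of A.
orthogonal = all(
    sum(x * y for x, y in zip(a, h)) % p == 0 for a in A for h in dual_rows
)
# Every dual row is itself a row of A, so the dual code lies inside the code.
contained = all(h in A for h in dual_rows)
print(orthogonal, contained)  # prints True True
```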
Algorithm 55. (i) Design MDS QECCs of form $[[n, 2r - n, n - r + 1]]$. (ii) Design MDS QECCs of form $[[n, 2r - n, n - r + 1]]$ over a field of characteristic $p$.
Method:
(i) By Algorithm 53 construct MDS DC codes $[n, r, n - r + 1]$. Then by the CSS construction, construct the MDS $[[n, 2r - n, n - r + 1]]$ QECC.
(ii) By Algorithm 54 construct MDS DC codes $[n, r, n - r + 1]$ over a field of characteristic $p$. Then by the CSS construction, construct the MDS $[[n, 2r - n, n - r + 1]]$ QECC in a field of characteristic $p$.
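The parameter bookkeeping of Algorithm 55 is simple enough to state as code. The following sketch only computes the QECC parameters that the CSS construction yields from a DC MDS $[n, r, n - r + 1]$ code; it does not construct the quantum code itself, and the function name `css_qecc_parameters` is hypothetical.

```python
def css_qecc_parameters(n, r):
    """Parameters [[n, 2r - n, n - r + 1]] obtained from a DC MDS [n, r, n - r + 1] code."""
    if not (n < 2 * r < 2 * n):
        raise ValueError("a dual-containing code needs n/2 < r < n")
    return n, 2 * r - n, n - r + 1


print(css_qecc_parameters(400, 350))  # prints (400, 300, 51)
print(css_qecc_parameters(511, 449))  # prints (511, 387, 63)
```

These reproduce the $[[400, 300, 51]]$ and $[[511, 387, 63]]$ codes of Example 4.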
Algorithm 56. Design LCD MDS codes of rate $\geq R$ and distance $\geq 2t + 1$. This design follows from [8].
1. Choose $n \geq \frac{2t}{1 - R}$ and $r \geq nR$.
2. For such $n$ let $F_n$ be a Fourier $n \times n$ matrix with rows $\{e_0, \ldots, e_{n-1}\}$ in order, and let $n$ times the columns of the inverse in order be denoted by $\{f_0, \ldots, f_{n-1}\}$.
3. For $2r + 1 \geq nR$ and $r \leq n/2$ define $A$ as follows. $A$ consists of row $e_0$ and rows $\{e_1, e_{n-1}, e_2, e_{n-2}, \ldots, e_r, e_{n-r}\}$. ($A$ consists of $e_0$ and pairs $\{e_i, e_{n-i}\}$ starting with $e_1, e_{n-1}$.)
4. Set $B^T = (f_{r+1}, f_{n-r-1}, \ldots, f_{(n-1)/2}, f_{(n+1)/2})$ when $n$ is odd and $B^T = (f_{r+1}, f_{n-r-1}, \ldots, f_{n/2-1}, f_{n/2+1}, f_{n/2})$ when $n$ is even.
5. Then $AB^T = 0$ and $B$ generates the dual code $C^\perp$ of the code $C$ generated by $A$.
6. Using $f_i^T = e_{n-i}$ it is easy to check that $C \cap C^\perp = 0$.
7. Now the rows of $A$ are in sequence $\{n - r, n - r + 1, \ldots, n - 1, 0, 1, \ldots, r\}$ and so $A$ generates an MDS, LCD, $[n, 2r + 1, n - 2r]$ code.
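The LCD property in step 6 can be tested numerically with Massey's criterion, a standard fact that is not stated in the paper: a linear code with generator matrix $A$ is LCD if and only if $AA^T$ is non-singular. The sketch below is illustrative and rests on these assumptions: it works over the prime field $GF(29)$ with $n = 7$ (OrderMod$(29, 7) = 1$), uses a hypothetical helper `rank_mod_p`, and compares the LCD selection $\{e_0, e_1, e_6, e_2, e_5\}$ with the DC selection $\{e_0, e_1, e_2, e_3\}$ of Example 1.

```python
def rank_mod_p(matrix, p):
    """Rank of an integer matrix over GF(p), by Gauss-Jordan elimination."""
    m = [[x % p for x in row] for row in matrix]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(rank, len(m)) if m[i][col]), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        inv = pow(m[rank][col], -1, p)  # modular inverse, Python 3.8+
        m[rank] = [(x * inv) % p for x in m[rank]]
        for i in range(len(m)):
            if i != rank and m[i][col]:
                factor = m[i][col]
                m[i] = [(x - factor * y) % p for x, y in zip(m[i], m[rank])]
        rank += 1
    return rank


def gram(rows, p):
    """The matrix A * A^T over GF(p)."""
    return [[sum(x * y for x, y in zip(a, b)) % p for b in rows] for a in rows]


n, p = 7, 29
# n is prime, so any x != 1 with x^n = 1 has order exactly n.
w = next(x for x in range(2, p) if pow(x, n, p) == 1)
e = [[pow(w, i * j, p) for j in range(n)] for i in range(n)]

lcd_rows = [e[0], e[1], e[6], e[2], e[5]]  # e_0 and the pairs {e_i, e_{n-i}}
dc_rows = e[:4]                            # e_0, ..., e_3

print(rank_mod_p(gram(lcd_rows, p), p))  # full rank 5: A * A^T non-singular, code is LCD
print(rank_mod_p(gram(dc_rows, p), p))   # rank 1: A * A^T singular, code is not LCD
```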
Algorithm 57. Construct MDS and MDS DC codes and MDS QECCs, all over prime fields. Construct Hermitian such codes over $GF(p^2)$ for a prime $p$.
– Let $GF(p) = \mathbb{Z}_p$ be a prime field. This has a primitive element of order $(p - 1) = n$, say.
– Construct the Fourier $n \times n$ matrix over $GF(p)$.
– Choose $r > \frac{p-1}{2}$.
– Construct $A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r-1} \end{pmatrix}$.
– $H^T = (f_r, f_{r+1}, \ldots, f_{n-1})$ is a check matrix.
– $H$ consists of rows $\{e_{n-r}, e_{n-r-1}, \ldots, e_1\}$ and so the code generated by $A$ is a DC $[n, r, n - r + 1]$ MDS code in $GF(p)$.
– Over $GF(p^2)$ with $\langle u, v \rangle_H = \langle u, v^p \rangle_E$ (and different $e_i$) the code generated by $A$ is a Hermitian DC $[n, r, n - r + 1]$ code.
– Construct $\begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r-1} \end{pmatrix} + \begin{pmatrix} 0 \\ \vdots \\ 0 \\ e_r \\ \vdots \\ e_{n-1} \end{pmatrix} z$ where the second matrix has $(2r - n)$ initial zero rows. This gives a convolutional $(n, r, n - r; 1, 2(n - r) + 1)$ code over $GF(p)$. This code is not dual-containing even over $GF(p^2)$.

Algorithm 58. Design infinite series of DC codes $[n_i, r_i, d_i]$, (i) with rates and rdists satisfying $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{2}$, (ii) with rates and rdists satisfying $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{s}{q}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = 1 - \frac{s}{q}$.
1: Consider a series $n_1 < n_2 < n_3 < \ldots$ and let $p_i$ be a prime such that $p_i \nmid n_i$. Then $\mathrm{OrderMod}(p_i, n_i) = s_i$, for some $s_i$.
2: Construct the Fourier $n_i \times n_i$ matrix over $GF(p_i^{s_i})$. Let $r_i = \lfloor \frac{n_i}{2} \rfloor + 1$. Then $A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r_i-1} \end{pmatrix}$ generates a DC $[n_i, r_i, n_i - r_i + 1]$ MDS code over $GF(p_i^{s_i})$.
3: Construct the Fourier $n_i \times n_i$ matrix over $GF(p_i^{2s_i})$. Let $r_i = \lfloor \frac{n_i}{2} \rfloor + 1$. Then $A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r_i-1} \end{pmatrix}$ generates an Hermitian DC $[n_i, r_i, n_i - r_i + 1]$ MDS code over $GF(p_i^{2s_i})$.
4: $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$, $\lim_{i\to\infty} \frac{n_i - r_i + 1}{n_i} = \frac{1}{2}$.
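The dual-containing construction used in Algorithms 57 and 58 can be checked directly on a small case. The sketch below is a minimal illustration under assumed toy parameters (GF(7), n = 6, r = 4, with 3 as an element of order 6); none of these values come from the paper. It confirms that the rows e_1, ..., e_{n−r} are orthogonal to the generator rows, that they already occur among the generator rows, and that the brute-force minimum distance equals n − r + 1.

```python
from itertools import product

# Assumed toy parameters: prime field GF(7), n = p - 1 = 6, r = 4 > n/2.
p, n, omega, r = 7, 6, 3, 4
F = [[pow(omega, i * j, p) for j in range(n)] for i in range(n)]  # rows e_0..e_5

A = F[:r]               # generator matrix: rows e_0, ..., e_{r-1}
H = F[1:n - r + 1]      # check matrix: rows e_1, ..., e_{n-r}

def dot(u, v):
    """Euclidean inner product over GF(p)."""
    return sum(a * b for a, b in zip(u, v)) % p

# A * H^T = 0, so the n - r independent rows of H form a check matrix.
orthogonal = all(dot(a, h) == 0 for a in A for h in H)

# Every row of H is also a row of A, so the dual code lies inside the code.
dual_contained = all(h in A for h in H)

def encode(msg):
    """Codeword msg * A over GF(p)."""
    return [sum(m * row[j] for m, row in zip(msg, A)) % p for j in range(n)]

# Brute-force minimum distance over the 7^4 - 1 non-zero messages.
d_min = min(
    sum(1 for x in encode(msg) if x)
    for msg in product(range(p), repeat=r)
    if any(msg)
)

print(orthogonal, dual_contained, d_min)
```

The exhaustive search is only practical for tiny fields; it is meant as a check of the construction, not as a design tool.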
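Algorithm 59. Design infinite series of DC codes $[n_i, r_i, d_i]$ over fields of given characteristic $p$ with (i) rates and rdists satisfying $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$, $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{2}$, (ii) rates and rdists satisfying $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{s}{q} > \frac{1}{2}$, $\lim_{i\to\infty} \frac{d_i}{n_i} = 1 - \frac{s}{q}$.
1: Consider a series $n_1 < n_2 < n_3 < \ldots$ where $\gcd(p, n_i) = 1$. Then $\mathrm{OrderMod}(p, n_i) = s_i$ for some $s_i$.
2: Construct the Fourier $n_i \times n_i$ matrix over $GF(p^{s_i})$. Let $r_i = \lfloor \frac{n_i}{2} \rfloor + 1$. Then $A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r_i-1} \end{pmatrix}$ generates a DC $[n_i, r_i, n_i - r_i + 1]$ code over $GF(p^{s_i})$.
3: Construct the Fourier $n_i \times n_i$ matrix over $GF(p^{2s_i})$. Let $r_i = \lfloor \frac{n_i}{2} \rfloor + 1$. Then $A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r_i-1} \end{pmatrix}$ generates a Hermitian DC $[n_i, r_i, n_i - r_i + 1]$ code over $GF(p^{2s_i})$.
4: $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$, $\lim_{i\to\infty} \frac{n_i - r_i + 1}{n_i} = \frac{1}{2}$.
(ii) The general fraction $R = \frac{s}{q}$ may be obtained by choosing $r_i = \lfloor \frac{s}{q} n_i \rfloor$ in item 2.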
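Algorithms 58 and 59 rely on OrderMod(p, n), the multiplicative order of p modulo n, to decide which extension field carries the n × n Fourier matrix. A minimal Python sketch follows; the function name `order_mod` and the sample lengths are illustrative choices, not the paper's notation.

```python
from math import gcd

def order_mod(p, n):
    """Smallest s >= 1 with p**s congruent to 1 modulo n.

    Requires n >= 2 and gcd(p, n) == 1; GF(p**s) then contains an
    element of order n, so the n x n Fourier matrix exists over it.
    """
    if n < 2 or gcd(p, n) != 1:
        raise ValueError("need n >= 2 and gcd(p, n) == 1")
    s, x = 1, p % n
    while x != 1:
        x = (x * p) % n
        s += 1
    return s

# Lengths of the form 2**s - 1 in characteristic 2.
for n in (7, 15, 31, 255):
    print(n, order_mod(2, n))
```

For a length n = 2^s − 1 the order of 2 is s, so the smallest suitable field in characteristic 2 is GF(2^s).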
By taking a series of odd integers $2n_1 + 1 < 2n_2 + 1 < \ldots$ infinite such series are obtained over fields of characteristic 2. The odd series $(2^2 - 1) < (2^3 - 1) < (2^4 - 1) < \ldots$ is particularly noteworthy. Note $\mathrm{OrderMod}(2, n) = s$ for $2^s - 1 = n$ and $GF(2^s)$ contains an element of order $2^s - 1$ describing all the non-zero elements of $GF(2^s)$. The Fourier matrix of size $n \times n$ may then be constructed over $GF(2^s)$. This has many nice consequences for designing codes of different types in characteristic 2. This is illustrated as follows for the case $GF(2^4)$, since $15 = 2^4 - 1$.

Example 5. Construct codes and particular types of codes over $GF(2^4)$ and Hermitian such codes over $GF(2^8)$. Let $F_{15} = F$ be the Fourier matrix of size $15 \times 15$ over $GF(2^4)$ or, as relevant, over $GF(2^8)$.
1: For $r > 7$, $A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r-1} \end{pmatrix}$ generates a DC MDS code $[15, r, 15 - r + 1]$.
2: For $r > 7$, $A = \begin{pmatrix} e_0 \\ e_1 \\ \vdots \\ e_{r-1} \end{pmatrix}$ considered as a matrix over $GF(2^8)$ generates an Hermitian DC MDS code $[15, r, 15 - r + 1]$.
3: $G[z] = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \\ e_3 \\ e_4 \\ e_5 \\ e_6 \\ e_7 \end{pmatrix} + \begin{pmatrix} 0 \\ e_8 \\ e_9 \\ e_{10} \\ e_{11} \\ e_{12} \\ e_{13} \\ e_{14} \end{pmatrix} z$ generates an MDS convolutional code $(15, 8, 7; 1, 15)$. This has distance twice the distance, less 1, of the linear MDS code $[15, 8, 8]$. But this code is also an LCD code.
4: $A = \begin{pmatrix} e_0 \\ e_1 \\ e_{14} \\ e_2 \\ e_{13} \\ e_3 \\ e_{12} \\ e_4 \\ e_{11} \end{pmatrix}$ generates an LCD, MDS code $[15, 9, 7]$; see [8] for details. $A * (f_5, f_{10}, f_6, f_9, f_7, f_8) = 0$.
5: Consider $A$ as above constructed in $GF(2^8)$. Then $A$ generates a Hermitian LCD code over $GF(2^8)$.
6: $A = \begin{pmatrix} e_0 \\ e_1 \\ e_{14} \\ e_2 \\ e_{13} \\ e_3 \\ e_{12} \\ e_4 \\ e_{11} \end{pmatrix}$, $B = \begin{pmatrix} 0 \\ 0 \\ 0 \\ e_5 \\ e_{10} \\ e_6 \\ e_9 \\ e_7 \\ e_8 \end{pmatrix}$ designs the MDS convolutional code $(15, 9, 6; 1, 13)$, generated by $G[z] = A + Bz$. This code in addition is a dual-containing MDS convolutional code.
7: By taking $r = \lfloor \frac{3}{4} \cdot 15 \rfloor + 1 = 12$ or $r = \lfloor \frac{3}{4} \cdot 15 \rfloor = 11$, codes of rate 'near' $\frac{3}{4}$ are obtained; in this case codes of rate $\frac{12}{15} = \frac{4}{5}$ or $\frac{11}{15}$ attaining the MDS bound are obtained.

Algorithm 510. Construct infinite series of MDS codes of various types of linear block and convolutional codes over fields of the form $GF(2^i)$ and Hermitian such codes over fields of the form $GF(2^{2i})$.
1. First note $\mathrm{OrderMod}(2, 2^i - 1) = i$ for given $i$.
2. Construct the $n_i \times n_i$ Fourier matrix $F_{n_i}$ over $GF(2^i)$ with $n_i = 2^i - 1$.
3. For $r_i > \lfloor \frac{n_i}{2} \rfloor$ let $C_{r_i}$ be the code generated by rows $e_0, e_1, \ldots, e_{r_i-1}$. Then $C_{r_i}$ is an $[n_i, r_i, n_i - r_i + 1]$ dual-containing MDS linear code over $GF(2^i)$. This gives an infinite series of codes $C_{r_i}$ of type $[n_i, r_i, n_i - r_i + 1]$.
4. For $r_i = \lceil \frac{n_i}{2} \rceil$ this gives an infinite series $C_{r_i}$ with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{2}$ where $d_i = n_i - r_i + 1$ is the distance.
5. For $r_i = \lfloor \frac{3n_i}{4} \rfloor$ this gives an infinite series $C_{r_i}$ with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{3}{4}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{4}$ where $d_i = n_i - r_i + 1$ is the distance.
6. Other such infinite series can be obtained with different fractions by letting $r_i = \lfloor R n_i \rfloor$ ($R$ rational); then series $C_{r_i}$ with $\lim_{i\to\infty} \frac{r_i}{n_i} = R$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = 1 - R$ are obtained where $d_i = n_i - r_i + 1$ is the distance.
Construct the $n_i \times n_i$ Fourier matrix $F_{n_i}$ over $GF(2^{2i})$ with $n_i = 2^i - 1$. For $r_i > \lfloor \frac{n_i}{2} \rfloor$ let $C_{r_i}$ be the code generated by rows $\{e_0, e_1, \ldots, e_{r_i-1}\}$. Then $C_{r_i}$ is an $[n_i, r_i, n_i - r_i + 1]$ DC MDS Hermitian linear code over $GF(2^{2i})$. This gives an infinite series of $C_{r_i}$ of type $[n_i, r_i, n_i - r_i + 1]$ which are Hermitian DC. For $r_i = \lceil \frac{n_i}{2} \rceil$ this gives an infinite series of Hermitian DC codes $C_{r_i}$ with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{2}$ where $d_i = n_i - r_i + 1$ is the distance. Taking $r_i = \lfloor \frac{3n_i}{4} \rfloor$ designs an infinite series of Hermitian DC codes $C_{r_i}$ with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{3}{4}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{4}$ where $d_i = n_i - r_i + 1$ is the distance. Other such infinite series of Hermitian codes can be obtained with different fractions by choosing $r_i = \lfloor \frac{s n_i}{q} \rfloor$ for a fraction $\frac{s}{q} < 1$.

Algorithm 511. Construct infinite series of codes of various 'types' over prime fields $GF(p)$ and with Hermitian inner product over fields $GF(p^2)$.
Let $p_1 < p_2 < \ldots$ be an infinite series of primes. For $p_i$ construct the Fourier $n_i \times n_i$ matrix over $GF(p_i)$ where $n_i = p_i - 1$. For $r_i > \frac{n_i}{2}$ let $C_{r_i}$ be the code generated by rows $e_0, e_1, \ldots, e_{r_i-1}$. Then $C_{r_i}$ is an $[n_i, r_i, n_i - r_i + 1]$ DC MDS linear code over $GF(p_i)$. The arithmetic is modular arithmetic. This gives an infinite series of codes $C_{r_i}$ of type $[n_i, r_i, n_i - r_i + 1]$. For $r_i = \frac{n_i}{2} + 1$ this gives an infinite series $C_{r_i}$ of type $[n_i, r_i, n_i - r_i + 1]$ DC codes over prime fields with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{2}$ where $d_i = n_i - r_i + 1$ is the distance. For $r_i = \lfloor \frac{3n_i}{4} \rfloor$ this gives an infinite series of such $C_{r_i}$ with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{3}{4}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{4}$ where $d_i = n_i - r_i + 1$ is the distance. For $r_i = \lfloor \frac{p}{q} n_i \rfloor$ infinite series are obtained with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{p}{q}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = 1 - \frac{p}{q}$ where $d_i = n_i - r_i + 1$ is the distance.
Construct the $n_i \times n_i$ Fourier matrix $F_{n_i}$ over $GF(p_i^2)$. For $r_i > \frac{n_i}{2}$ let $C_{r_i}$ be the code generated by rows $\{e_0, e_1, \ldots, e_{r_i-1}\}$. Then $C_{r_i}$ is an $[n_i, r_i, n_i - r_i + 1]$ Hermitian DC MDS linear code over $GF(p_i^2)$. This gives an infinite series of $C_{r_i}$ of type $[n_i, r_i, n_i - r_i + 1]$ which are Hermitian DC over $GF(p_i^2)$. For $r_i = \frac{n_i}{2} + 1$ this gives an infinite series of Hermitian DC codes $C_{r_i}$ with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{2}$ where $d_i = n_i - r_i + 1$ is the distance. For $r_i = \lfloor \frac{3n_i}{4} \rfloor$ this gives an infinite series of Hermitian DC codes $C_{r_i}$ with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{3}{4}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{4}$ where $d_i = n_i - r_i + 1$ is the distance. For $r_i = \lfloor \frac{p}{q} n_i \rfloor$ infinite series are obtained with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{p}{q}$ and $\lim_{i\to\infty} \frac{d_i}{n_i} = 1 - \frac{p}{q}$ where $d_i$ is the distance.

The following algorithm explains how to design convolutional codes with free distance of the order of twice the distance of the corresponding MDS block code of the same length and rate.

Algorithm 512. Design convolutional MDS memory 1 codes to the order of twice the distance of the linear block MDS codes of the same length and rate.
1. Let $F_n$ be a Fourier $n \times n$ matrix. Denote the rows in order of $F_n$ by $\{e_0, e_1, \ldots, e_{n-1}\}$ and $n$ times the columns of the inverse of $F_n$ by $\{f_0, f_1, \ldots, f_{n-1}\}$. Then $e_i f_j = n\delta_{ij}$, $f_i^T = e_{n-i}$, $e_i^T = f_{n-i}$ with indices taken mod $n$.
2. Let $r > \frac{n}{2}$. Let $A$ be the matrix with first $r$ rows $\{e_0, e_1, \ldots, e_{r-1}\}$ of $F_n$ and let $B$ be the matrix whose last rows are $\{e_r, \ldots, e_{n-1}\}$ in order and whose first $(2r - n)$ rows consist of zero vectors, so that $B$ has $r$ rows like $A$.
3. Define $G[z] = A + Bz$. Then $G[z]$ is a generating matrix for a non-catastrophic convolutional code $(n, r, n - r; 1, 2(n - r) + 1)$ of free distance $2(n - r) + 1$. A control matrix is easily written down, as is a right inverse for $G[z]$.

Thus the MDS convolutional code produced of rate $\frac{r}{n}$ has twice the distance, less 1, of the MDS linear code $[n, r, n - r + 1]$ with the same rate and length. Note that $r$ can be any integer $> \frac{n}{2}$ so all rates $\frac{r}{n}$ for $\frac{n}{2} < r < n$ are obtainable. The dual code has rate $(1 - R)$ where $R = \frac{r}{n}$ is the rate of $C$, so rates $R$ with $R < \frac{1}{2}$ are obtainable.

Alternatively item 2 of Algorithm 512 may be replaced by taking rows in arithmetic sequence as follows: Let $r > \frac{n}{2}$ and let $A$ be formed from $F_n$ by taking $r$ rows in arithmetic sequence with arithmetic difference $k$ satisfying $\gcd(k, n) = 1$. Define $B$ to be the matrix whose last rows are the other rows of $F_n$ not in $A$ (which also are in arithmetic sequence satisfying the gcd condition) and whose first $(2r - n)$ rows consist of zero vectors.

The methods are illustrated in the following examples. The cases $2^8 = 256$ and the near prime 257 are worth noting.

Example 6. (i) Construct DC MDS codes of length 255 of various permissible rates over $GF(2^8)$. (ii) Construct Hermitian DC MDS codes of length 255 of various permissible rates over $GF(2^{16})$. (iii) Construct QECC MDS codes of length 255 of various rates over $GF(2^8)$. (iv) Construct Hermitian QECC MDS codes of length 255 of various rates over $GF(2^{16})$. (v) Construct LCD MDS codes of length 255 of various rates over $GF(2^8)$. (vi) Construct Hermitian LCD MDS codes of length 255 of various rates over $GF(2^{16})$.
1. Over $GF(2^8)$ construct the Fourier $255 \times 255$ matrix $F$.
2. For $r > \lfloor \frac{255}{2} \rfloor = 127$ let $A$ be the code generated by the first $r$ rows of $F$. Then $A$ is a DC $[255, r, 255 - r + 1]$ code.
3. For $r = 128$ the DC $[255, 128, 128]$ code of rate about $\frac{1}{2}$ and rdist of about $\frac{1}{2}$ is designed.
4. For $r = \lfloor \frac{255 \cdot 3}{4} \rfloor = 191$ the code $[255, 191, 65]$ of rate about $\frac{3}{4}$ and rdist of about $\frac{1}{4}$ is obtained.
5. For $r = \lfloor \frac{255 \cdot 7}{8} \rfloor = 223$ the code $[255, 223, 33]$ of rate about $\frac{7}{8}$ and rdist of about $\frac{1}{8}$ is obtained. This can correct 16 errors.
6. Over $GF(2^{16})$ construct the Fourier $255 \times 255$ matrix.
7. For $r > \lfloor \frac{255}{2} \rfloor = 127$, let $A$ be the code generated by the first $r$ rows of $F$. Then $A$ is an Hermitian DC $[255, r, 255 - r + 1]$ code.
8. For $r = 128$ the Hermitian DC $[255, 128, 128]$ code of rate about $\frac{1}{2}$ and rdist of about $\frac{1}{2}$ is designed.
9. For $r = \lfloor \frac{255 \cdot 3}{4} \rfloor = 191$ the Hermitian DC code $[255, 191, 65]$ of rate about $\frac{3}{4}$ and rdist of about $\frac{1}{4}$ is obtained.
10. For $r = \lfloor \frac{255 \cdot 7}{8} \rfloor = 223$ the Hermitian DC code $[255, 223, 33]$ of rate about $\frac{7}{8}$ and rdist of about $\frac{1}{8}$ is obtained. This can correct 16 errors.

To obtain the QECCs, apply the CSS construction to the DC codes formed. LCD codes are designed as follows; see [8] where the method is devised.
1. Let $C$ be the code generated by the rows $e_0, e_1, e_{254}, e_2, e_{253}, e_3, e_{252}, \ldots, e_r, e_{255-r}$ for $2r < 255$.
2. Then $C$ is a $[255, 2r + 1, 255 - 2r]$ code; notice that the rows of the generator matrix are in sequence $\{255 - r, 255 - r + 1, \ldots, 254, 0, 1, 2, \ldots, r\}$ so the code is MDS.
3. The dual code of $C$ is the code generated by the transpose of $(f_{255-r-1}, f_{r+1}, f_{r+2}, f_{255-r-2}, \ldots, f_{r+2}, f_{255-r-2})$ and this consists of rows $\{e_{r+1}, e_{255-r-1}, \ldots, e_{255-r-2}, e_{r+2}\}$. Thus $C \cap C^\perp = 0$ and so the code is an LCD MDS code.
4. To get LCD MDS Hermitian codes of length 255 of various rates as above, work in $GF(2^{16})$.

Lemma 4. Let $A, B, C, D$ be matrices of the same size $r \times n$. Suppose the code generated by $A$ intersects trivially the code generated by $C$ and the code generated by $B$ intersects trivially the code generated by $D$. Then the convolutional code generated by $A + Bz$ intersects trivially the convolutional code generated by $C \pm Dz$.

Proof. Compare coefficients of $P[z](A + Bz)$ with coefficients of $Q[z](C \pm Dz)$ for $1 \times r$ polynomial vectors $P[z], Q[z]$, $P[z] = P_0 + P_1 z + \ldots$, $Q[z] = Q_0 + Q_1 z + \ldots$. In turn get $P_0 = 0 = Q_0$, then $P_1 = 0 = Q_1$, and so on.

Using this, LCD MDS convolutional codes may be designed leading on from the DC codes designed in Algorithm 54.

Algorithm 513. Design MDS LCD convolutional codes of the order of twice the distance of the MDS DC block code with the same length and rate. First of all design the DC codes as in Algorithm 512.
1. Let $F_n$ be a Fourier $n \times n$ matrix.
2. Let $r > \frac{n}{2}$. Define $A$ to be the matrix of the first $r$ rows $(e_0, e_1, \ldots, e_{r-1})$ of $F_n$ and define $B$ to be the matrix whose last rows are $e_r, \ldots, e_{n-1}$ in order and whose first $(2r - n)$ rows consist of zero vectors.
3. Define $G[z] = A + Bz$. Then $G[z]$ is a generating matrix for a non-catastrophic convolutional code $(n, r, n - r; 1, 2(n - r) + 1)$ of free distance $2(n - r) + 1$. A check matrix is easily written down, as is a right inverse for $G[z]$.
4. The code generated by $G[z]$ is an LCD MDS convolutional code. This is shown as follows:
A control matrix for the code is $H^T[z] = (f_r, \ldots, f_{n-1}) - (f_{n-r}, f_{n-r+1}, \ldots, f_{r-1})z$. Then $H[z^{-1}] = \begin{pmatrix} e_{n-r} \\ e_{n-r-1} \\ \vdots \\ e_1 \end{pmatrix} - \begin{pmatrix} e_r \\ e_{r-1} \\ \vdots \\ e_{n-r+1} \end{pmatrix} z^{-1}$. Thus a control matrix is $\begin{pmatrix} e_{n-r} \\ e_{n-r-1} \\ \vdots \\ e_1 \end{pmatrix} z - \begin{pmatrix} e_r \\ e_{r-1} \\ \vdots \\ e_{n-r+1} \end{pmatrix}$, equal to say $-C + Dz$. Now the code generated by $C$ has trivial intersection with the code generated by $A$ and the code generated by $D$ has trivial intersection with the code generated by $B$. Hence by Lemma 4 the convolutional code $\mathcal{C}$ generated by $A + Bz$ has trivial intersection with the code generated by $C - Dz$. Hence $\mathcal{C}$ is an LCD convolutional MDS $(n, r, n - r; 1, 2(n - r) + 1)$ code.

5.1 QECC Hermitian
Of particular interest are QECCs with Hermitian inner product. These need to be designed over fields $GF(q^2)$ where the Hermitian inner product is defined by $\langle u, v \rangle_H = \langle u, v^q \rangle_E$. Hermitian QECCs can be designed by the CSS construction from DC Hermitian codes. A separate algorithm is given below although it follows by methods similar to those already designed.

Algorithm 514. Construct QECCs over $GF(q^2)$.
$GF(q^2)$ has an element of order $q^2 - 1$ and hence an element of order $q - 1$, as $q^2 - 1 = (q - 1)(q + 1)$. Let $\omega$ be an element of order $(q - 1)$ in $GF(q^2)$. Let $n = (q - 1)$ and construct the Fourier $n \times n$ matrix $F_n$ with this $\omega$. The rows of $F_n$ are denoted by $\{e_0, e_1, \ldots, e_{n-1}\}$. Let $r > \frac{n}{2}$ and define $A$ to be the matrix with rows $\{e_0, \ldots, e_{r-1}\}$. Then the code generated by $A$ is an Hermitian DC MDS $[n, r, n - r + 1]$ code over $GF(q^2)$. Use the CSS construction to form a QECC MDS Hermitian $[[n, 2r - n, n - r + 1]]$ code. For $r = \lfloor \frac{n}{2} \rfloor + 1$ a DC MDS code of rate about $\frac{1}{2}$ is obtained and an MDS QECC of rate about 0 and rdist of about $\frac{1}{2}$. For $r = \lfloor \frac{3n}{4} \rfloor$ a DC MDS code is obtained of rate about $\frac{3}{4}$ and an MDS QECC of rate about $\frac{1}{2}$ and rdist of about $\frac{1}{4}$, since the distance is $n - r + 1$. Higher rates may be obtained.

Infinite series of such codes may also be obtained. The following algorithm gives an infinite series of such codes in characteristic 2, but other characteristics are obtained similarly; the characteristics may be mixed.

Algorithm 515. Construct infinite series of characteristic 2 Hermitian DC $[n_i, r_i, n_i - r_i + 1]$ codes $C_i$ in which (i) $\lim_{i\to\infty} \frac{r_i}{n_i} = R$, for $1 > R \ge \frac{1}{2}$ and (ii) $\lim_{i\to\infty} \frac{d_i}{n_i} = 1 - R$; $R$ here is rational. From this derive infinite series of Hermitian MDS QECCs $D_i$ of form $[[n_i, 2r_i - n_i, n_i - r_i + 1]]$ in which (i) $\lim_{i\to\infty} \frac{2r_i - n_i}{n_i} = (2R - 1)$, and (ii) $\lim_{i\to\infty} \frac{d_i}{n_i} = (1 - R)$.
Consider $GF(2^{2i})$. This has an element of order $(2^i - 1) = n_i$; use this to form the Fourier $n_i \times n_i$ matrix over $GF(2^{2i})$. Let $r_{j,i} > \frac{n_i}{2}$ and let $A$ be the matrix with rows $\{e_0, \ldots, e_{r_{j,i}-1}\}$. Then the code $C_{i,j}$ generated by $A$ is a Hermitian dual-containing $[n_i, r_{j,i}, n_i - r_{j,i} + 1]$ code. This gives an infinite series $C_{i,j}$ of Hermitian dual-containing codes in characteristic 2. The $r_{j,i}$ can vary for each $GF(2^{2i})$.
Now fix $r_{j,i} = \lfloor \frac{n_i}{2} \rfloor + 1 = r_i$ for each $C_{i,j}$ and let $C_i$ be the codes obtained. This gives the infinite series $C_i$ of $[n_i, r_i, n_i - r_i + 1]$ codes with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{1}{2}$, $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{2}$.
Fixing $r_{j,i} = \lfloor \frac{3n_i}{4} \rfloor = r_i$ gives an infinite series $C_i$ of $[n_i, r_i, n_i - r_i + 1]$ codes with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{3}{4}$, $\lim_{i\to\infty} \frac{d_i}{n_i} = \frac{1}{4}$.
Fixing $r_{j,i} = \lfloor \frac{p n_i}{q} \rfloor = r_i$, $\frac{1}{2} < \frac{p}{q} < 1$, gives an infinite series $C_i$ of $[n_i, r_i, n_i - r_i + 1]$ codes with $\lim_{i\to\infty} \frac{r_i}{n_i} = \frac{p}{q}$, $\lim_{i\to\infty} \frac{d_i}{n_i} = 1 - \frac{p}{q}$.
The infinite series of Hermitian QECCs with limits as specified is immediate.

5.2 Higher Memory
Higher memory MDS convolutional codes may be obtained by this general method of using all the rows of an invertible 'good' matrix. The principle is established in [36] where rows of an invertible matrix are used to construct convolutional codes. Here just one example is given and the general construction is left for later work; some extremely nice codes are obtainable by the method.

Example 7. Consider again, as in Example 1, the Fourier $7 \times 7$ matrix over $GF(2^3)$ with rows $\{e_0, \ldots, e_6\}$ and 7 times the columns of the inverse denoted by $\{f_0, \ldots, f_6\}$.

Construct $G[z] = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \end{pmatrix} + \begin{pmatrix} 0 \\ e_3 \\ e_4 \end{pmatrix} z + \begin{pmatrix} e_5 \\ 0 \\ e_6 \end{pmatrix} z^2$.

Then $G[z]$ generates a convolutional code of type $(7, 3, 5; 2)$; the degree is 5. The GSB for such a code is $(7 - 3)(\lfloor \frac{5}{3} \rfloor + 1) + 5 + 1 = 4 * 2 + 5 + 1 = 14$. In fact the free distance is actually 14. This may be shown in an analogous way to the proof of Lemma 1.

$G[z] * ((f_3, f_4, f_5, f_6) - (f_1, f_2, 0, 0)z - (0, 0, f_0, f_2)z^2) = 0$, $\quad G[z] * (f_0, f_1, f_2) = 7I_3$.

The result is that the code generated by $G[z]$ is a non-catastrophic convolutional MDS $(7, 3, 5; 2, 14)$ code. Note the free distance attained is $5 * 3 - 1$ where 5 is the free distance of a $[7, 3, 5]$ MDS code; the distance is tripled less 1. This is a general principle: the free distance is of order three times the distance of the MDS code of the same length and dimension. It is not a dual-containing code nor an LCD code. To get such codes requires a compromise on the distance.

$G[z] = \begin{pmatrix} e_0 \\ e_1 \\ e_2 \end{pmatrix} + \begin{pmatrix} 0 \\ e_3 \\ e_4 \end{pmatrix} z + \begin{pmatrix} 0 \\ e_5 \\ e_6 \end{pmatrix} z^2$. This gives a $(7, 3, 4; 2)$ convolutional code which turns out to be an LCD code but the free distance is only 7. The GSB for such a code is 13.
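The two matrix identities quoted in Example 7 for the first generator matrix can be verified mechanically. The sketch below is a check under stated assumptions: GF(2^3) is represented through the primitive polynomial x^3 + x + 1 (the paper does not fix a representation), field elements are the integers 0 to 7, and the helper names are hypothetical. In characteristic 2 the minus signs coincide with plus signs and 7 I_3 equals I_3.

```python
# Antilog table for GF(8) built from x^3 + x + 1 (assumed representation).
EXP = [1]
for _ in range(6):
    v = EXP[-1] << 1
    EXP.append(v ^ 0b1011 if v & 0b1000 else v)
LOG = {v: k for k, v in enumerate(EXP)}

def mul(a, b):
    """Product in GF(8)."""
    return 0 if a == 0 or b == 0 else EXP[(LOG[a] + LOG[b]) % 7]

n = 7
e = [[EXP[(i * j) % n] for j in range(n)] for i in range(n)]    # e[i] = row e_i
f = [[EXP[(-i * j) % n] for i in range(n)] for j in range(n)]   # f[j] = column f_j
ZERO = [0] * n

def dot(u, v):
    acc = 0
    for a, b in zip(u, v):
        acc ^= mul(a, b)          # addition in characteristic 2 is XOR
    return acc

def matmul(rows, cols):
    """(list of rows) times (list of columns)."""
    return [[dot(r, c) for c in cols] for r in rows]

def polymul(G, H):
    """Coefficient matrices of G[z] * H[z].

    G is a list of coefficient matrices given by rows, H a list of
    coefficient matrices given by columns.
    """
    out = []
    for k in range(len(G) + len(H) - 1):
        acc = [[0] * len(H[0]) for _ in G[0]]
        for i in range(len(G)):
            j = k - i
            if 0 <= j < len(H):
                prod = matmul(G[i], H[j])
                acc = [[x ^ y for x, y in zip(ra, rp)] for ra, rp in zip(acc, prod)]
        out.append(acc)
    return out

# G[z] = (e0; e1; e2) + (0; e3; e4) z + (e5; 0; e6) z^2
G = [[e[0], e[1], e[2]], [ZERO, e[3], e[4]], [e[5], ZERO, e[6]]]
# Control matrix columns: (f3, f4, f5, f6) - (f1, f2, 0, 0) z - (0, 0, f0, f2) z^2
Ht = [[f[3], f[4], f[5], f[6]], [f[1], f[2], ZERO, ZERO], [ZERO, ZERO, f[0], f[2]]]
# Right inverse: (f0, f1, f2)
K = [[f[0], f[1], f[2]]]

control_ok = all(x == 0 for M in polymul(G, Ht) for row in M for x in row)
GK = polymul(G, K)
print(control_ok, GK[0])
```

The check confirms the algebraic identities only; it does not compute the free distance of the code.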
Convolutional codes $(n, r, \delta)$ have maximum free distance $(n - r)(\lfloor \frac{\delta}{r} \rfloor + 1) + \delta + 1$. When $r > \delta$ this maximum free distance is $n - r + \delta + 1$. Here is a design method for maximum free distance memory 1 convolutional codes.

Example 8. Construct a convolutional code of rate $\frac{15}{16}$ which has free distance $\ge 61$. It is required to construct an $(n, r, n - r)$ convolutional code such that $\frac{r}{n} \ge \frac{15}{16}$ and $2(n - r) + 1 \ge 61$. Thus require $(n - r) \ge 30$ and hence require $n(1 - R) \ge 30$. Thus $n \ge 30/(1 - R) \ge 30 * 16 = 480$. Construct the Fourier $480 \times 480$ matrix over a suitable field. Require $\frac{r}{n} \ge \frac{15}{16}$ and $r \ge \frac{15}{16} * 480 = 450$. Now by Algorithm 512 construct the $(480, 450, \delta)$ convolutional code with $\delta = 30$. This code has free distance $2(n - r) + 1 = 60 + 1$ as required. The rate is $\frac{450}{480} = \frac{15}{16}$.

The Fourier $480 \times 480$ matrix may be constructed over a field of characteristic $p$ where $\gcd(p, 480) = 1$. Now $7^4 \equiv 1 \bmod 480$ so the field $GF(7^4)$ can be used. This has an element of order 480 and the Fourier matrix of size $480 \times 480$ exists over $GF(7^4)$. Suppose now a field of characteristic 2, for example, is required. Then replace "$n \ge 480$" by "$n \ge 480$ and $\gcd(2, n) = 1$". It is convenient to take $n$ to be $(2^s - 1)$ and in this case take $n = 2^9 - 1 = 511$, in which case the arithmetic is done in $GF(2^9)$. The first prime greater than 480 is 487 so the construction can be done over the prime field $GF(487)$.

These codes have twice the error-correcting capability of MDS block codes of the same rate and so should be very useful as codes. Series of block linear codes which are DC, QECCs, LCD are designed in Algorithms 51 to 54. Now we work on the types of convolutional codes that can be formed from these types when extending according to Algorithm 512. Thus design methods for convolutional DC, QECC and LCD codes are required.

Comment. From a recent article: "Far more efficient quantum error-correcting codes are needed to cope with the daunting error rates of real qubits. The effort to design better codes is 'one of the major thrusts of the field,' Aaronson said, along with improving the hardware. Ahmed Almheiri, Xi Dong and Daniel Harlow [6] did calculations suggesting that this holographic 'emergence' of space-time works just like a quantum error-correcting code. They conjectured in the Journal of High Energy Physics that space-time itself is a code, in anti-de Sitter (AdS) universes at least. This led to a wave of activity in the quantum gravity community, leading to new impulse to quantum error-correcting codes that could capture more properties of space-time." What this is saying is that "Ahmed Almheiri, Xi Dong and Daniel Harlow originated a powerful new idea that the fabric of space-time is a quantum error-correcting code".
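Returning to Example 8, the parameter arithmetic for a memory 1 design can be reproduced in a few lines. This is a sketch only: the function name `memory1_parameters` is hypothetical, and it simply solves r/n ≥ R and 2(n − r) + 1 ≥ d for the smallest n, as in the example.

```python
from fractions import Fraction
from math import ceil

def memory1_parameters(rate, free_distance):
    """Smallest (n, r, delta) with r/n >= rate and 2(n - r) + 1 >= free_distance.

    delta = n - r is the degree of the memory 1 code G[z] = A + Bz.
    """
    gap = free_distance // 2                 # least n - r with 2(n - r) + 1 >= d
    n = ceil(Fraction(gap) / (1 - rate))     # from n(1 - R) >= n - r
    r = n - gap
    return n, r, gap

n, r, delta = memory1_parameters(Fraction(15, 16), 61)
print(n, r, delta, 2 * (n - r) + 1, Fraction(r, n))

# Field checks used in Example 8.
print(pow(7, 4, 480))     # 7^4 mod 480
print(2 ** 9 - 1)         # length of the form 2^s - 1 that is at least 480
```

Exact rational arithmetic via `Fraction` avoids rounding problems when the rate is a fraction such as 15/16.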
5.3 DC LCD Convolutional
Convolutional DC codes over fields of characteristic 2 may be designed as follows.

Algorithm 516.
1. Let $F_{2m+1}$ be a Fourier matrix over a field of characteristic 2. Denote its rows in order by $\{e_0, e_1, \ldots, e_{n-1}\}$ and $n$ times the columns of its inverse by $\{f_0, f_1, \ldots, f_{n-1}\}$ in order, where $n = 2m + 1$. Then $e_i f_j = \delta_{ij}$, $f_i^T = e_{n-i}$, $e_i^T = f_{n-i}$.
2. Choose the matrix $A$ as follows. Let $e_0$ be its first row and then choose $r$ pairs $\{e_i, e_{n-i}\}$ for the other rows, such that $2r \ge m$. Thus $A$ has $(2r + 1)$ rows and $A$ is a $(2r + 1) \times n$ matrix.
3. Choose $B$ with first $(4r - 2m + 1)$ rows consisting of the zero vector and the other $2(m - r)$ rows consisting of the rest of the pairs $e_i, e_{n-i}$ ($m - r$ pairs) not used in item 2. Then $B$ is a $(2r + 1) \times n$ matrix.
4. Construct $G[z] = A + Bz$.
5. $G[z]$ generates a convolutional dual-containing code from which quantum convolutional codes may be constructed. The control matrix of the code is easy to construct. There is a matrix $K$ such that $GK = I_{2r+1}$, thus ensuring the code is non-catastrophic. The degree, $\delta$, of the code is $2(m - r)$.
6. The GSB of such an $(n, 2r + 1, \delta)$ code is $(n - 2r - 1)(\lfloor \frac{\delta}{2r+1} \rfloor + 1) + \delta + 1 = n - 2r - 1 + \delta + 1 = n - 2r + \delta = n - 2r + 2m - 2r = 4(m - r) + 1$.
7. It may be shown that the code generated by $G[z]$ is an MDS convolutional $(n, 2r + 1, 2(m - r); 1, 4(m - r) + 1)$ code.

Consider the field $GF(2^n)$. This has an element of order $2^n - 1 = q$ and it seems best to construct the $q \times q$ Fourier matrix $F_q$ over $GF(2^n)$.

Example 9. Construct DC convolutional codes of length 31. The order of 2 mod 31 is 5 and thus work in the field $GF(2^5)$. Form the $31 \times 31$ Fourier matrix $F_{31}$ over $GF(2^5)$. Now proceed as in Algorithm 516. For example let $A$ have rows $e_0, e_1, e_{30}, e_2, e_{29}, e_3, e_{28}, e_4, e_{27}, e_5, e_{26}, e_6, e_{25}, e_7, e_{24}, e_8, e_{23}$ and let $B$ have rows $0, 0, 0, e_9, e_{22}, e_{10}, e_{21}, e_{11}, e_{20}, e_{12}, e_{19}, e_{13}, e_{18}, e_{14}, e_{17}, e_{15}, e_{16}$. Now form $G[z] = A + Bz$. The code generated by $G[z]$ is then a $(31, 17, 14)$ DC convolutional code. The GSB for such a code is $(n - r)(\lfloor \frac{\delta}{r} \rfloor + 1) + \delta + 1 = (14)(1) + 14 + 1 = 29$. The generators of $A$ may be arranged in arithmetic sequence with difference 1 and so these form a $[31, 17, 15]$ MDS linear code. Similarly the non-zero vectors in $B$ generate a $[31, 14, 18]$ MDS linear code. Using these it may be shown that this code is an MDS convolutional code. A quantum convolutional code may be designed from this.

Larger rate DC convolutional MDS codes may also be derived. For example let $A$ have rows $e_0, e_1, e_{30}, e_2, e_{29}, e_3, e_{28}, e_4, e_{27}, e_5, e_{26}, e_6, e_{25}, e_7, e_{24}, e_8, e_{23}, e_9, e_{22}, e_{10}, e_{21}$ and let $B$ have rows $0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, e_{11}, e_{20}, e_{12}, e_{19}, e_{13}, e_{18}, e_{14}, e_{17}, e_{15}, e_{16}$, so that $B$ has $4r - 2m + 1 = 11$ zero rows. This gives a $(31, 21, 10)$ DC code. The GSB for such a code is $(n - r)(\lfloor \frac{\delta}{r} \rfloor + 1) + \delta + 1 = 10 + 11 = 21$. The free distance of this code is exactly 21. Similarly $(31, 23, 8)$ codes with free distance 17, $(31, 25, 6)$ codes with free distance 13 and so on may be obtained.
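The generalized Singleton bound (GSB) values quoted in Examples 5, 7 and 9 all come from the same formula, which the following Python sketch evaluates. The function name `gsb` is an illustrative choice.

```python
def gsb(n, r, delta):
    """Generalized Singleton bound (n - r)(floor(delta / r) + 1) + delta + 1."""
    return (n - r) * (delta // r + 1) + delta + 1

# (n, r, delta) triples taken from Examples 5, 7 and 9.
for params in [(15, 8, 7), (15, 9, 6), (7, 3, 5), (7, 3, 4),
               (31, 17, 14), (31, 21, 10), (31, 23, 8)]:
    print(params, gsb(*params))
```

When r > δ the floor term vanishes and the bound reduces to n − r + δ + 1, which for the memory 1 codes with δ = n − r is 2(n − r) + 1.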
5.4 Addendum
It is shown in [36] how orthogonal matrices may be used to construct convolutional codes. Using orthogonal matrices does not allow the same control over the distances achieved as is possible with Vandermonde/Fourier matrices. Low density parity check (LDPC) codes have important applications in communications. Linear block LDPC codes are constructed algebraically in [41] and the methods can be extended to obtain convolutional LDPC codes. This is dealt with separately.
References
1. Blahut, R.E.: Algebraic Codes for Data Transmission. Cambridge University Press (2003)
2. Johannesson, R., Zigangirov, K.: Fundamentals of Convolutional Coding. Wiley-IEEE Press (1999)
3. McEliece, R.J.: Theory of Information and Coding, 2nd edn. Cambridge University Press (2002)
4. McEliece, R.J.: The algebraic theory of convolutional codes. In: Handbook of Coding Theory, Volume I. Elsevier Science, North Holland (1998)
5. MacWilliams, F.J., Sloane, N.J.A.: The Theory of Error-Correcting Codes. Elsevier (1977)
6. Almheiri, A., Dong, X., Harlow, D.: Bulk locality and quantum error correction in AdS/CFT. arXiv:1411.7041 (2014)
7. Bocharova, I., Hug, F., Johannesson, R., Kudryashov, B.: Dual convolutional codes and the MacWilliams identities. Probl. Inf. Transm. 48(1), 21–30 (2012)
8. Hurley, T.: Linear complementary dual, maximum distance separable codes. arXiv:1901.04241 (2020)
9. Hurley, T., Hurley, D., Hurley, B.: Quantum error-correcting codes: the unit-derived strategy. Int. J. Inf. Coding Theor. 5(2), 169–182 (2018)
10. Almeida, P., Napp, D., Pinto, R.: A new class of superregular matrices and MDP convolutional codes. Linear Algebra Appl. 439(7), 2145–2157 (2013)
11. Almeida, P., Napp, D., Pinto, R.: Superregular matrices and applications to convolutional codes. Linear Algebra Appl. 499, 1–25 (2016)
12. Guardia, G.: On negacyclic MDS-convolutional codes. Linear Algebra Appl. 448(Suppl. C), 85–96 (2014)
13. Muñoz Porras, J., Domínguez Pérez, J., Iglesias, C.J., Serrano Sotelo, G.: Convolutional Goppa codes. IEEE Trans. Inf. Theor. 52(1), 340–344 (2006)
14. Carlet, C., Mesnager, S., Tang, C., Qi, Y.: Euclidean and Hermitian LCD MDS codes. Des. Codes Crypt. 86(11), 2605–2618 (2018). arXiv:1702.08033 (2017)
15. Carlet, C., Mesnager, S., Tang, C., Qi, Y., Pelikaan, R.: Linear codes over Fq are equivalent to LCD codes for q > 3. IEEE Trans. Inf. Theor. 64(4), 3010–3017 (2018)
16. Carlet, C., Mesnager, S., Tang, C., Qi, Y.: New characterization and parametrization of LCD codes. IEEE Trans. Inf. Theor. 65, 39–49 (2018). arXiv:1709.03217 (2017)
17. Carlet, C.: Boolean functions for cryptography and error correcting codes. In: Crama, Y., Hammer, P. (eds.) Boolean Models and Methods in Mathematics, Computer Science, and Engineering, pp. 257–397. Cambridge University Press, Cambridge (2010). Monograph Book
18. Carlet, C., Guilley, S.: Complementary dual codes for counter-measures to side-channel attacks. In: Pinto, R., Rocha Malonek, P., Vettori, P. (eds.) Coding Theory and Applications. CIM Series in Mathematical Sciences, vol. 3, pp. 97–105. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-17296-5_9. J. Adv. in Math. of Comm., 10(1), 131–150, 2016
19. Massey, J.L.: Linear codes with complementary duals. Discrete Math. 105(106), 337–380 (1992)
20. Massey, J.L.: Reversible codes. Inf. Control 7(3), 369–380 (1964)
21. Mesnager, S., Tang, C., Qi, Y.: Complementary dual algebraic geometry codes. IEEE Trans. Inf. Theor. 64, 4 (2018)
22. Calderbank, A.R., Rains, E.M., Shor, P.M., Sloane, N.J.A.: Quantum error correction via codes over GF(4). IEEE Trans. Inf. Theor. 44(4), 1369–1387 (1998)
23. Aly, S.A., Grassl, M., Klappenecker, A., Rötteler, M., Sarvepalli, P.K.: Quantum convolutional BCH codes. In: Proceedings of the IEEE 10th CWIT, pp. 180–183 (2007)
24. Gluesing-Luerssen, H., Helmke, U., Iglesias Curto, J.I.: Algebraic decoding for doubly cyclic convolutional codes. arXiv:0908.0753 (2009)
25. Gluesing-Luerssen, H., Schneider, G.: A MacWilliams identity for convolutional codes: the general case. IEEE Trans. Inf. Theor. 55(7), 2920–2930 (2009)
26. Hurley, T.: Maximum distance separable codes to order. arXiv:1902.06624 (2019)
27. Hurley, P., Hurley, T.: Module codes in group rings. In: ISIT 2007, Nice, pp. 1981–1985 (2007)
28. Hurley, B., Hurley, T.: Systems of MDS codes from units and idempotents. Discrete Math. 335, 81–91 (2014)
29. Hurley, T.: Convolutional codes from units in matrix and group rings. Int. J. Pure Appl. Math. 50(3), 431–463 (2009)
30. Rosenthal, J., Smarandache, R.: Maximum distance separable convolutional codes. Appl. Algebra Engrg. Comm. Comput. 10(1), 15–32 (1999)
31. Rosenthal, J.: Connections between linear systems and convolutional codes. In: Marcus, B., Rosenthal, J. (eds.) Codes, Systems, and Graphical Models, Minneapolis, New York, pp. 39–66 (1999)
32. Rosenthal, J.: An algebraic decoding algorithm for convolutional codes. In: Picci, G., Gilliam, D.S. (eds.) Dynamical Systems, Control, Coding, Computer Vision: New Trends, Interfaces, and Interplay, pp. 343–360. Birkhäuser, Boston-Basel-Berlin (1999)
33. Hurley, P., Hurley, T.: Codes from zero-divisors and units in group rings. Int. J. Inform. Coding Theor. 1, 57–87 (2009)
34. Hurley, P., Hurley, T.: Block codes from matrix and group rings, chap. 5. In: Woungang, I., Misra, S., Misma, S.C. (eds.) Selected Topics in Information and Coding Theory, pp. 159–194. World Scientific (2010)
35. Hurley, P., Hurley, T.: LDPC and convolutional codes from matrix and group rings, chap. 6. In: Woungang, I., Misra, S., Misma, S.C. (eds.) Selected Topics in Information and Coding Theory, pp. 195–239. World Scientific (2010)
36. Hurley, T.: Convolutional codes from unit schemes. arXiv:1412.1695 (2020, revised)
37. Hurley, T., Hurley, D.: Coding theory: the unit-derived methodology. Int. J. Inf. Coding Theor. 5(1), 55–80 (2018)
38. Rains, E.: Nonbinary quantum codes. IEEE Trans. Inf. Theor. 43, 1827–1832 (1999)
39. Ashikhmin, A., Knill, E.: Nonbinary quantum stabilizer codes. IEEE Trans. Inf. Theor. 47(7), 3065–3072 (2001)
40. Ketkar, A., Klappenecker, A., Kumar, S., Sarvepalli, P.K.: Nonbinary stabilizer codes over finite fields. IEEE Trans. Inf. Theor. 52(11), 4892–4914 (2006)
41. Hurley, T., McEvoy, P., Wenus, J.: Algebraic constructions of LDPC codes with no short cycles. Int. J. Inf. Coding Theor. 1(3), 285–297 (2010)
42. Smarandache, R., Gluesing-Luerssen, H., Rosenthal, J.: Constructions for MDS-convolutional codes. IEEE Trans. Inf. Theor. 47, 2045–2049 (2001)
A Review of Unsupervised Machine Learning Frameworks for Anomaly Detection in Industrial Applications

Usman Ahmad Usmani1, Ari Happonen2(B), and Junzo Watada3

1 Universiti Technologi Petronas, Perak, Malaysia
2 LUT University, Lappeenranta, Finland
[email protected]
3 Waseda University, 1 Chome-104 Totsukamachi, Shinjuku City, Tokyo 169-8050, Japan

Abstract. Unsupervised learning, also known as unsupervised machine learning, analyzes and clusters unlabeled data utilizing machine learning techniques. Without human input, these algorithms discover patterns or groupings in the data. In the domain of abuse and network intrusion detection, interesting objects are often short bursts of activity rather than rare objects. Anomaly detection is a difficult task that requires familiarity with and a good understanding of the data, and such a pattern does not correspond to the common statistical definition of an outlier as an odd item. Traditional algorithms need data preparation, while unsupervised algorithms can be set up so that they handle the data in raw format. Anomaly detection, sometimes referred to as outlier analysis, is a data mining procedure that detects events, data points, and observations that deviate from the expected behaviour of a dataset. Unsupervised machine learning approaches have shown potential in static data modeling applications such as computer vision, and their use in anomaly detection is gaining attention. Atypical data might reveal critical flaws, such as a software defect, or prospective possibilities, such as a shift in consumer behavior. Currently, the academic literature does not really cover the topic of unsupervised machine learning techniques for anomaly detection. This paper provides an overview of the current deep learning and unsupervised machine learning techniques for anomaly detection and discusses the fundamental challenges in anomaly detection.

Keywords: Anomaly detection · Unsupervised machine learning · Outliers · Feature representation · Deep learning · Neural network · Machine learning · Real-time video · Pattern matching · Time series · Classifiers · Boltzmann machine · Metric analysis · Sampling · Digitalization · Industry 4.0
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 158–189, 2022. https://doi.org/10.1007/978-3-031-10464-0_11

1 Introduction

Data representation for machine learning algorithms is an important concern in the current literature. A considerable portion of the actual effort needed to run machine learning algorithms is spent on feature selection and data transformations. Feature engineering is useful, but it takes time and highlights a flaw in current learning algorithms: their inability to extract the information in the data by themselves. Human intellect and prior knowledge can compensate for this flaw in the application design, but humans are usually good at noticing only the patterns they can imagine the data might contain, which is not the case for unexpected patterns or groupings. Making learning algorithms less dependent on feature engineering would considerably increase the breadth and ease of use of machine learning, enabling speedier development of new applications and, more importantly, advancement toward Artificial Intelligence (AI) utilization in multiple industries, especially in traditional industries [1]. An AI must have a deep understanding of the environment around it, which can be accomplished if the machine can recognize and extract the underlying explanatory components inherent in low-level sensory data. Feature engineering can be combined with feature learning to produce cutting-edge results on real-world problems; the basic technique is to learn higher-level features alongside hand-crafted ones.

Feature learning is the learning of data transformations and representations that make it easier to extract useful information from data, for example when working with immunization records, tracking a patient's health history, understanding customer responses to new products, segmenting unfamiliar markets, differentiating a company brand from its competition, or repositioning a product after its market image has gone stale. In a probabilistic model, a good representation captures the distribution of the underlying explanatory variables for the observed data [2].
Among the many ways of learning feature representations, this research focuses on reviewing deep learning techniques such as Convolutional Neural Networks (CNNs) [90], Recurrent Neural Networks (RNNs), and Generative Adversarial Networks [91], which can construct more non-linear, more abstract representations. The composition of representations creates the deep architecture, with the number of layers being a free parameter that may be changed depending on the requirements of the task. We look at recent breakthroughs in this field, concentrating on issues such as finding suitable objectives for learning and computing representations (i.e., inference) and the geometric connections between feature learning and density estimation [3].

Machine learning is a subset of AI that enables a machine to learn automatically from past data without being explicitly programmed [4]. In pursuit of flexibility, machine learning researchers have studied non-parametric local learners, such as kernel machines with a fixed generic local-response kernel (for example, the Gaussian kernel). As previously shown in [5], the majority of these solutions rely on examples to map out the target function directly. While smoothness is a useful assumption, it is insufficient, because the number of variations in the target function can grow exponentially with the number of interacting components or input dimensions. It is better to construct a linear model or kernel machine [6] on top of a learned representation: this is equivalent to learning the kernel, i.e., the feature space. Kernel machines are important, but they rely on a predetermined similarity metric or feature space that allows for quick comparisons; we would like to use the data to choose suitable features as well. Unsupervised learning is a kind of machine learning that uses as little human supervision as possible to discover data groupings or hidden patterns without using labels.
Unsupervised learning, also known as self-organization, allows for the modeling of probability densities over inputs, in contrast to supervised learning, which generally uses human-labeled data [7]. Along with supervised and reinforcement learning, it is one of the three major types of machine learning. A related variant, semi-supervised learning, employs both supervised and unsupervised techniques. Two often-used unsupervised learning approaches are principal component analysis and cluster analysis.

In unsupervised learning, cluster analysis is used to group or segment data sets with shared properties in order to identify algorithmic relationships [9]. Cluster analysis is a machine learning method that groups data that has not been labeled, classified, or categorized. Rather than responding to feedback, cluster analysis looks for commonalities in the data and reacts to the presence or absence of these commonalities in each new piece of data. This approach makes it easier to find anomalous data items that do not fit into any group. Although statistical density estimation is a frequent unsupervised learning application [10], unsupervised learning also has a variety of other applications, including summarizing and interpreting data. Furthermore, because the expected output is unknown, we cannot determine how accurate the outputs are. The cluster method [8, 9] gives poor results when it comes to segmenting and targeting customers. Association mining identifies sets of items that occur together in a data collection. Basket analysis is popular with businesses because it allows analysts to rapidly find goods that are frequently bought together, which supports more successful marketing and merchandising strategies.

Unsupervised learning differs from classification and regression. The input data is not labeled (i.e., no labels or classes are given), and the algorithm learns the structure of the data independently. As a consequence, there are two key differences. First, since the data does not need to be labeled manually, we can examine vast volumes of data.
Second, whereas supervised learning employs an explicit quality measure, assessing the quality of an unsupervised approach may be challenging [10].

Principal Component Analysis (PCA), which projects data onto an orthogonal feature subspace, is one of the most fundamental, straightforward, and extensively used dimensionality reduction approaches [11]. All observations can be regarded as an ellipsoid in a subspace of the original feature space, and the new basis set in this subspace is aligned with the ellipsoid axes. Because the basis vectors are orthogonal, we can eliminate strongly correlated features. Although the dimensionality of the ellipsoid generally equals that of the original space, if the data lies in a smaller subspace, the new projection can be used to cut off the excess dimensions. We choose each ellipsoid axis in turn, in a 'greedy' manner, according to where the dispersion is largest. Dimensionality reduction is one of the most common tasks in unsupervised learning. It benefits data visualization (e.g., the t-SNE approach) and the preparation of data for supervised learning algorithms (e.g., decision trees) [12].

When analyzing a time series, some critical questions are: Is there a trend, i.e., do the measurements tend to increase or decrease on average over time? Is there seasonality, i.e., a predictable pattern of highs and lows that correlates with calendar time (seasons, quarters, months, days of the week, etc.)? Are there outliers? In regression, outliers are data points that deviate significantly from the fitted line; in time-series data, outliers depart significantly from the remainder of the data. Is there a long-run cycle or period unrelated to seasonal factors? Is the variance constant over time, or does it fluctuate? Are there abrupt changes in the level or the variance of the series?

Unlabeled data typically consists of samples of natural or man-made artifacts taken from the environment. Images, audio recordings, videos, news articles, tweets, x-rays (if making a medical app), and other unlabeled data may all be used. Unsupervised machine learning algorithms learn what is normal and then use a statistical test to determine whether a data point is abnormal. A device using this form of anomaly detection technology can identify all types of abnormalities, even those never seen before. Determining what is normal for the time series being tracked is the most difficult part of using unsupervised machine learning algorithms to identify abnormalities.

The following are the major contributions of this paper:
• We present an overview of anomaly detection and briefly describe the deep learning models used for finding anomalies in complex datasets.
• We study how unsupervised machine learning can be used for finding anomalies in various industrial and research domains.
• We explain the frameworks for anomaly detection and how anomalies can be detected efficiently by using unsupervised machine learning architectures.
2 Unsupervised Machine Learning and Deep Learning

This section covers unsupervised machine learning models as well as models and approaches for temporal connection modeling. Because the data is unlabeled, the learning process is left on its own to find structure in its input. Unsupervised learning can be used as a means to an end or as a goal in itself (discovering hidden patterns in data). In some pattern recognition systems, the training data is a collection of input vectors X without any corresponding target values. The purpose of these unsupervised learning problems might be to determine how the data is distributed in the input space, as in density estimation, or to cluster similar examples in the data. We now give a brief overview of the various machine learning models.

2.1 Restricted Boltzmann Machines

Boltzmann machines, shown in Fig. 1, are stochastic and generative neural networks that, given enough time, can learn internal representations and represent and solve complex problems [13]. The Boltzmann distribution (also known as the Gibbs distribution) is a fundamental concept in statistical mechanics that describes how entropy and temperature influence quantum states in thermodynamics [14]. Restricted Boltzmann machines (RBMs) are non-deterministic (or stochastic) deep learning models with just two types of nodes: hidden and visible nodes. Once input is supplied, they are able to capture the parameters, patterns, and correlations in the data. Consequently, they are regarded as deep generative models and fall into the class of unsupervised deep learning [15, 16]. RBMs are two-layer generative artificial neural networks that can learn the probability distribution of their input data. They are Boltzmann machines in which the connections between visible and hidden units are restricted. The first step of RBM training with multiple inputs is shown in Fig. 1. The first hidden node receives the product of the input vector and the first column of the weight matrix, to which the corresponding bias term is added [17].
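The sigmoid function is defined as

$$S(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + 1} \tag{1}$$

so, with $a$ and $b$ denoting the bias vectors of the visible and hidden layers, the equations obtained in this step are

$$h^{(1)} = S\left(v^{(0)T} W + b\right) \tag{2}$$

$$v^{(1)} = S\left(h^{(1)} W^{T} + a\right) \tag{3}$$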
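The forward and backward passes of Eqs. (1)–(3) can be illustrated with a minimal NumPy sketch. It is not taken from the reviewed works: the layer sizes, the random weights, and the names `rbm_pass`, `a` (visible bias) and `b` (hidden bias) are assumptions made for illustration, and the sketch propagates probabilities rather than sampled binary states.

```python
import numpy as np

def sigmoid(x):
    """Logistic function S(x) = 1 / (1 + exp(-x)), as in Eq. (1)."""
    return 1.0 / (1.0 + np.exp(-x))

def rbm_pass(v0, W, a, b):
    """One forward and one backward pass through an RBM (hypothetical helper).

    v0: visible vector, shape (n_visible,)
    W:  weight matrix, shape (n_visible, n_hidden)
    a:  visible-layer bias, shape (n_visible,)   (assumed naming)
    b:  hidden-layer bias, shape (n_hidden,)     (assumed naming)
    """
    h1 = sigmoid(v0 @ W + b)      # hidden activation probabilities, Eq. (2)
    v1 = sigmoid(h1 @ W.T + a)    # reconstruction of the visible layer, Eq. (3)
    return h1, v1

# Toy setup: 6 visible units, 3 hidden units, small random weights.
rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
a = np.zeros(n_visible)
b = np.zeros(n_hidden)

v0 = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
h1, v1 = rbm_pass(v0, W, a, b)
print("hidden probabilities:", h1)
print("reconstruction:      ", v1)
```

In a trained RBM the reconstruction would be compared with the input to drive the weight update; here the weights are random, so the output only demonstrates the shapes and value ranges involved.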
Here $h^{(1)}$ and $v^{(0)}$ are the vectors of the hidden and visible layers, and the superscript is the iteration index ($v^{(0)}$ denotes the input supplied to the network). The reverse phase, often known as the reconstruction phase, follows [18]. During the backward pass, we compute the probability of the output $v^{(1)}$ given the input $h^{(1)}$ and the weights $W$:

$$P\left(v^{(1)} \mid h^{(1)}; W\right) \tag{4}$$

This is referred to as generative learning, as opposed to the discriminative learning that occurs in a classification problem (mapping inputs to labels) [19].

Contrastive Divergence. Boltzmann machines are energy-based models in which a joint configuration of the visible and hidden units has the energy [20]

$$E(v, h) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij} \tag{5}$$

where $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $a_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between them. The probability that the network assigns to a visible vector is obtained by summing over all possible hidden vectors:

$$p(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)} \tag{6}$$

where $Z$ is the normalizing constant. The derivative of the log probability of a training vector with respect to a weight is

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \tag{7}$$

This leads to a very simple learning rule for stochastic ascent in the log probability of the training data, $\Delta w_{ij} = \alpha \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right)$, where $\alpha$ is a learning rate.

2.2 Autoencoders

An autoencoder is an unsupervised artificial neural network that learns how to compress and encode data efficiently and then how to reconstruct the data from the reduced encoded representation into a representation that is as close as feasible to the original input (Fig. 2) [21]. In doing so, it reduces the dimensionality of the data by learning to ignore the noise in the data. The network architecture can vary depending on whether it is a single feed-forward network, an LSTM, or another type of neural network [22, 23]. Because the encoding process relies on compressing correlated features, the approach works well when the data are correlated.
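To make the compress-then-reconstruct idea concrete, the following is a minimal sketch of a one-hidden-layer autoencoder trained by plain gradient descent in NumPy. It is an illustration under stated assumptions, not an implementation from the cited works: the synthetic correlated data, the tanh encoder, the linear decoder, the learning rate, and the names `train_autoencoder` and `reconstruct` are all hypothetical choices.

```python
import numpy as np

def train_autoencoder(X, n_hidden=2, lr=0.1, epochs=1000, seed=0):
    """Train a tiny autoencoder (tanh encoder, linear decoder) on X with MSE loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)          # encode to the bottleneck
        X_hat = H @ W2 + b2               # decode back to input space
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagation of the mean squared error.
        g_out = 2.0 * err / err.size      # dLoss / dX_hat
        gW2 = H.T @ g_out
        gb2 = g_out.sum(axis=0)
        g_h = (g_out @ W2.T) * (1.0 - H ** 2)   # tanh derivative
        gW1 = X.T @ g_h
        gb1 = g_h.sum(axis=0)
        W1 -= lr * gW1
        b1 -= lr * gb1
        W2 -= lr * gW2
        b2 -= lr * gb2
    return (W1, b1, W2, b2), losses

def reconstruct(params, X):
    """Pass X through the trained encoder and decoder."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2

# Synthetic correlated data: 6 observed features driven by 2 latent factors.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 6))
X = z @ A + 0.05 * rng.normal(size=(200, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

params, losses = train_autoencoder(X)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Because the six features are generated from two latent factors, a two-unit bottleneck can capture most of the structure, which illustrates why the approach suits correlated data.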
2.3 Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a class of artificial neural networks in which the connections between nodes form a directed or undirected graph along a temporal sequence (Fig. 3), which allows the network to exhibit temporal dynamic behavior. As a consequence, it may display a variety of temporal behaviors, such as the trajectory of states in a state space followed by the system during a certain time interval [25, 26]. RNNs are derived from feed-forward neural networks and have an internal state that enables them to process input sequences of variable length. The term RNN refers to two broad classes of networks with a similar general structure, one with a finite impulse response and the other with an infinite impulse response [24]. Applications include speech recognition and unsegmented, connected handwriting recognition [25]. Both classes of networks exhibit temporal dynamic behavior [26]. Such networks are also known as feedback neural networks [27].

The three categories of nodes are input nodes (which take data from outside the network), output nodes (which deliver results), and hidden nodes (which modify the data on its way from input to output) [28]. For supervised learning in discrete-time settings, sequences of real-valued input vectors arrive at the input nodes one vector at a time. At each time step, every non-input unit computes its current activation (outcome) as a non-linear function of the weighted sum of the activations of all units connected to it [29, 30]. Such a network might, for example, be used to play a game in which progress is measured by the number of points scored. Each sequence produces an error equal to the sum of the deviations of the target signals from the corresponding network activations. The cumulative error for a training set of distinct sequences is the total of the errors of all individual sequences [31]. An Elman network is a three-layer network with an additional set of context units (shown as x, y, and z in Fig. 3).
The intermediate (hidden) layer is linked to the context units through weighted connections [32]. As a result, the network can retain a state, allowing it to perform tasks such as sequence prediction that would be difficult for a traditional multi-layer perceptron [33].

2.4 Deep Learning

Deep learning (sometimes called deep structured learning) is a family of machine learning methods based on artificial neural networks with representation learning [34]. There are significant contrasts between artificial neural networks and biological brains: the biological brains of most living things are dynamic (plastic) and analog, while artificial neural networks tend to be static and symbolic [35–37]. A deep learning system can figure out on its own which features to place at which level. (Varying the number of layers and the layer widths, for example, can result in varying degrees of abstraction.) [38, 39]. The credit assignment path (CAP) is the chain of transformations from input to output; it describes potentially causal relationships between input and outcome. In a feed-forward neural network, the CAP depth is the number of hidden layers plus one (as the output layer is also parameterized). In a recurrent neural network, where a signal may pass through a layer several times, the CAP depth is potentially unlimited [40]. Although there is no general agreement on what separates shallow from deep learning, most researchers hold that deep learning involves a CAP depth greater than two. A CAP of depth 2 is a universal approximator in the sense that it can emulate any function [41]. Beyond that, additional layers do not add to the function-approximation ability of the network. To create deep learners, a greedy layer-by-layer strategy might be applied [42]. Deep learning helps to disentangle these abstractions and identify which features improve performance [43]. Deep structures that can be trained without supervision include neural history compressors [44] and deep belief networks [45].

A convolutional neural network (CNN or ConvNet) is a deep neural network often used in deep learning for image processing [56]. Because of its shared-weight design and translation invariance, it is also called a shift-invariant or space-invariant artificial neural network (SIANN) [89, 46, 47]. Convolutional layers convolve the input and pass the result to the next layer, which is comparable to how a cell in the visual cortex responds to a specific stimulus [48]. The convolution operation reduces the number of free parameters in a network, allowing deeper networks to be built with fewer parameters [49].

Fig. 1. Training an RBM with multiple inputs

Fig. 2. De-noising of image
Fig. 3. Recurrent neural network
Pooling layers reduce the dimensions of the data by combining the outputs of a cluster of neurons in one layer into a single neuron in the next layer. Local pooling combines small clusters, typically of size 2 × 2, whereas global pooling acts on all neurons of the layer; pooling may compute a maximum or an average [50, 51]. In max pooling [52], the largest value of each neuron cluster in the previous layer is chosen [53]. In average pooling [54], the average value of each neuron cluster in the previous layer is chosen. Convolution is applied many times over the input, considering the value of a single pixel together with the values of the pixels surrounding it [55]. The memory footprint is reduced because all receptive fields that use a given filter share a single bias and a single weight vector, instead of each receptive field having its own bias and weight vector [56].

Learning from temporal coherence in sequential data such as audio and video provides a natural and plentiful source of data that appears to be a biologically more plausible signal than those used in most current machine learning tasks [57]. The Hidden Markov Model (HMM) shown in Fig. 4 [58] is a statistical Markov model that assumes the modeled system is a Markov process with unobserved (i.e., hidden) states. In the figure, the Markov process is represented by the connections between the hidden states.
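The difference between the max-pooling and average-pooling operations described earlier in this section can be shown with a short NumPy sketch. The function name `pool2x2` and the 4 × 4 example input are assumptions chosen for illustration; the sketch handles only non-overlapping 2 × 2 clusters on a single-channel input.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """Non-overlapping 2 x 2 pooling of a 2-D array (hypothetical helper)."""
    h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split the array into 2 x 2 blocks: axes 1 and 3 index positions inside a block.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))     # largest value of each cluster
    if mode == "average":
        return blocks.mean(axis=(1, 3))    # average value of each cluster
    raise ValueError("mode must be 'max' or 'average'")

x = np.array([[ 1,  2,  5,  6],
              [ 3,  4,  7,  8],
              [ 9, 10, 13, 14],
              [11, 12, 15, 16]], dtype=float)

print(pool2x2(x, "max"))      # [[ 4.  8.] [12. 16.]]
print(pool2x2(x, "average"))  # [[ 2.5  6.5] [10.5 14.5]]
```

Both variants halve each spatial dimension; they differ only in how the four values of a cluster are summarized.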
3 Unsupervised Machine Learning Frameworks for Anomaly Detection

Anomaly detection identifies rare events, e.g., observations or items that differ significantly from standard behaviors or patterns. Standard deviations, outliers, noise, novelty, and exceptions are all terms used to describe data anomalies. In this section, we look at some common anomaly detection problems in various spheres and at the models that have been used in the literature to tackle them. We explore mostly industrial applications that are in demand, where the proposed frameworks help to prevent potential accidents and economic losses by detecting anomalies in time.

3.1 Unsupervised Machine Learning for Anomaly Detection in Electrical Substation Circuits

Cyber-physical systems (CPS) allow assets to monitor, track, and interact with one another so that a physical network, such as a smart grid, can function safely and effectively. According to the literature, CPS intrusion detection systems (CPS IDS) should detect attacks not only in host audit logs and network traffic but also in physical-plane measurements of equipment at various locations. Physical constraints can be used to cope with atypical conditions in distributed agreement algorithms in power grids and in voltage and current control [59]. The detector might be built directly into a hybrid CPS IDS with CPS-specific physical constraints. The current literature provides preliminary findings and an alternative classification technique for normal, fault, and attack states in a smart distribution substation; this approach is used as part of a CPS IDS.

The current works use RTDS to simulate the electric distribution system at the substation and to collect data for machine learning. The feature vector comprises RTDS-generated, time-aligned voltage and current magnitudes for each of the three phases at four separate locations, for a total of 24 features. A time-varying load represents a typical load profile for a residential customer in the simulated circuit. The 24-h load profile is replicated in five 24-s simulation time intervals, or five compressed days. The feeder's real determined delivery substation price decides the rate at which loads are consumed. In machine learning, the samples are typically divided into training and evaluation data sets.
Fig. 4. A hidden markov model

Because the system learns from the training set, claims about generalization must rest exclusively on testing the trained method against the validation set. Although the process is non-deterministic and event traces have start and end transients that make assigning samples to an event difficult, around 15 samples per fault trace and 30 samples per injection attack trace are used for scoring detections and false alarms. If the method locates 29 samples out of a (nominal) 30-sample attack trace, a 96.66% detection rate is claimed [60–62]. The technique allows choosing the learning rate, the pattern match quality, the creation of new pattern classes, and the merging of similar learned patterns.

The technique employs both outer and inner learning loops. On each pass through the inner loop, a sample pattern (a measurement vector after feature normalization) is supplied to the classifier. The outer loop alters the learning rates and criteria and merges learned classes that are similar. The samples are transformed because individual features are in different units (volts and amperes) and vary in magnitude. Consequently, each feature (column of the matrix) is normalized by subtracting its mean and dividing by its standard deviation. Next, the row mean is subtracted from each row (a time sample of the matrix), which centers the sample around zero and reduces the impact of the load curve. Finally, a squashing function is applied, with S equal to 1.0 in the study. Following the usual nomenclature, the collection of learned patterns is called the self-organizing map (SOM) [63]. A pattern class is pruned from the SOM if, at the end of the inner loop, it has won fewer data patterns than a pruning threshold. At the end of each outer loop pass, similar patterns are sought (effectively, the SOM is run through the SOM). Patterns that match other patterns according to the match criteria may be merged using an average weighted by the number of samples each pattern has won. Except for the last iteration, the reported results reflect the pruning and pattern merging of the outer loop.

The study contained 31,250 time-aligned measurement patterns, the first n of which were used for training and the remainder for testing and validation. Training sets are created by varying the number of training samples, so that they contain two faults, all three faults, or all three faults and one injection attack. The first 5000, 5500, or 6000 samples are chosen to demonstrate how the machine learning algorithm can distinguish between unique occurrences that are or are not in the training set. Faults at relays 91 and 92 can be found in samples 1–5000, a fault at relay 93 in samples 5001–5500, and an attack at relay 90 in samples 5500–6000. Six actual faults and eight injection attacks were picked at random from the remaining samples [64]. Standardization should be utilized for all steady-state operation samples. Normal operation phases, non-malicious faults (compatible with KCL/KVL), and false measurement injection should all be distinguishable by the classifier. During the training phase, a pattern that corresponds to an injection should be learned as a class that does not match any normal or non-malicious fault pattern [65]. During the validation process, a pattern matching an injection should belong to neither the normal nor the non-malicious fault pattern classes.

Seven patterns are identified by the classifier. The normal samples are learned as one class, whereas the F91 and F92 cases are learned as separate pattern classes, as in previous results. F93 is divided into two pattern classes, with 14 samples in the center of the trace
learned as one pattern and the samples at the start and end learned separately. The attack trace A90 is divided into two pattern classes: one with 26 samples and the other with three samples at the start and end. Sample 17135 comes from the F92 event and is a single false alarm sample. As shown in the n = 5500 experiment, samples for different occurrences of attack A93 match the learned pattern for fault F93, lowering attack detection performance. As in the previous experiment, 21 samples of the A93 event near sample 24411 are classified as anomalies yet have very high scores, whereas the A93 trace near sample 26233 is missed completely. This run has a false alarm rate of less than 0.1% and a detection rate of 71.11% of samples or 83.33% of traces, with all missed detections occurring at position 93. These findings show that including an attack trace in the training phase has no influence on the results or detection performance, implying that attack-free training data is not required.

3.2 Unsupervised Machine Learning System for the Detection of Cyber Based Attacks in Smart Grids

Smart grid technologies increase the reliability, security, and efficiency of the electrical system. However, their reliance on digital communication technology creates new vulnerabilities that must be handled for power delivery to be efficient and dependable [54]. The research argues that anomalies can be recognized without supervision by using statistical correlations between measurements. The goal of the current unsupervised machine learning algorithms in this domain is to develop a scalable anomaly detection engine for smart grids that can distinguish between an actual fault, a disturbance, and a sophisticated cyber-attack. The proposed method employs symbolic dynamic filtering (SDF) to reduce processing requirements while revealing causal linkages across subsystems [59].
Simulation results on the IEEE 39-, 118-, and 2848-bus systems confirm the performance of the proposed technique under various operating conditions. According to the data, the method is 99% accurate, with a true positive rate of 98% and a false positive rate of less than 2% [66]. The key contributions of the work are: a mechanism for identifying unobservable attacks in smart grids without the need for labeled data sets; SDF-based data reduction as an approach for reducing computational effort; and an effective DBN-based learning model [67].

The authors provide a model-free approach to integrating smart grids into hierarchical and topological networks; the smart grid is seen as a multi-agent system in Fig. 5. These agents include a generator, a measuring unit, a distributed control agent, and an energy storage system that may inject or absorb real power [67]. The system may exist in two kinds of states: dynamic and static. The system state (X) comprises both the dynamic state of the generator (e.g., rotor speed, rotor angle) and the static state of the network (voltage magnitude and phase angle). The nonlinear measurement function is denoted by h, and the nonlinear dynamic behavior of the generators by f(). The letters u and z stand for the performance and measurement vectors, respectively. This research aims to better understand and predict the behavior of the intelligent power grid in order to identify anomalies. The fourth-order (two-axis) model of the generator is shown in Fig. 6. SDF, DBN, and RBM are used to provide a computationally efficient approach for detecting subsystem linkages [81].

This is based on the notion that the attacker has limited resources and can only use a restricted set of tactics. This is a reasonable assumption, since it is hard to imagine that all sensors in a power network deliver false data. Moreover, changing all measurements takes a long time and costs attackers a lot of money. It would also be difficult for an outsider to comprehend the software. Consequently, the attacker has only a rudimentary grasp of the architecture and security measures of the system. This information may be collected by statistical analysis of the data sent to the control center by remote terminal units (RTU) or by physically obtaining the security information of a node.

This section shows how to train a detection system using DBN modeling, MI-based feature selection, and RBM. The unsupervised DBN model records system behavior patterns, while DBN and MI are evaluated on smart grid test systems with massive amounts of data [68]. The system is first partitioned into subsystems. SDF is then used to determine the causal relationships underlying the nominal subsystem characteristics. Instead of dealing with the whole system at once, the recommended technique is a computationally efficient tool that saves time and cost by (1) choosing a subset of measurements through feature selection and SDF and (2) applying domain decomposition and parallel data processing to the selected subsystems. The research develops an anomaly detection tool that uses a feature extraction method and time series partitioning to identify causal links between subsystems in real time and with little processing cost. DBN and Boltzmann machine based learning approaches uncover unobservable attacks using free energy as an anomaly index. The performance of the recommended method was evaluated with several measures (TPR, FPR, and ACC) using a variety of IEEE test systems and operating settings. According to the reported statistics, the method has an accuracy of 99%, a TPR of 98%, and an FPR of less than 2% [69].
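How free energy can serve as an anomaly index is sketched below for a binary RBM whose energy has the standard form $E(v,h) = -a^{T}v - b^{T}h - v^{T}Wh$. The closed form $F(v) = -a^{T}v - \sum_j \log\left(1 + e^{\,b_j + (v^{T}W)_j}\right)$ follows from summing out the hidden units; the random weights, the helper names, and the percentile-based threshold are assumptions for illustration only and do not reproduce the detector of the cited work, where the weights would come from a model trained on nominal data.

```python
import itertools
import numpy as np

def energy(v, h, W, a, b):
    """Energy of a joint configuration (v, h) of a binary RBM."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

def free_energy(v, W, a, b):
    """Closed-form free energy F(v) = -a.v - sum_j log(1 + exp(b_j + (vW)_j))."""
    return -(a @ v) - np.sum(np.logaddexp(0.0, b + v @ W))

def free_energy_bruteforce(v, W, a, b):
    """F(v) = -log sum_h exp(-E(v, h)), enumerating all hidden states (small RBMs only)."""
    n_hidden = W.shape[1]
    terms = [-energy(v, np.array(h, dtype=float), W, a, b)
             for h in itertools.product([0, 1], repeat=n_hidden)]
    return -np.logaddexp.reduce(terms)

# Stand-in parameters; in practice these are learned from nominal measurements.
rng = np.random.default_rng(0)
n_visible, n_hidden = 5, 3
W = rng.normal(scale=0.5, size=(n_visible, n_hidden))
a = rng.normal(scale=0.1, size=n_visible)
b = rng.normal(scale=0.1, size=n_hidden)

v = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
print(free_energy(v, W, a, b), free_energy_bruteforce(v, W, a, b))

# Hypothetical decision rule: flag samples whose free energy exceeds a high
# percentile of the free energies observed on nominal data.
nominal = rng.integers(0, 2, size=(200, n_visible)).astype(float)
scores = np.array([free_energy(x, W, a, b) for x in nominal])
threshold = np.percentile(scores, 99)
is_anomalous = free_energy(v, W, a, b) > threshold
```

A low free energy corresponds to a high model probability, so inputs that a trained model considers unlikely receive a high free energy and exceed the threshold.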
To verify the efficacy of the proposed technique, four scenarios are investigated: 1) no attack, 2) random attack, 3) a single false data injection (FDI) attack on line 6–31, and 4) multiple simultaneous FDI attacks on lines 6–31 and 11–12. The proposed technique is compared to the largest normalized residual (LNR) and Chi-Square tests, the two most often used bad data detection (BDD) approaches.
Fig. 5. Unsupervised machine learning framework for detection of cyber-attacks [54]
To reduce false positives due to noise, the threshold is set at 3 and the standard deviation is set at …, resulting in an FPR of less than 1% [44]. The threshold for all detectors has been standardised to allow an accurate and broad comparison. The LNR test uses the same method to establish the threshold. Because a random attack is unintelligent, it leaves a trail in the data sets, informing the operator that an attack has taken place. The random anomaly data in the measurement set causes significant changes in the measurement residual vector, resulting in an increase of the cost function. We look at the cost function based on the data residual in optimum state estimation. Under normal functioning, when no anomaly data is present in the system, the cost function follows a normal distribution with zero mean. In a random attack, the cost function exceeds the optimum state estimation threshold. As a result, both the LNR and the Chi-Square tests raise an alert.

In the face of single or multiple FDI attacks, the cost function for both the LNR and the Chi-Square detectors stayed within the range of the predefined thresholds, resulting in normalized residual levels that were lower than the specified threshold, which makes it impossible to detect the attack in the system. The output of the suggested detector, however, exceeds the threshold with the identical setup, resulting in an alert. The LNR and Chi-Square tests rely on the residual of the measurement vector, whereas cyber-attacks are designed to leave no trace in the residual vector. All of the case studies had the same outcomes. The average detection time in all case studies was 1 ms, with a variation of 0.2 ms. Any FDI attack on a line or on the system architecture leads, in general, to the same network alterations, with slight variations. As a result, the suggested method can detect FDI attacks coming from a range of sources. The success rate of the suggested system is also independent of the attack scenario, since it analyses patterns in both compromised and regular data.

The methodologies for identifying smart grid anomalies discussed in the literature are mostly machine learning approaches with limitations in dealing with constantly changing cyber threats. Using a feature extraction approach and time series partitioning, the reviewed work presents a real-time and computationally efficient anomaly identification tool that identifies causal relationships across subsystems.
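The reason why residual-based tests miss well-designed FDI attacks can be reproduced with a small linear example. The sketch below assumes a linearized measurement model $z = Hx + e$ with weighted least-squares state estimation; the matrix sizes, the noise level, and the attack vectors are hypothetical, and the cost $J = r^{T}R^{-1}r$ is the quantity a Chi-Square test compares with its threshold. An attack of the form $a = Hc$ lies in the column space of $H$ and therefore leaves the residual unchanged, whereas a random gross error does not.

```python
import numpy as np

def wls_residual(z, H, R_inv):
    """Residual r = z - H x_hat of the weighted least-squares state estimate."""
    G = H.T @ R_inv @ H                              # gain matrix
    x_hat = np.linalg.solve(G, H.T @ R_inv @ z)      # estimated state
    return z - H @ x_hat

def cost(r, R_inv):
    """Weighted sum of squared residuals used by the Chi-Square test."""
    return float(r @ R_inv @ r)

# Hypothetical system: 8 measurements, 3 state variables, small sensor noise.
rng = np.random.default_rng(1)
m, n = 8, 3
H = rng.normal(size=(m, n))
sigma = 0.01
R_inv = np.eye(m) / sigma**2
x_true = rng.normal(size=n)
z = H @ x_true + rng.normal(scale=sigma, size=m)

# 1) No attack.
J_clean = cost(wls_residual(z, H, R_inv), R_inv)

# 2) Random attack: a gross error on a single measurement.
a_random = np.zeros(m)
a_random[2] = 1.0
J_random = cost(wls_residual(z + a_random, H, R_inv), R_inv)

# 3) Stealthy FDI attack: a = H c shifts the estimated state by c
#    but leaves the residual, and hence the cost, unchanged.
c = rng.normal(size=n)
J_fdi = cost(wls_residual(z + H @ c, H, R_inv), R_inv)

print("clean:", J_clean, "random:", J_random, "stealthy FDI:", J_fdi)
```

In practice the alarm threshold for $J$ is taken from a Chi-Square distribution with $m - n$ degrees of freedom; the sketch omits it because the comparison between the three costs already shows the effect.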
Hidden attacks are detected using a deep belief network (DBN) trained with Boltzmann-machine-based learning algorithms, with free energy serving as the anomaly index. The performance of the suggested approach was examined on a variety of IEEE test systems and under a variety of operating conditions in terms of true positive rate (TPR), false positive rate (FPR), and accuracy (ACC). According to the reported numbers, the system achieves 99% accuracy, a 98% TPR, and an FPR of less than 2%.
3.3 Unsupervised Machine Learning for Anomaly Detection in Network Centric Architecture Based on IoT
Critical infrastructure networks should be designed so that cyber attacks can be prevented. Patches and antivirus software updates, for example, have failed to protect IoT applications from security flaws. The authors of [70] propose a behavior-learning approach for detecting sensitive situations. They demonstrated that unsupervised machine learning can identify different forms of attack in real time by exploiting the predictability of TCP traffic in IoT devices. The machine learning classifier can distinguish between normal and abnormal traffic based on a small number of features. The approach can be incorporated into a larger network to identify compromised endpoints despite IP spoofing, so that SDN-based mechanisms can block attack traffic close to the source. In identifying new and unexpected attacks, unsupervised ML classifiers were around 15% more accurate than supervised ML techniques [70]. The research shows that ML models can learn from IoT network data by exploiting reconstruction errors. Previously, these methods were primarily used in the security
A Review of Unsupervised Machine Learning Frameworks
industry for feature selection. The unsupervised machine learning classification system was built to detect SYN floods and slow HTTP attacks in IoT networks with near-perfect accuracy and minimal latency. The study compared unsupervised classifiers based on deep learning (autoencoder) and on statistical analysis (PCA). It also demonstrated that both of these unsupervised ML classifiers outperformed a supervised classifier (SVM) on new and unexpected threats. The research shows how the solution could be incorporated into a broader network to identify compromised endpoints in the face of IP spoofing and to block attack traffic near the source using SDN-based procedures. The retraining techniques used to keep the classifiers current and to cope with network abnormalities are shown in Fig. 6 [71, 72]. Three types of data sets are gathered from three separate hosts in the simulated network: Type A is a benign data set, Type B consists of packets captured during a SYN attack, and Type C consists of packets captured during a slow HTTP attack. The attack epoch for Type B data sets is generally 40–60%, while the attack epoch for Type C data sets is 70%. Type A data sets are used to train two unsupervised classifiers per host (one using an autoencoder and the other using PCA). Type B data sets are used to train the supervised SVM-based classifiers. Both unsupervised classifiers are then tested on Type B and Type C data sets to see whether the attacks can be detected. Python scientific libraries and TensorFlow are used to create the machine learning models. The middle layer of the autoencoder is the bottleneck layer. ReLU is used as the activation function in every layer except the output layer. The mean squared error loss is minimized using the Adam optimizer, and the models are trained with a batch size of 32 over 100 epochs. The loss function contains no regularization terms.
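The reconstruction-error idea shared by the autoencoder and PCA classifiers can be sketched in a few lines. This is a simplified stand-in rather than the reviewed system: it uses a one-component PCA computed by power iteration in plain Python instead of TensorFlow or a library PCA, the two traffic features are hypothetical, and the decision rule (mean plus three standard deviations of the benign reconstruction error) is an assumption made for illustration.

```python
import math
import random


def fit_pca_1d(rows, iters=50):
    """Return (mean, top principal direction) using power iteration on the covariance."""
    n, d = len(rows), len(rows[0])
    mu = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(r, mu)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mu, v


def reconstruction_error(x, mu, v):
    """Squared distance between x and its projection onto the principal direction."""
    c = [xi - mi for xi, mi in zip(x, mu)]
    score = sum(ci * vi for ci, vi in zip(c, v))
    return sum((ci - score * vi) ** 2 for ci, vi in zip(c, v))


rng = random.Random(0)
# Hypothetical benign traffic: the second feature tracks the first (y ~ 2x).
benign = []
for _ in range(200):
    x = rng.uniform(0.0, 1.0)
    benign.append([x, 2.0 * x + rng.gauss(0.0, 0.01)])

# Train on benign data only, then derive a threshold from the benign errors.
mu, v = fit_pca_1d(benign)
errors = [reconstruction_error(r, mu, v) for r in benign]
mean_err = sum(errors) / len(errors)
std_err = math.sqrt(sum((e - mean_err) ** 2 for e in errors) / len(errors))
threshold = mean_err + 3.0 * std_err  # assumed decision rule


def is_anomalous(x):
    """Flag a sample whose reconstruction error exceeds the benign threshold."""
    return reconstruction_error(x, mu, v) > threshold


print(is_anomalous([0.3, 0.6]), is_anomalous([0.5, 0.0]))
```

A sample that breaks the learned relationship between the features is reconstructed poorly and is flagged, even though no attack traffic was seen during training.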
The library’s default learning rate and weight decay settings are used. The layer sizes are as follows: the input and output layers have seven units, the bottleneck layer has four, and the remaining layers have fourteen. The scikit-learn library is used for PCA with default parameters and for the SVM, where a polynomial kernel outperformed other kernels (e.g., linear, RBF). When trained on benign traffic data, autoencoder-based classifiers can predict network activity, identify irregularities, and detect attacks on the industrial Internet [73]. The unsupervised classifiers outperform supervised ML classifiers when it comes to detecting new and previously unknown threats. Techniques are also developed for identifying compromised sources under IP spoofing. The authors intend to broaden the scope of this initial investigation in the future, which requires expanding the number and types of IoT devices used to collect data. Although the examined flows were limited to TCP, other protocols and threats are anticipated in future work. Comparisons with other unsupervised methods, such as one-class classification and clustering, should still be made, since the focus of this study was on retraining approaches and source identification. Another area that needs further investigation is the assessment of source behavior. Tuning the settings used during classifier training may also enhance attack detection. Unsupervised classifiers such as autoencoders and PCA work well on Type B test data sets, which contain attack traffic, after being trained on Type A benign data sets. The results indicate that the supervised SVM classifier performs well when it is trained on Type B data sets (known attack). On Type C data sets, the unsupervised autoencoder and PCA classifiers continue to perform well, while the supervised
SVM classifier shows a considerable reduction in performance. These findings demonstrate that the autoencoder-based classifier, which was trained only on benign traffic data, can recognize a broad spectrum of DDoS attacks. They also suggest that the SVM-based supervised classifier is incapable of classifying unknown attacks for which it has not been trained.
3.4 Unsupervised Anomaly for the Detection and Diagnosis in Multivariate Time Series Data
Many real-world systems, including power plants and wearable devices [88], generate multivariate time series data rapidly. The purpose of multivariate time series anomaly detection and diagnosis is to find out what is wrong at certain time steps and why it is happening. Developing such a system is challenging, since it must capture the temporal dependency within each time series as well as the correlations between different pairs of time series. The system should also be robust to noise and provide operators with different levels of anomaly severity depending on the incident. While various unsupervised anomaly detection algorithms have been created, only a handful are capable of addressing all of these problems simultaneously. The authors propose the Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) for detecting and diagnosing anomalies in multivariate time series data. MSCRED first produces multi-scale (resolution) signature matrices to characterize multiple levels of system status at different time steps. The inter-sensor (time series) correlations are then encoded using a convolutional encoder, and temporal patterns are captured using an attention-based convolutional Long Short-Term Memory (ConvLSTM) network [74]. Finally, a decoder reconstructs the input signature matrices from feature maps that include correlation and temporal information, and the residual signature matrices are used to detect and diagnose anomalies.
According to a detailed empirical assessment on synthetic data and on data from real-world power plants, MSCRED outperforms state-of-the-art baseline approaches; its framework is shown in Fig. 7. MSCRED uses a convolutional recurrent encoder-decoder to address the issues mentioned above [75]. It creates multi-scale (resolution) signature matrices to characterize different levels of system status across different time steps. In practice, the different levels of system status indicate the severity of unexpected incidents. The inter-sensor (time series) correlation patterns are then encoded using a convolutional encoder, and the temporal patterns are captured using an attention-based ConvLSTM network [76]. A decoder then uses the feature maps, which hold both temporal and correlation information, to reconstruct the signature matrices, and the residual signature matrices are used to detect and diagnose anomalies. The underlying idea is that if MSCRED has never seen a similar system state before, it will not reconstruct the signature matrices well. Anomaly detection, root cause identification, and anomaly duration are the three main tasks in anomaly detection and diagnosis [54]. In contrast to previous research, which addressed each problem independently, the method tackles all of these issues simultaneously. The authors combine a convolutional encoder for inter-sensor correlations, attention-based ConvLSTM networks for temporal pattern integration, and a decoder for signature matrix reconstruction. MSCRED is the only model which uses multivariate
time series correlations to detect anomalies and achieves all three goals simultaneously. According to the reported results, MSCRED outperforms state-of-the-art baseline approaches [77].
Fig. 6. Training of model and epoch classification based on the reconstruction error [58].
In this work, the MSCRED signature matrices include s = 3 channels for capturing system status at different scales. To determine the severity of an anomaly, three anomaly scores, MSCRED(S), MSCRED(M), and MSCRED(L), are computed from the residual signature matrices of the small, medium, and large channels, with segment sizes w = 10, 30, and 60, respectively. The scores are then assessed on three kinds of anomalies, short, medium, and long, which last 10, 30, and 60 s, respectively. MSCRED(S) detects all kinds of anomalies, while MSCRED(M) detects anomalies of medium or long duration. MSCRED(L), on the other hand, is only capable of detecting long-duration anomalies. The three anomaly scores are therefore used together to determine the severity of an anomaly. An anomaly is more likely to be long-lasting if it is observed in all three channels; otherwise, it may be a one-off or short-term occurrence. In the reported case, MSCRED(S) finds all five anomalies: three short-duration anomalies, one medium-duration anomaly, and one long-duration anomaly. MSCRED(M) misses two short-duration anomalies, whereas MSCRED(L) identifies just the one long-duration anomaly. The outcomes of the root cause identification are also shown for the residual signature matrices of four injected anomalous events. In this case, more than half of the anomaly root causes can be clearly identified (shown by red rectangles in the rows and columns).
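A multi-scale signature matrix of the kind described above can be sketched as follows. The sketch assumes that each entry is the inner product of two sensors' most recent w samples divided by w; the exact windowing and scaling are assumptions, the helper names are hypothetical, and the two sensor series are invented for illustration.

```python
def signature_matrix(series, t, w):
    """Pairwise inner products of the last w samples ending at time t, scaled by w."""
    segments = [s[t - w + 1 : t + 1] for s in series]
    n = len(series)
    return [[sum(a * b for a, b in zip(segments[i], segments[j])) / w
             for j in range(n)] for i in range(n)]


def multi_scale_signatures(series, t, windows=(10, 30, 60)):
    """One n x n matrix per window size, i.e. the s = 3 channels."""
    return {w: signature_matrix(series, t, w) for w in windows}


# Two hypothetical sensors: sensor 1 changes level during the last 10 steps.
sensor_0 = [1.0] * 60
sensor_1 = [2.0] * 50 + [4.0] * 10
channels = multi_scale_signatures([sensor_0, sensor_1], t=59)
for w, matrix in channels.items():
    print(w, matrix[0][1])
```

With these inputs the short change dominates the w = 10 channel and is diluted in the w = 30 and w = 60 channels, which is consistent with the small channel reacting to anomalies of every duration.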
Fig. 7. Framework of the proposed model: (a) signature matrices are encoded via fully convolutional neural networks; (b) temporal patterns are modelled by attention-based convolutional LSTM networks; (c) signature matrices are decoded via deconvolutional neural networks; (d) loss function [59].
3.5 Unsupervised Machine Learning for Anomaly Detection in Unmanned Aerial Vehicles
To address a variety of resource and latency restrictions, a real-time anomaly detection system needs a steady supply of labeled and operational data. Most solutions rely on fixed rules that vary with the circumstances, whereas traditional approaches to the problem rely on well-defined features and supervised historical experience, as shown in Fig. 8. These rules work well in controlled conditions, but they cannot be used outside of known cases, since they rely on a large amount of data to detect anomalies. The existing literature is examined to find known and unknown anomalous events and to improve decision-making [78]. The value of the isolation forest in engineering applications is evaluated using the Aero-Propulsion System Simulation, where it outperforms other unsupervised distance-based approaches. The authors also used an unmanned aerial vehicle to demonstrate the system on a second platform and to conduct real-time testing. Because of the curse of dimensionality, the most widely used detection algorithms depend on distance measures that can be inaccurate in high-dimensional settings. Because these systems are not built specifically to detect anomalies, false alerts or alarms can be issued. Dimensionality reduction is common, with PCA and autoencoders used to reduce the data set; nevertheless, real-time solutions are difficult due to the computational cost. The existing literature also summarizes current developments in aircraft anomaly detection systems. In most anomaly detection systems, the isolation forest is an out-of-the-box solution for dealing with dimensionality and correlation challenges that does not need expert knowledge in the setup
process. Over time, decision tree methods have grown to encompass a range of more powerful algorithms, such as random forests and isolation forests, under the banner of ensemble techniques. The former builds many decision trees on different subsets of the features and the training data, and combines the results of the individual trees by voting for the best prediction. As a result, predictor instability is significantly reduced. The latter goes a step further by analyzing data in high-dimensional space: it constructs random binary trees that partition the input space, on the assumption that anomalous behavior falls into easily separated regions [79]. Isolating anomalous data from ordinary data should therefore be easier with the isolation forest. The input space is recursively partitioned until each data point has its own leaf inside the tree. The statistic that matters is the depth at which a data point is isolated in the tree (the number of splits needed to reach the sample from the root). The procedure is repeated to build many trees, and an anomaly, if present, is isolated with just a few splits. The path length to an anomaly is often substantially shorter than the path length to ordinary data, because anomalies are rare and different. Finally, the path length is normalized and averaged over all of the trees to obtain the anomaly score. Data points with a low average tree depth are isolated with fewer splits and therefore receive a larger anomaly score, so a higher score indicates an anomaly. During scoring, each data point is passed down every isolation tree in the forest until it reaches a terminal node or the maximum depth. The depth option controls the granularity of the anomaly score. The split attribute is used to divide the tree, while the tree.height property gives the height of the node. The split point is the value at which the data are divided in two.
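The path-length scoring described above can be sketched with the standard library alone. This is a minimal illustration, not the reviewed implementation: instead of storing trees, it follows only the branch that contains the query point, it uses the usual normalization 2^(-E[h]/c(n)) from the isolation forest literature, and the subsample size of 64, the seeds, and the Gaussian test data are assumptions. The 100 trees match the number quoted in the text.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful search in a binary tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def path_length(x, data, rng, depth, limit):
    """Depth at which x is isolated by random axis-parallel splits of `data`."""
    if depth >= limit or len(data) <= 1:
        return depth + c_factor(len(data))
    dim = rng.randrange(len(x))
    lo = min(p[dim] for p in data)
    hi = max(p[dim] for p in data)
    if lo == hi:
        return depth + c_factor(len(data))
    split = rng.uniform(lo, hi)
    # Follow only the branch that x falls into; the other branch is never needed.
    same_side = [p for p in data if (p[dim] < split) == (x[dim] < split)]
    return path_length(x, same_side, rng, depth + 1, limit)


def anomaly_score(x, data, n_trees=100, sample_size=64, seed=0):
    """Score in (0, 1]; shorter average paths give scores closer to 1."""
    rng = random.Random(seed)
    limit = math.ceil(math.log2(sample_size))
    total = 0.0
    for _ in range(n_trees):
        sample = rng.sample(data, sample_size)
        total += path_length(x, sample, rng, 0, limit)
    return 2.0 ** (-(total / n_trees) / c_factor(sample_size))


data_rng = random.Random(1)
normal_data = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(500)]
score_inlier = anomaly_score([0.0, 0.0], normal_data)
score_outlier = anomaly_score([6.0, 6.0], normal_data)
print(score_inlier, score_outlier)
```

The point far from the bulk of the data is separated after fewer splits, so its average path is shorter and its score is higher than that of the point at the center.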
The left child subtree covers one half of the partitioned space, while the right child subtree covers the other half. A data window is employed during online operation to send data samples to the server. The number of data points needed to study the data determines the global window [80]. This method may also be used to evaluate multivariate data. The rest of the computations follow the standard procedure for calculating scores. Although the method works effectively with high-dimensional data, correlation analysis and sorting of the data are still required before searching for anomalies. In this feature selection step, correlation coefficients are used to examine and consolidate the relationships between variables. For evaluation purposes, a pair of synthetic anomalies was constructed at t = 5000 and t = 10000. The anomalies are produced by randomly altering the mean and variance of Gaussian distributions [2]. A hundred trees were trained in the isolation forest. The results are depicted together with the distribution of scores assigned to anomalous and non-anomalous data. The bulk of the normal data is scored between 0.6 and 0.7, while the anomalous observations score above 0.75. However, locating the anomaly remains difficult. The approach is useful not just for labeled data; it can also be used to issue warnings when the score surpasses a certain threshold, such as the 95th percentile. The results that follow cover all nine features that affect system behavior during UAV takeoff and hovering. Many of the alarms are caused by intermittent sensor connectivity issues. A PCA projection onto three principal components shows the points tagged as anomalies. According to the data, alarms were raised between 189 and 206 s and between 325 and 345 s. It is impossible to discern why an
Fig. 8. Simulation of commercial modular aero-propulsion dataset [68].
event was unusual just by looking at the score at that time. As a consequence, the analyst does not know where to start the investigation. To address this issue, the variables were analyzed and grouped together so that anomalies can be located within each group of variables. This implies that the algorithm has to be run separately (and in parallel) on each group in order to discover anomalies in the incoming data. Although this seems to be the best strategy for making a real-time decision, the authors decided to carry out an offline post-analysis by statistically examining the anomalous occurrences in the data using violin plots, a visually attractive technique. This approach can also be used to rank potentially anomalous variables by their spread and skewness, as well as by the number of points outside the min/max quartile range. The most variable readings are gyro readings 4, 5, and 6 and variable 9, with variable 9 contributing the largest variation to the isolation forest score. When the video from the UAV experiment was analyzed at these points, it was determined that the system was attempting to regain height after losing it. Although this strategy cannot guarantee a definitive diagnosis of a problem’s root cause, it gives a better understanding of the possibilities and therefore narrows down the search.
3.6 Unsupervised Machine Learning Algorithm for Anomaly Detection in Real-Time Video Surveillance
The need for enhanced real-time video surveillance has risen due to rapid urbanization and autonomous industrial settings. Recent advances in artificial intelligence for anomaly detection in video surveillance address these difficulties but largely disregard the evolving nature of anomalous behavior. Other issues are the sparse evaluation based on reconstruction error and the dependence on training with known normal behavior.
To address the constraints and limits of real-time video surveillance anomaly detection and localization, the authors propose the Incremental Spatio-Temporal Learner (ISTL). ISTL is unsupervised
deep learning that uses active learning with fuzzy aggregation to continuously update its model and to distinguish between new anomalies and normal behavior as they emerge over time. The accuracy, robustness, computational overhead, and contextual indicators of ISTL are demonstrated and assessed using three benchmark data sets. These findings support the contribution and the technology’s potential for real-time video monitoring. A deep learning model for online anomaly detection and localization learns typical behavior patterns from video surveillance data. To adapt swiftly to changing or previously unknown normal behaviors, rapid aggregation of active learning outcomes in the continuous learning cycle is essential. The video surveillance stream is analyzed using two criteria, an anomaly threshold and a temporal threshold, rather than making an arbitrary judgment based only on reconstruction errors. The Avenue data set of the Chinese University of Hong Kong (CUHK) [81] and the UCSD Pedestrian data sets [82] (Pedestrian 1 and 2) are used as video surveillance benchmarks to demonstrate and assess the essential components of ISTL, as shown in Fig. 9 and Fig. 10. Each frame is resized to 224 by 224 pixels, and pixel values are normalized to the range 0 to 1. Based on the frame rate of the training data (26 FPS), a temporal cuboid depth of T = 8 is used, which spans roughly one-third of a second. T is selected so that enough motion is captured across consecutive frames, while the depth of the input cuboids remains small enough for the deep learning model to converge. When the input surveillance data has a lower frame rate, longer movements may be captured with limited temporal depth. In this work, deep learning and active learning are combined into a new approach for identifying spatio-temporal anomalies in real-time video surveillance.
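The two-criteria decision described above can be sketched as a simple run-length filter. The sketch assumes that an event is reported only when the reconstruction error stays above the anomaly threshold for at least a temporal-threshold number of consecutive cuboids; the function name, the threshold values, and the error sequence are hypothetical.

```python
def detect_events(errors, anomaly_threshold, temporal_threshold):
    """Return (start, end) index pairs of runs where the error exceeds
    anomaly_threshold for at least temporal_threshold consecutive cuboids."""
    events = []
    run_start = None
    # A sentinel below any threshold closes a run that reaches the end of the stream.
    for i, err in enumerate(list(errors) + [float("-inf")]):
        if err > anomaly_threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= temporal_threshold:
                events.append((run_start, i - 1))
            run_start = None
    return events


# Hypothetical per-cuboid reconstruction errors.
stream = [0.1, 0.9, 0.1, 0.8, 0.9, 0.95, 0.2]
events = detect_events(stream, anomaly_threshold=0.5, temporal_threshold=3)
print(events)
```

The isolated spike at index 1 is ignored because it does not persist, while the sustained run from index 3 to index 5 is reported, which is the effect of adding a temporal criterion to a plain reconstruction-error threshold.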
The methodology addressed three significant challenges: detecting abnormal behavior in video surveillance streams while managing high-dimensional data streams in real time, formulating anomaly detection as the learning of normal behavior, and adapting to dynamically evolving normal behavior using fuzzy aggregation and active learning. The proposed ISTL method used an autoencoder model with convolution layers to learn spatial regularities and ConvLSTM layers to learn temporal regularities while keeping the video stream’s spatial structure. Input from human observers is dynamically integrated into a continuous active learning process to address the issues associated with ISTL. According to the results on the three data sets, the suggested approach is accurate, resilient, computationally inexpensive, and incorporates contextual indicators, suggesting that it is suitable for use in industrial and urban contexts. A Gaussian mixture model was used in this experiment. A related line of work concerns change detection (CD) between heterogeneous images. The first family of techniques is parametric and uses mixtures of multivariate distributions to model the dependencies between two images acquired separately. The third family of approaches aims to estimate the relationship between the heterogeneous images before classifying and detecting changes between the two images using similarity measures that are invariant across imaging modalities (such as correlation and mutual information). The purpose of the anomaly-based CD problem is to find (typically rare) variations in ground characteristics across two heterogeneous images collected at the same location using two different imaging modalities. It is a binary classification task in which (small) local spatial variations are probable signs of something that has changed over time in the region of interest, so that changes may be detected as anomalies (i.e., as differing data seen through two different imaging modalities) [83]. In contrast, the test
phase retains the ability to recognize the minority class, i.e., the rare events of the change class, as anomalies [58]. The main features of the model include learning a compressed representation in the least-squares sense, reducing reconstruction errors in residual space for the two imaging modalities, and estimating the reconstruction error of any bi-temporal input pattern, taken from a local gray-level set, as an accurate anomaly score. This score is then used to differentiate between unchanged patterns and abnormal patterns created by a change (change detection). The authors suggest learning a stacked, sparse neural network model which may be trained in stages and serves as a good representation for improving the anomalous pattern-based model. They also recommend employing a stacked sparse autoencoder, which may find interesting structures in image data and offers an unsupervised reconstruction framework made up of many sparse layers [76]. It makes it possible to build a reliable anomaly-based CD model for identifying unusual and irregular properties with a small margin of error. Cross-modality features are learned using deep learning methods. This model includes several interesting features. A stacked sparse autoencoder with satlin and purelin transfer functions is used to learn and infer an efficient latent representation of the common visual patterns in the images. By encoding and decoding the input pairs through its stacked hidden layers, the anomaly-based CD model learns the regular image patterns (belonging to the unchanged class), and the change class is then distinguished from these regular patterns by its irregular feature patterns in the residual space. The results show a qualitative assessment of the anomaly locations.
In the UCSD Ped 1 dataset, ISTL finds anomalies such as bicycles and vehicles on the walkways, pedestrians crossing the walkways, crowds loitering, and people pushing trolleys. The false negatives in the Ped 1 dataset involved skateboarding: only 10 of the 12 test video clips featuring skateboarders were detected by the ISTL model. In the Ped 2 dataset, all video frames containing skateboarding were detected. The camera viewpoint in the Ped 1 dataset explains this, since the height of the camera makes distinguishing between pedestrians and skateboarders difficult. The anomalies in the UCSD Ped 2 test samples include bicyclists, vehicles, and pedestrians moving in the opposite direction. Cyclist anomalies were the most prevalent in the Ped 2 test samples, occurring in 11 of the 12 cases. The anomalies in the CUHK Avenue dataset include an abandoned bag, a person throwing a bag, a small child playing in the surveillance area, people walking in the wrong direction, and people running. To demonstrate ISTL’s active learning capability, pedestrian walkway scenarios from the UCSD Ped 1 and Ped 2 datasets were explored. In this scenario, cycling on pedestrian walkways was treated as a normal activity, so all anomaly detections from test samples containing cyclists were marked as normal. To continue training the ISTL model with human observer verification, four samples are selected from each of the Ped 1 and Ped 2 datasets. After this training phase, anomalies are evaluated on the test samples, excluding the four samples that were chosen for further training. Two test samples involving cyclists were still identified as anomalous during the analysis of the Ped 1 dataset because the cyclists were crossing the walkway. Two test cases that had previously been detected as anomalous are used to further explore the effectiveness of the active learning technique: 1) on a pedestrian walkway, a cyclist pedaling alone; 2) on
a pedestrian walkway, a cyclist riding beside a vehicle. Test video A was judged to be normal after the evaluation, whereas test video B was found to be anomalous. Video B was ruled anomalous due to the moving car, whereas video C was deemed normal. The real-time video surveillance capability of the anomaly detection technique, as well as the computational overhead of the sequential process of anomaly detection and localization, were also assessed. The average time taken to detect and localize anomalies is 37 ms. At a frame rate of roughly 27 frames per second, ISTL has shown the capacity to identify anomalies in video surveillance feeds in real time. Although frames are resized for anomaly detection, localization relies on the original frame resolution, so differences in the original resolution account for differences in processing time between datasets. In these tests, ISTL was applied to video surveillance in a sequential manner. Detection and localization can instead be parallelized, lowering the run time and allowing for higher FPS rates.
Fig. 9. Proposed framework for anomaly detection in real-time video surveillance [75].
Fig. 10. Localized anomalies in (a) the UCSD Ped 1 dataset, (b) the UCSD Ped 2 dataset, and (c) the CUHK Avenue dataset. Best viewed in color [75].
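The timing figures quoted for ISTL can be checked with a short calculation. The sketch below only restates the arithmetic behind the numbers given in the text (37 ms per frame, T = 8 frames at 26 FPS); the function name is hypothetical.

```python
def frames_per_second(latency_ms):
    """Sequential throughput if every frame takes latency_ms to process."""
    return 1000.0 / latency_ms


fps = frames_per_second(37.0)   # average detection and localization time from the text
cuboid_seconds = 8 / 26.0       # temporal cuboid of T = 8 frames at 26 FPS
print(round(fps, 1), round(cuboid_seconds, 2))
```

A per-frame latency of 37 ms corresponds to about 27 frames per second, and a cuboid of eight frames at 26 FPS covers roughly one-third of a second.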
3.7 Unsupervised Machine Learning Approach for Anomaly Detection in Hyperspectral Imaging
Due to its high-dimensional, redundant data and restricted ranges, hyperspectral image (HSI) anomaly detection faces various challenges, such as the lack of a common standard for the manufacturing of hyperspectral sensors, insufficient labeled data for training, the high volume of produced data, and
the high cost of satellites and hyperspectral technologies. To overcome these issues, the authors propose a novel unsupervised feature representation technique based on a spectral constraint strategy in an adversarial autoencoder (AAE) that requires no prior information. To improve the discrimination of the hidden nodes, they developed the spectral-constrained AAE (SC AAE), a method based on HSI characteristics. The method [79] incorporates a spectral angle distance into the AAE’s loss function to enforce spectral consistency. Because the hidden nodes contribute to anomaly detection to different degrees, they are fused using an adaptive weighting approach. The background (BKG) is suppressed using a two-layer design while the anomalous features are retained. According to the testing results, the proposed method outperforms the existing methods. For the first time, one of the generative models, the AAE, is introduced for this task. A spectral constraint (SC AAE) approach is suggested to guarantee that the deep-layer hidden nodes appropriately characterize both the anomalies and the BKG, given the anomalous and BKG pixels in the original feature space, as shown in Fig. 11.
Fig. 11. Proposed SC_AAE-based anomaly detection method described in HSI [79].
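The spectral angle distance mentioned above measures the angle between two spectra and is insensitive to their overall scale. The sketch below shows the distance and one hedged way it could enter a reconstruction loss; the additive form and the value of `weight` are assumptions for illustration and are not taken from [79].

```python
import math


def spectral_angle(x, y):
    """Angle in radians between two spectral vectors (0 means identical shape)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)


def constrained_loss(x, x_hat, weight=0.1):
    """Mean squared error plus a weighted spectral angle term (assumed form)."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    return mse + weight * spectral_angle(x, x_hat)


pixel = [0.2, 0.5, 0.9]
print(spectral_angle(pixel, [0.4, 1.0, 1.8]))   # same spectral shape, different scale
print(constrained_loss(pixel, [0.9, 0.5, 0.2]))  # reversed spectrum
```

Two spectra that differ only in brightness have an angle close to zero, so the extra term penalizes reconstructions that distort the spectral shape rather than the intensity.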
Because each hidden node contributes to anomaly detection differently, the hidden nodes are fused using an adaptive weighting approach. BKG suppression, in addition to feature representation, is critical for success in anomaly detection, since it is an effective way of maximizing the distance between the anomalies and the BKG. The fused node is used to create a two-layer design that suppresses the BKG while preserving anomalous characteristics. The study claims four main contributions, including (1) an SC AAE anomaly detection framework that prioritizes detection while limiting false alarm rates, and (2) a WGAN-GP-based SC AAE that performs the spectral mapping from a high-dimensional spectral input vector to low-dimensional representations. The method also devises a bi-layer architecture that suppresses the BKG while enhancing anomalous properties. The proposed SC AAE-based anomaly detection technique is divided into
four phases: The proposed SC AAE is used to represent features, under the assumption that anomalies are embedded in a locally smooth BKG. BKG suppression on the first map uses node fusion and the constructed non-linear function. An HSI with L spectral bands and N pixels is represented by Y. Y = [y1, y2, …, yN] may be expressed as a set of L-dimensional vectors, one per pixel. Y = [Y1, Y2, …, YL] may also be seen as a collection of L 2D images. The authors propose SC AAE, a robust feature representation technique for anomaly detection that captures discriminative fundamental properties. The suggested SC AAE technique fully leverages spectral information and adequately reflects the properties of a high-dimensional spectral vector by using a spectral constraint loss. The hidden nodes are fused to help in the detection process. A two-layer technique based on hidden node fusion is developed to suppress BKG variation while preserving anomalies. By taking advantage of the considerable difference between BKG and anomaly, the suggested strategy outperforms current techniques. The study also compared the benefits of AAE against AE for identifying anomalies. Additional testing in the real world demonstrates that the proposed SC AAE anomaly detection technique applies to a diverse set of scenarios. The method may reveal anomalies in certain bands that would otherwise go unnoticed, which indicates that it is especially promising for monitoring and safety management. The authors also intend to add spatial information to the SC AAE in the future. The higher the AUC value of (PD, PF) and the lower the AUC value of (PF, τ), the better the detection. The AUC values of (PD, PF) and (PF, τ) for the test HSIs are shown. (PD, PF) has an optimal AUC of 1, while (PF, τ) has an optimal AUC of 0.
The results show that in all instances the proposed technique and the STGF method are close to the ideal value, demonstrating that both approaches maintain detection capability (0.993251 and 0.997928 on average for SC AAE and STGF, respectively). SC AAE detects more anomalies than SC AE (0.977680 on average), which supports the advantage of AAE in hyperspectral anomaly detection. As previously indicated, the AUC of $(P_F, \tau)$ measures the effectiveness of BKG suppression. The proposed strategy yields the lowest AUC values on average, showing that it suppresses the BKG effectively: the AUC of $(P_F, \tau)$ for SC AAE is 0.013242, much lower than that of the second and third best methods, SC AE (0.021113) and STGF (0.038077). Although STGF and SC AAE have similar detection accuracy, STGF has around 2.87 times the false alarm rate of SC AAE (0.038077 versus 0.013242). The proposed strategy therefore reduces false alarms while preserving detection. Furthermore, although SC AE's performance is usually consistent, its AUC values are not the best. The results show a strong correlation between detection maps and AUC values, from which we may conclude that the proposed technique is capable of detecting HSI anomalies.
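The two AUC measures discussed above can be computed from a detector's output as sketched below. This is a minimal illustration and not the evaluation code of the reviewed study: it assumes that $P_D$ is the detection probability, $P_F$ the false alarm probability, and $\tau$ a threshold swept over detection scores normalized to [0, 1]. The function names and the toy scores are hypothetical.

```python
def rates(scores, labels, tau):
    """Return (PD, PF) when every pixel with score >= tau is flagged.

    Assumes labels are 1 (anomaly) or 0 (background) and that both
    classes are present.
    """
    n_anom = sum(labels)
    n_bkg = len(labels) - n_anom
    hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau)
    false_alarms = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau)
    return hits / n_anom, false_alarms / n_bkg


def trapezoid(xs, ys):
    """Trapezoidal-rule area under the points (xs[i], ys[i])."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def auc_pair(scores, labels, steps=100):
    """Return (AUC of (PD, PF), AUC of (PF, tau)) for scores in [0, 1]."""
    taus = [i / steps for i in range(steps + 1)]
    pd_pf = [rates(scores, labels, t) for t in taus]
    # ROC curve: PD as a function of PF, with the usual (0, 0) and (1, 1) ends.
    roc = sorted([(0.0, 0.0), (1.0, 1.0)] + [(pf, pd) for pd, pf in pd_pf])
    auc_pd_pf = trapezoid([p[0] for p in roc], [p[1] for p in roc])
    # Background suppression: PF as a function of the threshold tau.
    auc_pf_tau = trapezoid(taus, [pf for _, pf in pd_pf])
    return auc_pd_pf, auc_pf_tau


# Toy example: two anomalous pixels score higher than four background pixels.
toy_scores = [0.9, 0.8, 0.2, 0.1, 0.05, 0.3]
toy_labels = [1, 1, 0, 0, 0, 0]
detection_auc, suppression_auc = auc_pair(toy_scores, toy_labels)
print(detection_auc, suppression_auc)
```

In this toy case every anomaly outranks every background pixel, so the first value reaches its optimum, while the second stays above 0 because the background scores are not exactly zero.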
4 Future Directions

Because anomaly detection often involves huge amounts of data, understanding the difficult problem of detecting anomalies in evolving data streams [25] is essential. Handling
data streams with limited memory and time, updating the model as data arrive, and retaining data dynamically so that fundamental changes are captured and recognized are all examples of outlier detection challenges [29]. Evolving-data algorithms are those that adjust their configuration and parameters over time in response to new data. Unlike with static data, detection methods have difficulty adapting to dynamic environments such as the ever-changing IoT domain [64]. In addition, the great majority of existing systems are inadequate at detecting anomalies in data streams and offer only basic capabilities [15]. In the IoT data stream environment, which is known for its continuously changing characteristics, anomaly detection accuracy is poor and the false-positive rate is high [43]. The dynamic data stream is therefore a problem that must be handled in IoT anomaly detection [24, 65].

Anomaly detection on a feature-evolving data source is another stumbling block. Not only do the data and their quality change over time; data dimensions also appear and disappear over time. Outlier detection in IoT systems where sensors alternately turn on and off (changing the number of dimensions) [31] is an interesting topic with many applications. Windowing also reduces accuracy, because data are processed over short, fixed time intervals [59]. Because most available approaches employ fixed interval timing, identifying the appropriate frequency for retraining the models is also a challenge [59, 66].

Ensemble approaches are well recognized for improving anomaly detection in terms of detection accuracy and running time [41]. Ensemble deviation detection is another promising area of study, with the potential to greatly improve detection accuracy. More specialized models are suggested for areas that remain unexplored.
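The combination of bounded memory, update-on-arrival processing, and ensembles described above can be sketched as follows. This is an illustrative toy and not a method from the cited works: each detector keeps only a fixed-size window of recent values and flags a point by its z-score, and several detectors with different window sizes vote. The class name, the window sizes, the threshold of 3.0, and the warm-up of five samples are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingWindowDetector:
    """Z-score detector that stores at most `window_size` recent values."""

    def __init__(self, window_size, threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window_size)  # bounded memory
        self.threshold = threshold
        self.min_samples = min_samples           # warm-up before flagging

    def update(self, x):
        """Score x against the current window, then add x to the window."""
        flagged = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            flagged = sigma > 0 and abs(x - mu) / sigma > self.threshold
        self.window.append(x)  # the oldest value is dropped automatically
        return flagged


def ensemble_update(detectors, x):
    """Update every detector with x and return the majority vote."""
    votes = [d.update(x) for d in detectors]
    return sum(votes) > len(detectors) / 2


# Hypothetical sensor stream: a stable pattern with one spike at index 20.
stream = [10.0, 10.1, 9.9, 10.2, 9.8] * 4 + [25.0, 10.0]
detectors = [SlidingWindowDetector(w) for w in (5, 10, 20)]
flags = [ensemble_update(detectors, x) for x in stream]
print([i for i, f in enumerate(flags) if f])
```

Using several window sizes is one simple way to avoid committing to a single fixed interval; a real deployment would also need a retraining or forgetting policy, which this sketch leaves out.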
Preliminary ensemble studies are advised for finding anomalies in the data stream setting. However, this field of research is largely untapped, necessitating the construction of more complete models.

Many IoT anomaly detection challenges remain to be solved. Because anomalies occur rarely, the availability of labeled data is a major barrier in IoT anomaly detection. Obtaining real system data is likewise time-consuming [19]. There is a significant gap between formalizing the acquisition of knowledge logs and sensor data flows, developing a model, and testing it in real-world settings. In the evaluations, many tests were carried out, but most of them concerned the system's normal operation [19]. The most advanced systems are trained on typical behavior, and anything that deviates from the norm is considered anomalous. To deal with complex datasets from real-world scenarios, more precise and reliable procedures are necessary.

Furthermore, the availability of a good public dataset is often a critical factor when training and evaluating real-time anomaly detection algorithms [68]. To keep up with new forms of anomalous behavior, such datasets must include a broad range of new normal and anomalous behaviors, and they must be appropriately labeled and regularly updated. The great majority of anomaly detection datasets are mislabeled, lack attack variety, and are unsuitable for real-time detection [69]. New datasets for anomaly detection require a realistic context with a range of normal and abnormal events. Furthermore, when a new anomaly detection system is evaluated, ground truth that includes the anomalies must be produced in order to increase the dataset's trustworthiness. Data complexity, which includes imbalanced datasets, unexpected
noise, and data redundancy [40], is one of the most challenging difficulties in creating an anomaly detection algorithm. Well-developed methodologies for curating datasets are essential for extracting meaningful information and knowledge. Choosing an acceptable set of model parameters for anomaly detection is hampered by the fact that IoT data streams are often generated in non-stationary settings, with no prior knowledge of the data distribution [25]. The visualization of anomaly analysis also reveals a gap: new methodologies and solutions are required for visual system analysis. As a result, the shortcomings of anomaly detection approaches must be investigated [8].

Light, temperature, humidity, noise, electric current, voltage, and power are just a few of the quantities that IoT sensors and devices report in their data streams [28]. Such data streams demand fast processing in order to handle urgent and severe circumstances, such as patient monitoring and environmental safety monitoring. With a large number of connected devices, a common data processing infrastructure that handles billions of incoming events per day may be required [71]. The daily inflow of vast amounts of data is a defining characteristic of data streams and necessitates real-time algorithm execution. However, since accuracy and time complexity typically trade off against each other, the time complexity of identifying anomalies is a major concern [14, 72, 73]. Although learning algorithms can identify and categorize anomalous behavior in real time, they must be tuned to increase accuracy, for example by lowering the false-positive rate, particularly in large-scale sensor networks. Because many algorithms lose efficiency when dealing with large amounts of data, scalability is another important requirement for anomaly detection systems. When dealing with high-dimensional data, most existing data stream techniques for anomaly detection lose their efficacy [25].
As a result, current models will need to be adapted in order to identify outliers more consistently and efficiently. When the data have a large number of features, a group of outliers may appear in only a small subset of dimensions at any given time. Such a group of outliers can appear normal with respect to other subsets of dimensions and/or time periods. The large number of variables makes it harder for anomaly detection algorithms to discover the most essential data characteristics [37]. As a result, feature reduction is necessary to select the most significant attributes for characterizing the whole dataset.
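As a concrete and deliberately simple instance of feature reduction, the sketch below drops features whose variance is nearly zero. Variance thresholding is chosen here only for illustration; the section does not prescribe a specific technique, and the function name, the threshold, and the sample readings are hypothetical.

```python
from statistics import pvariance


def select_features(rows, min_variance=1e-3):
    """Keep only the columns whose population variance exceeds min_variance.

    rows: list of equal-length lists, one list of feature values per sample.
    Returns the kept column indices and the reduced rows.
    """
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if pvariance(col) > min_variance]
    reduced = [[row[j] for j in keep] for row in rows]
    return keep, reduced


# Hypothetical sensor readings: column 1 is constant, column 2 barely moves.
readings = [
    [1.0, 5.0, 0.10],
    [2.0, 5.0, 0.10],
    [3.0, 5.0, 0.11],
    [4.0, 5.0, 0.10],
]
kept, reduced = select_features(readings)
print(kept, reduced)
```

A near-constant feature is not always uninformative for anomaly detection, since a rare jump in it may itself be the anomaly, so a threshold like this should be chosen with the application in mind.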
5 Conclusions

As a wide range of industries grow more automated (e.g., industrial warehousing [84], textile industries [85], human resources activities [87], and supply chains in general [86]) and connectivity technologies advance, a wide range of systems are generating massive amounts of data. This huge amount of data has driven the development of principal-indicator methods for modeling the state of entire systems. The principal indicators are used to prevent potential accidents and economic losses by detecting anomalies and outliers as signs of possible near-future equipment failures, system crashes, human errors, etc. In the anomaly detection field, multivariate time series data are considered an especially complex case because of the simultaneous consideration of temporal
dependencies and cross-relationships between variables. Deep learning methods are especially adept at detecting anomalies and constructing unsupervised representations of large-scale data sequences. The great majority of them, however, focus on a specific use case and require domain knowledge to develop. Because of the long-standing interest in anomaly detection in time-series data, we briefly reviewed various traditional approaches and identified significant issues in this domain. This work has explored anomaly detection in the time series context and explained the popular frameworks used in real-world applications. The need for unsupervised deep learning-based time series anomaly detection persists as system complexity grows, while the refined data and labels available for analysis remain insufficient. Finally, we also described how to select an appropriate model and training strategy for deep learning-based anomaly detection.
References

1. Ghoreishi, M., Ari, H.: Key enablers for deploying artificial intelligence for circular economy embracing sustainable product design: three case studies. In: AIP Conference Proceedings, vol. 2233, issue 1. AIP Publishing LLC (2020)
2. Yuwono, M., Moulton, B.D., Su, S.W., Celler, B.G., Nguyen, H.T.: Unsupervised machine-learning method for improving the performance of ambulatory fall-detection systems. BioMed. Eng. OnLine 11(1), 1–11 (2012)
3. Inoue, J., Yamagata, Y., Chen, Y., Poskitt, C.M., Sun, J.: Anomaly detection for a water treatment system using unsupervised machine learning. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE (2017)
4. Jaeger, S., Fulle, S., Turk, S.: Mol2vec: unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58(1), 27–35 (2018)
5. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: A reinforcement learning algorithm for automated detection of skin lesions. Appl. Sci. 11(20), 9367 (2021)
6. Usmani, U.A., Roy, A., Watada, J., Jaafar, J., Aziz, I.A.: Enhanced reinforcement learning model for extraction of objects in complex imaging. In: Arai, K. (ed.) Intelligent Computing. LNNS, vol. 283, pp. 946–964. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-80119-9_63
7. Hu, W., Rajiv, R.P.S., Richard, T.S.: Discovering phases, phase transitions, and crossovers through unsupervised machine learning: a critical examination. Phys. Rev. E 95(6), 062122 (2017)
8. Cai, Y., Guan, K., Peng, J., Wang, S., Seifert, C., Wardlow, B., Li, Z.: A high-performance and in-season classification system of field-level crop types using time-series Landsat data and a machine learning approach. Remote Sens. Environ. 210, 35–47 (2018)
9. Ayodele, T.O.: Types of machine learning algorithms. New Adv. Mach. Learn. 3, 19–48 (2010)
10. Chandola, V., Mithal, V., Kumar, V.: Comparative evaluation of anomaly detection techniques for sequence data. In: 2008 Eighth IEEE International Conference on Data Mining. IEEE (2008)
11. Lane, T., Brodley, C.E.: Temporal sequence learning and data reduction for anomaly detection. ACM Trans. Inf. Syst. Secur. (TISSEC) 2(3), 295–331 (1999)
12. Eskin, E.: Anomaly detection over noisy data using learned probability distributions. In: Proceedings of the International Conference on Machine Learning, pp. 255–262. Morgan Kaufmann (2000)
13. Shon, T., Kim, Y., Lee, C., Moon, J.: A machine learning framework for network anomaly detection using SVM and GA. In: Proceedings from the Sixth Annual IEEE SMC Information Assurance Workshop. IEEE (2005)
14. Lane, T.D.: Machine Learning Techniques for the Computer Security Domain of Anomaly Detection. Purdue University (2000)
15. Lane, T., Brodley, C.E.: An application of machine learning to anomaly detection. In: Proceedings of the 20th National Information Systems Security Conference, vol. 377. Baltimore, USA (1997)
16. Shon, T., Moon, J.: A hybrid machine learning approach to network anomaly detection. Inf. Sci. 177(18), 3799–3821 (2007)
17. Usmani, U.A., Haron, N.S., Jaafar, J.: A natural language processing approach to mine online reviews using topic modelling. In: Chaubey, N., Parikh, S., Amin, K. (eds.) COMS2 2021. CCIS, vol. 1416, pp. 82–98. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-76776-1_6
18. Usama, M., et al.: Unsupervised machine learning for networking: techniques, applications and research challenges. IEEE Access 7, 65579–65615 (2019)
19. Maruhashi, K., Guo, F., Faloutsos, C.: MultiAspectForensics: pattern mining on large-scale heterogeneous networks with tensor analysis. In: 2011 International Conference on Advances in Social Networks Analysis and Mining. IEEE (2011)
20. Goernitz, N., Kloft, M., Rieck, K., Brefeld, U.: Toward supervised anomaly detection. J. Artif. Intell. Res. 46, 235–262 (2013)
21. Usmani, U.A., Watada, J., Jaafar, J., Aziz, I.A., Roy, A.: Particle swarm optimization with deep learning for human action recognition. Int. J. Innovative Comput. Inform. Control 17(6), 1843–1870 (2021)
22. Choudhary, T., Bhuyan, M.K., Sharma, L.N.: Orthogonal subspace projection based framework to extract heart cycles from SCG signal. Biomed. Signal. Process. Control 50, 45–51 (2019)
23. Zhang, Q., Yang, Y., Ma, H., Wu, Y.N.: Interpreting CNNs via decision trees. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019)
24. Yi, X., Zhou, H., Zhang, Z., Xiong, S., Yang, K.: X-rays-optimized delivery of radiolabeled albumin for cancer theranostics. Biomaterials 233, 119764 (2020)
25. Jaitly, N., Hinton, G.: Learning a better representation of speech soundwaves using restricted Boltzmann machines. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE (2011)
26. Zhang, N., Ding, S., Zhang, J., Xue, Y.: An overview on restricted Boltzmann machines. Neurocomputing 275, 1186–1199 (2018)
27. Salakhutdinov, R., Mnih, A., Hinton, G.: Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning (2007)
28. Papa, J.P., Rosa, G.H., Marana, A.N., Scheirer, W., Cox, D.D.: Model selection for discriminative restricted Boltzmann machines through meta-heuristic techniques. J. Comput. Sci. 9, 14–18 (2015)
29. Tanaka, M., Okutomi, M.: A novel inference of a restricted Boltzmann machine. In: 2014 22nd International Conference on Pattern Recognition. IEEE (2014)
30. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B.: Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015)
31. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th International Conference on Machine Learning (2008)
32. Burda, Y., Grosse, R., Salakhutdinov, R.: Importance weighted autoencoders. arXiv preprint arXiv:1509.00519 (2015)
33. Zaremba, W., Sutskever, I., Vinyals, O.: Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014)
34. Gregor, K., Danihelka, I., Graves, A., Jimenez Rezende, D., Wierstra, D.: DRAW: a recurrent neural network for image generation. In: International Conference on Machine Learning. PMLR (2015)
35. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
36. Sun, Y., Liu, Y., Wang, G., Zhang, H.: Deep learning for plant identification in natural environment. Comput. Intell. Neurosci. 2017, 1–6 (2017)
37. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: ICML 2011: Proceedings of the 28th International Conference on Machine Learning, June 2011, pp. 689–696, Bellevue, Washington, USA (2011)
38. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
39. Guo, X., Chen, L., Shen, C.: Hierarchical adaptive deep convolution neural network and its application to bearing fault diagnosis. Measurement 93, 490–502 (2016)
40. Lo, S.-C.B., Chan, H.-P., Lin, J.-S., Li, H., Freedman, M.T., Mun, S.K.: Artificial convolution neural network for medical image pattern recognition. Neural Networks 8(7–8), 1201–1214 (1995)
41. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 346–361. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10578-9_23
42. Boureau, Y.-L., Ponce, J., LeCun, Y.: A theoretical analysis of feature pooling in visual recognition. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10) (2010)
43. Mundlak, Y.: On the pooling of time series and cross section data. Econometrica 46(1), 69–85 (1978)
44. Sahiner, B., Chan, H.-P., Petrick, N., Wei, D., Helvie, M.A., Adler, D.D., Goodsitt, M.M.: Classification of mass and normal breast tissue: a convolution neural network classifier with spatial domain and texture images. IEEE Trans. Med. Imaging 15(5), 598–610 (1996)
45. Mishkin, D., Sergievskiy, N., Matas, J.: Systematic evaluation of convolution neural network advances on the ImageNet. Comput. Vis. Image Underst. 161, 11–19 (2017)
46. Traore, B.B., Kamsu-Foguem, B., Tangara, F.: Deep convolution neural network for image recognition. Ecol. Inform. 48, 257–268 (2018)
47. Jianqiang, Z., Xiaolin, G., Xuejun, Z.: Deep convolution neural networks for Twitter sentiment analysis. IEEE Access 6, 23253–23260 (2018)
48. Sonnhammer, E.L.L., Von Heijne, G., Krogh, A.: A hidden Markov model for predicting transmembrane helices in protein sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6, 175–182 (1998)
49. Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E.L.L.: Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305(3), 567–580 (2001)
50. Bahl, L., Brown, P., de Souza, P., Mercer, R.: Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: ICASSP'86. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 11, pp. 49–52. IEEE (1986)
51. Valdes, A., Macwan, R., Backes, M.: Anomaly detection in electrical substation circuits via unsupervised machine learning. In: 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI). IEEE (2016)
52. Mohd, R.Z.A., Zuhairi, M.F., Shadil, A.Z.A., Dao, H.: Anomaly-based NIDS: a review of machine learning methods on malware detection. In: 2016 International Conference on Information and Communication Technology (ICICTM), pp. 266–270. IEEE (2016)
53. Dhivyaprabha, T.T., Subashini, P., Krishnaveni, M.: Computational intelligence based machine learning methods for rule-based reasoning in computer vision applications. In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE (2016)
54. Karimipour, H., Dehghantanha, A., Parizi, R.M., Choo, K.-K.R., Leung, H.: A deep and scalable unsupervised machine learning system for cyber-attack detection in large-scale smart grids. IEEE Access 7, 80778–80788 (2019)
55. Mohammadi, S., Mirvaziri, H., Ghazizadeh-Ahsaee, M., Karimipour, H.: Cyber intrusion detection by combined feature selection algorithm. J. Inf. Secur. Appl. 44, 80–88 (2019)
56. Sakhnini, J., Karimipour, H., Dehghantanha, A.: Smart grid cyber attacks detection using supervised learning and heuristic feature selection. In: 2019 IEEE 7th International Conference on Smart Energy Grid Engineering (SEGE). IEEE (2019)
57. Karimipour, H., Dinavahi, V.: Robust massively parallel dynamic state estimation of power systems against cyber-attack. IEEE Access 6, 2984–2995 (2017)
58. Bhatia, R., Benno, S., Esteban, J., Lakshman, T.V., Grogan, J.: Unsupervised machine learning for network-centric anomaly detection in IoT. In: Proceedings of the 3rd ACM CoNEXT Workshop on Big Data, Machine Learning and Artificial Intelligence for Data Communication Networks (Big-DAMA'19), pp. 42–48. Association for Computing Machinery, New York, NY, USA (2019)
59. Zhang, C., et al.: A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. Proc. AAAI Conf. Artif. Intell. 33, 1409–1416 (2019)
60. Song, D., Xia, N., Cheng, W., Chen, H., Tao, D.: Deep r-th root of rank supervised joint binary embedding for multivariate time series retrieval. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018)
61. Su, Y., Zhao, Y., Niu, C., Liu, R., Sun, W., Pei, D.: Robust anomaly detection for multivariate time series through stochastic recurrent neural network. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019)
62. Tariq, S., et al.: Detecting anomalies in space using multivariate convolutional LSTM with mixtures of probabilistic PCA. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019)
63. Yuguang, F., Peng, C., Gomez, F., Narazaki, Y., Spencer, B.F.: Sensor fault management techniques for wireless smart sensor networks in structural health monitoring. Struct. Control Health Monit. 26(7), e2362 (2019)
64. Tsai, F.-K., Chen, C.-C., Chen, T.-F., Lin, T.-J.: Sensor abnormal detection and recovery using machine learning for IoT sensing systems. In: 2019 IEEE 6th International Conference on Industrial Engineering and Applications (ICIEA). IEEE (2019)
65. Struye, J., Latré, S.: Hierarchical temporal memory and recurrent neural networks for time series prediction: an empirical validation and reduction to multilayer perceptrons. Neurocomputing 396, 291–301 (2020)
66. Le, T.A., Nguyen, H., Zhang, H.: EvalSVC: an evaluation platform for scalable video coding transmission. In: IEEE International Symposium on Consumer Electronics (ISCE 2010). IEEE (2010)
67. Bandaragoda, T., et al.: Artificial intelligence based commuter behaviour profiling framework using Internet of things for real-time decision-making. Neural Comput. Appl. 32(20), 16057–16071 (2020)
68. Khan, S., Liew, C.F., Yairi, T., McWilliam, R.: Unsupervised anomaly detection in unmanned aerial vehicles. Appl. Soft Comput. 83, 105650 (2019)
69. Albusac, J., Vallejo, D., Jimenez-Linares, L., Castro-Schez, J.J., Rodriguez-Benitez, L.: Intelligent surveillance based on normality analysis to detect abnormal behaviors. Int. J. Patt. Recogn. Artif. Intell. 23(07), 1223–1244 (2009)
70. Kind, A., Stoecklin, M., Dimitropoulos, X.: Histogram-based traffic anomaly detection. IEEE Trans. Netw. Serv. Manag. 6(2), 110–121 (2009)
71. Huang, J., Li, J., Yu, D., Deng, L., Gong, Y.: Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 7304–7308. IEEE (2013)
72. Shimamura, Y.: Mixed text and image data processing. US Patent 5,204,946, 20 Apr 1993
73. Walker, J.R., Marmora Jr., A.J., Cheek, R.D.: Filtering method to reduce pixel density. US Patent 6,707,572, 16 Mar 2004
74. Takahashi, K.: Print device capable of printing a format sheet in which items about a print device and a document processor are completed. US Patent 5,502,796, 26 Mar 1996
75. Nawaratne, R., Alahakoon, D., De Silva, D., Xinghuo, Y.: Spatiotemporal anomaly detection using deep learning for real-time video surveillance. IEEE Trans. Ind. Inf. 16(1), 393–402 (2020)
76. Zhao, W., Shihong, D.: Spectral–spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach. IEEE Trans. Geosci. Remote Sens. 54(8), 4544–4554 (2016)
77. Li, W., Abtahi, F., Zhu, Z.: A deep feature based multi-kernel learning approach for video emotion recognition. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction (2015)
78. Touati, R., Mignotte, M., Dahmane, M.: Anomaly feature learning for unsupervised change detection in heterogeneous images: a deep sparse residual model. IEEE J. Sel. Top. Appl. Earth Observations Remote Sensing 13, 588–600 (2020)
79. Wen, L., Gao, L., Li, X.: A new deep transfer learning based on sparse auto-encoder for fault diagnosis. IEEE Trans. Syst. Man Cybern. Syst. 49(1), 136–144 (2017)
80. Lee, H., Battle, A., Raina, R., Ng, A.: Efficient sparse coding algorithms. In: Schölkopf, B., Platt, J., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19. MIT Press (2006)
81. Xie, W., Lei, J., Liu, B., Li, Y., Jia, X.: Spectral constraint adversarial autoencoders approach to feature representation in hyperspectral anomaly detection. Neural Netw. 119, 222–234 (2019)
82. Masci, J., Meier, U., Cireşan, D., Schmidhuber, J.: Stacked convolutional auto-encoders for hierarchical feature extraction. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) ICANN 2011. LNCS, vol. 6791, pp. 52–59. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21735-7_7
83. Kodirov, E., Xiang, T., Gong, S.: Semantic autoencoder for zero-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
84. Minashkina, D., Happonen, A.: Decarbonizing warehousing activities through digitalization and automatization with WMS integration for sustainability supporting operations. E3S Web of Conf. 158, 1–7 (2020). https://doi.org/10.1051/e3sconf/202015803002
85. Ghoreishi, M., Happonen, A.: The case of fabric and textile industry: the emerging role of digitalization, internet-of-Things and industry 4.0 for circularity. In: Yang, X.-S., Sherratt, S., Dey, N., Joshi, A. (eds.) Proceedings of Sixth International Congress on Information and Communication Technology: ICICT 2021, London, Volume 3, pp. 189–200. Springer Singapore, Singapore (2022). https://doi.org/10.1007/978-981-16-1781-2_18
86. Hämäläinen, H., Happonen, A., Salmela, E.: CPFR-technology and automated data flows in technical wholesale supply chain of Finnish machinery industry. In: The 3rd International Congress on Logistics and SCM Systems (ICLS 2007), vol. 28, pp. 279–286 (2007). https://doi.org/10.5281/zenodo.3377590
87. Vatousios, A., Happonen, A.: Transforming HR and improving talent profiling with qualitative analysis digitalization on candidates for career and team development efforts. Intell. Comput. 283, 1149–1166 (2022). https://doi.org/10.1007/978-3-030-80119-9_78
88. Usmani, U.A., Usmani, M.U.: Future market trends and opportunities for wearable sensor technology. Int. J. Eng. Technol. 6, 326–330 (2014)
89. Arbelle, A., Riklin Raviv, T.: Microscopy cell segmentation via adversarial neural networks. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). IEEE (2018)
90. Wang, Z., Li, C., Wang, X.: Convolutional neural network pruning with structural redundancy reduction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021)
91. Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A.: Generative adversarial networks: an overview. IEEE Signal Process. Mag. 35(1), 53–65 (2018)
Causal Probabilistic Based Variational Autoencoders Capable of Handling Noisy Inputs Using Fuzzy Logic Rules

Usef Faghihi¹(✉), Cyrus Kalantarpour¹, and Amir Saki²

¹ University of Quebec, Trois-Rivières, QC, Canada
{usef.faghihi,cyrus.kalantarpour}@uqtr.ca
² Institute for Research in Fundamental Sciences, Tehran, Iran
[email protected]
Abstract. Researchers and engineers may use inferential logic and/or fuzzy logic to solve real-world causal problems. Inferential logic uses probability theory, while fuzzy logic uses membership functions and set theory to process the uncertainty and fuzziness of events. To benefit from both logics, some researchers have tried to create probabilistic fuzzy logic (PFL). Deep Learning algorithms (DLs), with their remarkable achievements such as very high precision in some specific tasks, are at the center of weak AI. However, DLs fail when it comes to causal reasoning. One way to equip DLs with reasoning capabilities is to integrate non-classical logics such as PFL with DLs. In this paper, we demonstrate the first step toward creating a deep causal probabilistic fuzzy logic architecture capable of reasoning with missing or noisy datasets. To do so, the architecture uses fuzzy theories, probabilistic theories, and deep learning algorithms such as causal effect variational autoencoders.

Keywords: Deep learning · Probabilistic fuzzy logic · Causal reasoning · Autoencoders
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 190–202, 2022. https://doi.org/10.1007/978-3-031-10464-0_12

1 Introduction

As human beings, we are always in search of the causes of events around us. For instance, was it the spicy food I had for lunch that caused my abdominal discomfort? In causal reasoning, one uses previous information about an event or situation to predict its future state. However, discovering the real causes of events is usually difficult. To address the problem of causality, some researchers use inferential logic, which relies on probability theory, while others use fuzzy logic, which has been reported to outperform inferential logic [3]. Probability theories deal with the uncertainty of human knowledge about an event. However, probability theories do not allow for gradation [3]. Fuzzy logic makes it possible to take the vagueness of events into account. The ideal would be to use inferential and fuzzy logic theories together and create probabilistic fuzzy logic [3]. In [3], Faghihi et al. used causal fuzzy rules belonging to fuzzy rule sets to
find the influences of confounders on other variables. A confounder variable influences both dependent and independent variables, causing spurious correlations between variables. However, Faghihi et al.'s model does not have learning capabilities [3]. To equip PFL with learning capabilities, one must integrate it into DLs [4]. One powerful family of generative deep learning algorithms that is widely used for real-world problems is the family of variational autoencoder architectures [2, 5]. In the following, we briefly explain Autoencoders, Variational Autoencoders, and Causal Effect Variational Autoencoders [2].

An Autoencoder (AE) is a specific type of generative artificial neural network that learns a representation of a set of data in an unsupervised manner. An AE [2] has 1) an encoder module (the inference network in the causality context), which encodes or compresses the input data into a latent space representation (a reduced version of the original data); and 2) a decoder module, which tries to reconstruct the original input data from the encoded latent space. Variational Autoencoders (VAEs) are similar to AEs, except that they model the latent representation with a family of Gaussian distributions from which they sample. VAEs work with both continuous and discrete data. Recently, researchers created Causal Effect Variational Autoencoders (CEVAE) [2], which estimate the individual and average treatment effects (ITE and ATE, respectively) in the presence of unobserved confounders by using proxy variables, which are replacements for confounders [3].

In this paper, we first discuss related work on causality using different DLs such as the variational autoencoder family. In order to extract causal relationships from observational data, we then discuss two architectures that use PFL and the causal effect variational autoencoder (CEVAE) architecture [2], which we call FCEVAE-V1 and FCEVAE-V2. In the first architecture, FCEVAE-V1, PFL and CEVAE are each applied separately to the dataset.
That is, we cluster and fuzzify the dataset, and then feed it forward to the CEVAE architecture (Fig. 1). In the second architecture, FCEVAE-V2, we integrated association and causal rules from [3] into the CEVAE loss function (Fig. 2). That is, in the second architecture, we equipped CEVAE with a modified loss function that implements causal fuzzy logic rules from [3]. It must be noted that the initial CEVAE architecture developed by the Amsterdam lab (footnote 1) uses TensorFlow and Edward (deprecated). In this study, we used a CEVAE version built on the Pyro library (footnote 2). Pyro is faster than Edward. Finally, we compare the performance of our architecture with similar architectures and discuss the results and limitations of our work.
2 Related Works
Recently, learning causal relationships from observational data has received considerable attention in the field of Artificial Intelligence [6, 7]. Moreover, some researchers try to address the identifiability (footnote 3) issue using neural networks [8]. However, observational data may contain hidden confounder variables that cannot be measured or are very difficult to measure [2].

1 https://github.com/AMLab-Amsterdam/CEVAE.
2 https://github.com/pyro-ppl/pyro.
3 If the true parameters of a statistical model can be learned after observing a sufficient number of observations, the model is said to be identifiable. Wikipedia.
U. Faghihi et al.
Take a study in individualized medicine in which we have to figure out the best medication for a patient from observed data [2]. In this example, the socio-economic status of the patient can influence both the type of medication the patient has access to and her general health [2]. That is, the socio-economic status is a confounder, and we cannot compute its value [2]. It is worth mentioning that once we can estimate or calculate the confounder's value, another hurdle is to find which element(s) it influences the most. Let us suppose we cannot measure the confounder, which is the socio-economic status of the patient. Roughly speaking, there are two main approaches to calculating confounders. The first is a tree-based approach [9], wherein the authors use Bayesian Additive Regression Trees (BART) [10] to estimate average causal effects and to solve causal inference problems such as estimating individual treatment effects (ITE). The second approach uses Directed Acyclic Graphs (DAGs) as a causal structure and a Bayesian approach for reasoning.

TARnet [11] is one of the first neural architectures used for causal inference. It is based on weighting optimizations and feed-forward neural networks. However, TARnet is not robust enough to deal with noisy datasets [2, 12]. In 2017, Louizos et al. [2] created the Causal Effect Variational Autoencoder (CEVAE), which estimates the individual and average treatment effects (ITE and ATE) in the presence of unobserved confounders using proxy variables. A confounder variable, which can be hidden and/or have missing data, influences both dependent and independent variables, causing spurious correlations between variables. The model suggested by the authors in [2] outperformed tree-based approaches such as BART [2]. However, the model in [2] has problems with processing missing data. To improve CEVAE, the authors in [12] created the Identifiable VAE (iVAE) architecture.
This architecture postulates that different model parameters must lead to different marginal densities for the data. In 2021, the authors of [13] proposed Intact-VAE, an improved version of iVAE. Intact-VAE estimates the ATE by using a modified version of the propensity score (the probability of a subject receiving treatment) and the B-score (a balancing score, conditional on which the distribution of the covariates is the same for subjects receiving and not receiving treatment). However, this study ignores computing confounders. As opposed to current DLs, which can only process either noisy or missing data, a robust DL needs to tolerate both noisy and missing data in the presence of hidden confounders. We will achieve this by integrating non-classical logics such as probabilistic fuzzy logic rules with DLs.

Faghihi et al. [1] argued that in most real-life problems, the communication between nodes is two-way, something DAGs do not support. In other words, the mere Bayesian approach to causation cannot answer the following problem: what is the probability that socio-economic status influences the type of medication the patient has access to and her general health, and to what degree? Probabilistic Fuzzy Logic (PFL), on the other hand, excels at reasoning with degrees of certainty in real-life problems [14]. Importantly, this allows for degrees of dependency and membership. In PFL, Zadeh [14] proposes that every element of a given set has a degree of membership that lies in the interval [0, 1]. PFL processes three types of uncertainty: randomness, probabilistic uncertainty, and fuzziness. PFL can manage both the uncertainty of our knowledge (by the use of probabilities) and the vagueness inherent to the world's complexity (by data fuzzification) [14]. PFL has been used to solve many engineering problems, such as security intrusion detection [15, 16] and finding the causes of events. However, PFL cannot learn by itself and needs experts to define intervals before applying fuzzification [3]. In [3], the authors used more than ten PFL rules to discover the causal relationships between variables from observational data. However, logic cannot learn a representation of the data [3]. One solution is to integrate PFL with deep learning algorithms or to use them in parallel. In the next section, we explain how we used the CEVAE architecture [2] with PFL.
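To illustrate why fuzzification depends on expert-defined intervals, the following is a minimal sketch of fuzzifying one value with triangular membership functions. The function names, the 0 to 100 scale, and the interval boundaries in `FUZZY_SETS` are hypothetical choices made for illustration; they are not taken from [3].

```python
def triangular(x, left, peak, right):
    """Degree of membership of x in a triangular fuzzy set (0 outside [left, right])."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical expert-chosen intervals on a 0-100 scale (illustrative only).
FUZZY_SETS = {
    "Low": (-1.0, 0.0, 50.0),
    "Average": (0.0, 50.0, 100.0),
    "High": (50.0, 100.0, 101.0),
}


def fuzzify(x):
    """Map a crisp value to its membership degree in each fuzzy set."""
    return {name: triangular(x, *params) for name, params in FUZZY_SETS.items()}


memberships = fuzzify(30.0)  # partly "Low", partly "Average", not "High"
```

Changing the boundaries in `FUZZY_SETS` changes every membership degree, which is the manual step that the architectures in the next section try to avoid.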
3 Fuzzy CEVAE
We designed and implemented two versions of the CEVAE architecture [2], which we call FCEVAE: 1) FCEVAE-V1: in this architecture, PFL and CEVAE are applied separately to the dataset. That is, we clustered and fuzzified the dataset using PFL and then fed it forward to the CEVAE architecture (Fig. 1); and 2) FCEVAE-V2: in this architecture, we integrated clustering and causal rules with the CEVAE architecture (Fig. 2). That is, in the second architecture, we equipped CEVAE with a modified loss function that implements causal fuzzy logic rules from [3].

3.1 Fuzzy Causal Effect Variational Autoencoder (FCEVAE-V1) First Architecture
To cluster the dataset into "Low", "Average", and "High" clusters, we used the fuzzy c-means algorithm [1] (Fig. 1A). It is worth mentioning that, depending on the problem, one can use more than three clusters if needed. However, fuzzy c-means only gives the degree of membership of each dataset element in every cluster. Thus, its output does not include any information about the nature of the dataset. Consequently, we multiplied the clustered data with the original dataset. This gave us a weighted fuzzy representation of the dataset elements describing how well each element belongs to the fuzzy clusters.

Fig. 1. Part A is a membership vector extractor according to the fuzzy c-means algorithm [1]. It applies fuzzy c-means to the dataset and computes membership vectors for the elements of the dataset. Then, the red circle, which in our context is a 'switch neuron', selectively multiplies the membership vectors calculated in the previous step with the original dataset. Part B is the original CEVAE proposed in [2].

To justify the above multiplication, we briefly explain our Simple Probabilistic Fuzzy Logic (SPFL) theory [17], a classical probability theory mainly useful for problems that are fuzzy in nature. For instance, the following problem can be solved using SPFL theory.

Question. Suppose the fuzzy attribute Large for the set X = (1,...,20). In an experiment, what is the probability of randomly selecting 17 from X as Large?

To answer the question, one can define a random variable $\xi_{X,\mathrm{Large}}$ and compute $P(\xi_{X,\mathrm{Large}} = 17)$. Hence, here the distribution of $\xi_{X,\mathrm{Large}}$ matters. Randomly selecting the elements of X comes from the nature of the distribution on X, while selecting as Large comes from a two-step procedure consisting of fuzzifying the data by some fuzzy attributes including Large, and then the distribution determining the chance of being selected as Large. Another example follows.

Question. Suppose the fuzzy attribute Large for the set X = (1,...,20). In an experiment, we are given X = 17. What is the probability of selecting 17 as Large?

The answer to this question is $P(x \text{ is Large})$, and it comes from a distribution. Now, a binary random variable is considered as follows:

\[ \xi_{x,\mathrm{Large}} = \begin{cases} x, & \text{with probability } P(x \text{ is Large}) \\ 0, & \text{with probability } 1 - P(x \text{ is Large}) \end{cases} \]

That is, $E(\xi_{x,\mathrm{Large}}) = x\,P(x \text{ is Large})$, and it is interpreted as the quantity of x as being Large. Note that in this paper we use a model with $P(x \text{ is Large}) = \mu_{\mathrm{Large}}(x)$.

To calculate the Fuzzy Average Treatment Effect (FATE), which is used in the second architecture (FCEVAE-V2) below, we proceed as follows. Suppose X, T and Y are the covariate, the treatment and the outcome of an experiment, respectively. Let A be a fuzzy attribute of X. We define the fuzzy individual treatment effect of any X = x with respect to A as:

\[ \mathrm{FITE}_A(x) = \mathrm{ITE}(E(\xi_{x,A})). \]
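As a worked instance of the binary random variable above, the sketch below evaluates the expectation for x = 17. The linear membership function `mu_large` is an assumption made purely for illustration; the SPFL theory in [17] does not prescribe it.

```python
def mu_large(x, x_max=20):
    """Hypothetical membership of x in 'Large' on X = (1, ..., 20); an assumption."""
    return x / x_max


def expected_fuzzy_value(x, membership):
    """E(xi) for xi = x with probability membership(x) and 0 otherwise."""
    p = membership(x)
    return x * p + 0.0 * (1.0 - p)


# With the assumed membership, P(17 is Large) = 0.85, so E(xi) is about 14.45.
e17 = expected_fuzzy_value(17, mu_large)
```

This is the same product of a value and its membership degree that FCEVAE-V1 applies when it multiplies the membership vectors with the original dataset.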
It follows that:

\[ \mathrm{FITE}_A(x) = E(Y \mid X = x\mu_A(x), do(T = 1)) - E(Y \mid X = x\mu_A(x), do(T = 0)). \]

Now, we define the FATE of X with respect to A as

\[ \mathrm{FATE}_A(X) = E_X(\mathrm{FITE}_A(X)). \]

Going back to the FCEVAE-V1 architecture (Fig. 1), by multiplying the clustered data with the original dataset, we obtain a weighted fuzzy representation of the dataset elements describing how well each element belongs to the fuzzy clusters. As a result, FCEVAE-V1 produces three different average treatment effect values [18], each describing the fuzzy average treatment effect corresponding to one of the clusters "low", "average", and "high". Table 1 shows that FCEVAE-V1 outperforms Microsoft's DoWhy project (footnote 4), which implements Pearl's causal architecture [19], on the Infant Health and Development Program (IHDP) [9] dataset. The IHDP dataset contains information about the effect of specialists' home visits on premature infants' cognitive test scores [6]. In addition, the average of the three ATEs obtained by FCEVAE-V1 is 4.006. This value is closer to the true IHDP ATE of 4.021 [19] than DoWhy's 3.928. It is worth noting that in the DoWhy project, the ATE value for the IHDP dataset was calculated as the difference between the mean outcomes of the treated and control groups.

4 https://microsoft.github.io/dowhy/dowhy_ihdp_data_example.html.

Table 1. Comparison of FCEVAE-v1 and DoWhy.

Cluster            Low      Average   High
FCEVAE-v1          3.812    4.015     4.192
Microsoft DoWhy    3.928 (single value reported)
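The weighting step of FCEVAE-V1 and the averaging of the values in Table 1 can be sketched as follows. This is a simplified illustration, not the authors' implementation: the cluster centers are fixed by hand instead of being learned by the full fuzzy c-means iteration, the data is one-dimensional, and the names `fcm_memberships` and `weighted_representation` are hypothetical. Only the membership update formula is the standard one from fuzzy c-means [1].

```python
def fcm_memberships(x, centers, m=2.0):
    """Standard fuzzy c-means membership update for one scalar x and given centers."""
    dists = [abs(x - c) for c in centers]
    if 0.0 in dists:  # x sits exactly on a center
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def weighted_representation(x, centers):
    """Multiply the value by each membership degree, as in Fig. 1, Part A."""
    return [x * u for u in fcm_memberships(x, centers)]


# Hypothetical centers for the "Low", "Average" and "High" clusters.
centers = [0.0, 5.0, 10.0]
memberships = fcm_memberships(2.0, centers)
weighted = weighted_representation(2.0, centers)

# Average of the three cluster-wise ATEs reported in Table 1.
cluster_ates = {"Low": 3.812, "Average": 4.015, "High": 4.192}
mean_ate = sum(cluster_ates.values()) / len(cluster_ates)
```

The memberships of each element sum to one, so the three weighted copies of the dataset together carry the same total mass as the original value.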
However, our first architecture has two flaws: 1) similar to the original CEVAE architecture [2], it needs a human expert to select the treatment and outcome columns. In a real-world problem, however, humans may have no idea which columns are the treatment and the outcome; 2) because we fuzzify the dataset before feeding it to the CEVAE architecture, it cannot tolerate noisy data. We fixed these flaws in our second architecture. Unlike the first architecture, which uses fuzzy weighted versions of datasets to create a fuzzy-probabilistic CEVAE architecture (without using fuzzy rules), the second architecture incorporates fuzzy causal rules from [3, 20] in its loss function. This helps the CEVAE architecture discover the causal relationships between the dataset's elements.

3.2 Fuzzy Causal Effect Variational Autoencoder (FCEVAE-V2) Second Architecture
In order to create an architecture capable of dealing with noisy and missing data, we created the FCEVAE-V2 architecture by integrating our fuzzy rules from [3] into the CEVAE loss function. Our architecture is divided into two main components:
Figure 2, Part A: a conditional autoencoder that randomly generates equally unbiased samples from a dataset.
Figure 2, Part B: it takes the output of the previous step as input and uses fuzzy causal rules integrated into the CEVAE loss function to extract causal relationships.
We briefly explain the steps of our architecture in the following.

Figure 2, Part A: Before explaining Part A in detail, we briefly explain the difference between the Variational Autoencoder (VAE) and the Conditional Variational Autoencoder (CVAE) used in Fig. 2, Part A. Whereas the VAE architecture does not apply any condition when sampling, the CVAE conditions the sampling process on additional information [21, 22]. The main goal of Part A is to generate unbiased, equally sized samples without missing data. To do so, the Conditional VAE (Fig. 2A) takes a dataset with missing data and generates an equal number of samples from the conditional distribution of each of the dataset's columns. That is, we create a condition matrix that lets the Conditional VAE generate unbiased samples, so that the bias caused by the ratio of missing data is removed. For example, assume a dataset $D = (X_0, \ldots, X_n)$, where the $X_i$ are the columns, each with length $l(X_i)$. We have $M = (m_0, \ldots, m_n)$, where the $m_i$ are the corresponding missing data ratios. We generate a condition matrix $C = (C_0, \ldots, C_n)$ such that the $C_i$ are binary vectors with length $l(X_i)$. Each element $c_{ij}$ of $C_i$ equals 0 if the corresponding element of the dataset is missing, and 1 otherwise.

\[ E[\log P(X \mid z, C)] - D_{KL}[Q(z \mid X, C) \,\|\, P(z \mid C)] \tag{1} \]

Equation (1) is the CVAE's objective function. Q and P are the conditional distributions of the CVAE's encoder and decoder, respectively, and $D_{KL}$ is the Kullback-Leibler (KL) divergence. The model learns P and Q given the condition matrix C [21, 22].
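The condition matrix C and the missing-data ratios M described above can be built as in the following minimal sketch. Representing a missing entry by `None`, storing the dataset as a list of columns, and the function names are assumptions made for illustration.

```python
def condition_matrix(columns):
    """Binary mask per column: 0 where the entry is missing, 1 otherwise."""
    return [[0 if value is None else 1 for value in column] for column in columns]


def missing_ratios(columns):
    """Fraction of missing entries in each column (the m_i in the text)."""
    return [sum(value is None for value in column) / len(column) for column in columns]


# Toy dataset D with two columns; None marks a missing value (assumption).
D = [
    [1.0, None, 3.0, 4.0],
    [None, None, 0.5, 0.7],
]
C = condition_matrix(D)  # [[1, 0, 1, 1], [0, 0, 1, 1]]
M = missing_ratios(D)    # [0.25, 0.5]
```

In a CVAE, a mask of this kind would be passed to both the encoder and the decoder alongside the data, which is what conditioning on C in Eq. (1) refers to.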
Fig. 2. FCEVAE-V2 where probabilistic fuzzy logic rules are integrated with the CEVAE loss function.
Figure 2, Part B: Part A's output is an unbiased sample S with no missing values. In Part B, we create a matrix W whose columns show possible causal relationships among the columns of the dataset D (see Table 4). That is, once we calculate W, a higher value in a column (e.g., gestat10 in Table 4) shows a higher influence of that column on the outcome. We must emphasize that, contrary to previous works that used gestat10 as the cause, we used all columns to calculate possible causes. To do so, we first initialized W as a randomly generated matrix of size (n × n), where n is the number of columns of D. Importantly, since the values of W are randomly generated, different executions give slightly different values for the ITE, the ATE, and the values in Table 4. We then multiply W (see our SPFL theory above [17]) by the output from Part A. The result is fed forward into the CEVAE (Fig. 2, Part B). After the data is encoded in the encoder section of Fig. 2, Part B, the resulting data is partitioned using the fuzzy c-means algorithm [1]. This partitioning is done to automatically find fuzzy membership intervals without the need for an expert to define them. We then use fuzzy rules from [3] to fuzzify the result. The next step is to add fuzzy rules to the CEVAE loss function (Eq. (2)):

\[ \|X - \hat{X}\|^2 + \mathrm{KL}(N(\mu, \sigma), N(0, 1)) + \alpha \, \mathrm{Var}(\text{Fuzzy Ruleset})^{-1} \tag{2} \]

In the above equation, the first term is the reconstruction error. The second term is the KL divergence. The third term calculates the variance of the fuzzy rule set according to the association and causal fuzzy rules from [1]. $\alpha \in [0, 1]$ is a training hyper-parameter. It controls how strongly the fuzzy rule set from [3] influences the loss function. For α = 0, we have the original CEVAE architecture. The output of the above loss function is passed to the back-propagation algorithm to update W (Fig. 2, Part B, red rectangle).

Table 2. FCEVAE-V2 performance on the noisy IHDP dataset.

Noise              N ~ N(0.10, 0.5)   N ~ N(0.15, 0.5)   N ~ N(0.20, 0.5)
ATE (FCEVAE-v2)    3.27542            2.65458            1.76581
ATE (CEVAE)        3.35354            2.61252            1.91257
ATE (DoWhy)        1.99664            1.37661            1.28898
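The three terms of the loss in Eq. (2) can be sketched in plain Python as below. Several points are assumptions rather than the authors' Pyro implementation: the trailing "−1" in Eq. (2) is read as an inverse of the variance, `rule_values` stands for hypothetical numeric outputs of the fuzzy rules, the latent distribution is a diagonal Gaussian with standard deviations `sigma`, and `eps` is a hypothetical guard against division by zero.

```python
import math


def fcevae_v2_loss(x, x_hat, mu, sigma, rule_values, alpha, eps=1e-8):
    """Reconstruction error + KL term + alpha-weighted fuzzy rule term (sketch)."""
    # First term: squared reconstruction error ||x - x_hat||^2.
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Second term: closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over latent dims.
    kl = sum(0.5 * (s ** 2 + m ** 2 - 1.0) - math.log(s) for m, s in zip(mu, sigma))
    # Third term: inverse variance of the fuzzy rule outputs (assumed reading of Eq. 2).
    mean = sum(rule_values) / len(rule_values)
    variance = sum((r - mean) ** 2 for r in rule_values) / len(rule_values)
    fuzzy_term = alpha / (variance + eps)
    return reconstruction + kl + fuzzy_term


# With alpha = 0 the fuzzy term vanishes, matching the remark that alpha = 0
# recovers the original CEVAE-style loss.
base = fcevae_v2_loss([1.0, 2.0], [1.0, 1.0], [0.0], [1.0], [0.2, 0.8], alpha=0.0)
with_rules = fcevae_v2_loss([1.0, 2.0], [1.0, 1.0], [0.0], [1.0], [0.2, 0.8], alpha=0.5)
```

Under this reading, minimizing the loss pushes the variance of the rule outputs up; if the "−1" is instead a subtraction, only the last line of the function body changes.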
The updated W is multiplied by the output from Part A. Again, the result is passed to FCEVAE-V2, where the model applies fuzzy c-means and fuzzification and calculates the fuzzy loss function before running the back-propagation algorithm. FCEVAE-V2 repeats the above steps until the loss function converges to a minimum value.

3.3 Second Architecture's Experiments
Similar to the CEVAE project [2] and the DoWhy project [19], we tried FCEVAE-V2 with the IHDP [9] and TWINS [2] datasets. With the TWINS dataset, the goal is to find the possible causal relationships between the weight of twins and their death rate. The main difference between FCEVAE-V2 and the CEVAE and DoWhy architectures is that while the other architectures add noise to one specific column (the gestat10 column), we added noise to the whole dataset. We did this to show that DLs equipped with non-classical logic rules are tolerant to multiple noise sources. After applying FCEVAE-V2 to the IHDP dataset, we obtained ATE and ITE values similar to the outputs of the CEVAE and DoWhy projects (see GitHub). To try FCEVAE-V2 with noisy data, similar to [2], we applied Gaussian noise $N \sim \mathcal{N}(\mu, 0.5)$, where $\mu \in \{0.10, 0.15, 0.20\}$, to the IHDP dataset and passed it to FCEVAE-V2 in order to measure the network's noise tolerance. Table 2 shows that compared to other architectures, our architecture is more tolerant to noise (a lower ATE is better). We also applied the noise to the TWINS dataset and passed the noisy data to FCEVAE-V2. Table 3 shows that, compared to CEVAE and DoWhy, our model gives ATE values that are smaller in magnitude.
Table 3. FCEVAE-V2 performance on the noisy TWINS dataset.

Noise              N ~ N(0.10, 0.5)   N ~ N(0.15, 0.5)   N ~ N(0.20, 0.5)
ATE (FCEVAE-v2)    −0.02616           −0.02711           −0.05121
ATE (DoWhy)        −0.06901           −0.11760           −0.17351
ATE (CEVAE)        −0.02720           −0.02931           −0.06245
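The noise setting behind Tables 2 and 3, in which Gaussian noise is added to every column rather than to a single one, can be sketched as follows. The function name, the fixed seed, and the toy data are hypothetical, and treating 0.5 as the standard deviation (rather than the variance) is an assumption, since the text does not say which is meant.

```python
import random


def add_gaussian_noise(rows, mu, sigma=0.5, seed=0):
    """Return a copy of `rows` with N(mu, sigma) noise added to every cell."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    return [[value + rng.gauss(mu, sigma) for value in row] for row in rows]


# Toy stand-in for a dataset; this is not IHDP or TWINS.
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# One noisy copy per noise mean used in the experiments.
noisy_versions = {mu: add_gaussian_noise(data, mu) for mu in (0.10, 0.15, 0.20)}
```

Each noisy copy would then be passed to the model under test, and the resulting ATE compared across models as in the tables above.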
Similarly to the CEVAE [2] and DoWhy projects, we used the TWINS dataset with FCEVAE-V2. It must be noted that in previous works the authors only used the gestat10 column to calculate the possible cause of the twins' death rate. In this study we used all columns with FCEVAE-V2. Table 4 shows the most important relationships between columns (for the full result, see footnote 5). We would like to remind the reader that although we used a heatmap to show the values in Table 4, these values are not correlations or covariances. They are the final values of the matrix W (see above), obtained after using CEVAE's probabilistic approach and many iterations of the c-means clustering algorithm, fuzzification, and fuzzy rule integration into the CEVAE cost function. In Table 4, all values belong to the [0, 1] interval. A higher value indicates a stronger possible causal link between columns. For instance, similar to [2], our model revealed a strong relationship (0.52) between GESTAT10 and the outcome, which is one of the highest values in the outcome row. In addition, the GESTAT10 column influences many other columns, such as adequacy (adequacy of care) and incervix (risk factor, incompetent cervix).

Limitations: Similar to previous work, FCEVAE-V2 is capable of finding the causal relationships between the TWINS dataset columns (TWINS' description). Since our model uses all columns in TWINS, it also found other possible causal relationships between columns that were not mentioned in previous works (Table 5). We have found some health-related papers that could potentially suggest a scientific foundation for the results generated by our model (see the references listed in Table 5). However, this is only the very surface of what needs to be done next. Given that FCEVAE-V2 uses both probability and fuzzy approaches to calculate the causal relationships between columns in the dataset, at this point we cannot provide a precise explanation of how these values are calculated. We aim to do so in our future work. We also encourage readers to contact us should they find any explanation for our results (the code is on GitHub (see footnote 5)).

5 https://github.com/joseffaghihi/Causal-fuzzy-CEVAE/blob/main/2021-12-14/Arch2/ARC2_Final_2021_12_14.ipynb.
Table 4. Partial FCEVAE-V2 output for the TWINS dataset. Each element, in the [0, 1] interval, is the causality level of the associated columns and rows from matrix W. The dark blue color shows possible causal relationships. For the full result, see footnote 6.

Row              gestat10
pldel            0.083956
birattnd         0.689061
brstate          0.077812
stoccfipb        0.083477
mager8           0.490378
ormoth           0.082689
mrace            0.08223
meduc6           0.135758
dmar             0.082715
mplbir           0.077774
mpre5            0.650457
adequacy         0.917762
orfath           0.080428
frace            0.080537
birmon           0.076525
gestat10         1
csex             0.081649
anemia           0.112826
cardiac          0.323684
lung             0.380819
diabetes         0.127854
herpes           0.110825
hydra            0.138176
hemo             0.082113
chyper           0.176568
phyper           0.151857
eclamp           0.174859
incervix         0.942312
pre4000          0.732697
preterm          0.705165
renal            0.083556
rh               0.080052
uterine          0.082438
othermr          0.07918
tobacco          0.309554
alcohol          0.345166
cigar6           0.258615
drink5           0.323664
crace            0.081194
data_year        0.077217
nprevistq        0.64534
dfageq           0.080462
feduc6           0.080072
infant_id        0.078493
dlivord_min      0.512333
dtotord_min      0.424431
bord             0.075763
brstate_reg      0.078398
stoccfipb_reg    0.07851
mplbir_reg       0.080129
wt               0.083445
treatment        0.47436
outcome          0.520736

6 https://github.com/joseffaghihi/Causal-fuzzy-CEVAE/blob/main/2021-12-14/Arch2/ARC2_Final_2021_12_14.ipynb.
Table 5. TWINS dataset columns and their descriptions according to TWINS' description.

Column name and description              Column name and description                   Reference
mpre5 (trimester prenatal care begun)    adequacy (adequacy of care)                   [23]
mpre5 (trimester prenatal care begun)    eclamp (risk factor, Eclampsia)               [24]
mpre5 (trimester prenatal care begun)    incervix (risk factor, Incompetent cervix)    [25]
4 Conclusion
In this paper, we have shown that Deep Learning algorithms (DLs) equipped with non-classical logics such as PFL are capable of reasoning with multiple sources of missing data or noise. This had not been done in previous works, in which only one source of noise was used. To do so, we created two architectures: 1) in the first, after applying probabilistic fuzzy logic (PFL) association and causal rules to the dataset, the architecture feeds the output forward to the Causal Effect Variational Autoencoder (CEVAE) architecture [2]; 2) in the second, we integrated PFL rules into the CEVAE loss function. Compared to Microsoft DoWhy and the original CEVAE architecture, our FCEVAE-V2 is more tolerant to datasets with missing data and multiple sources of noise. In contrast to the original CEVAE architecture, which relies heavily on human experts to determine the treatment column, our FCEVAE-V2 does this automatically. That is, in order to reveal possible causal relationships between columns, our model applies causal rules to all columns. To prevent combinatorial problems when selecting the treatment, FCEVAE-V2 uses the CEVAE compression technique. Much work remains to be done. An important limitation of our work is explaining the calculations of the causal relationships between columns, and their interpretation in real-life scientific contexts.

Acknowledgments. We thank Sioui Maldonado Bouchard for kindly proofreading this paper.
References
1. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: The fuzzy c-means clustering algorithm. Comput. Geosci. 10(2–3), 191–203 (1984)
2. Louizos, C., Shalit, U., Mooij, J.M., Sontag, D., Zemel, R., Welling, M.: Causal effect inference with deep latent-variable models. In: Advances in Neural Information Processing Systems, pp. 6446–6456 (2017)
3. Faghihi, U., Robert, S., Poirier, P., Barkaoui, Y.: From association to reasoning, an alternative to Pearl's causal reasoning. In: Proceedings of AAAI-FLAIRS 2020. North-Miami-Beach (Florida) (2020)
4. Faghihi, U., Maldonado-Bouchard, S., Biskri, I.: Science of data: from correlation to causation. Springer (2021)
5. An, J., Cho, S.: Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2(1), 1–18 (2015)
6. Yao, L., Li, S., Li, Y., Huai, M., Gao, J., Zhang, A.: Representation learning for treatment effect estimation from observational data. Adv. Neural Inform. Process. Syst. 31 (2018)
7. Guo, R., Cheng, L., Li, J., Hahn, P.R., Liu, H.: A survey of learning causality with data: problems and methods. ACM Comput. Surv. (CSUR) 53(4), 1–37 (2020)
8. Roeder, G., Metz, L., Kingma, D.: On linear identifiability of learned representations. In: International Conference on Machine Learning. PMLR (2021)
9. Hill, J.L.: Bayesian nonparametric modeling for causal inference. J. Comput. Graph. Stat. 20(1), 217–240 (2011)
10. Chipman, H.A., George, E.I., McCulloch, R.E.: BART: Bayesian additive regression trees. Ann. Appl. Stat. 4(1), 266–298 (2010)
11. Shalit, U., Johansson, F.D., Sontag, D.: Estimating individual treatment effect: generalization bounds and algorithms. In: International Conference on Machine Learning. PMLR (2017)
12. Khemakhem, I., Kingma, D., Monti, R., Hyvarinen, A.: Variational autoencoders and nonlinear ICA: A unifying framework. In: International Conference on Artificial Intelligence and Statistics, pp. 2207–2217. PMLR (2020)
13. Wu, P., Fukumizu, K.: Intact-VAE: estimating treatment effects under unobserved confounding. arXiv preprint arXiv:2101.06662 (2021)
14. Yager, R.R., Zadeh, L.A.: An introduction to fuzzy logic applications in intelligent systems, vol. 165. Springer Science & Business Media (2012)
15. Zhao, D.-M., Wang, J.-H., Wu, J., Ma, J.-F.: Using fuzzy logic and entropy theory to risk assessment of the information security. In: 2005 International Conference on Machine Learning and Cybernetics. IEEE (2005)
16. Cheng, P.-C., Rohatgi, P., Keser, C., Karger, P.A., Wagner, G.M., Reninger, A.S.: Fuzzy multi-level security: An experiment on quantified risk-adaptive access control. In: 2007 IEEE Symposium on Security and Privacy (SP'07). IEEE (2007)
17. Saki, A., Faghihi, U.: Fuzzy Rule Based Probability Theory (IN PREPARATION) (2022)
18. Ng, A.: O'Reilly, and Associates, AI is the New Electricity. O'Reilly Media (2018)
19. Sharma, A., Kiciman, E.: DoWhy: An end-to-end library for causal inference. arXiv preprint arXiv:2011.04216 (2020)
20. Robert, S., Faghihi, U., Barkaoui, Y., Ghazzali, N.: Causality in probabilistic fuzzy logic and alternative causes as fuzzy duals. In: Hernes, M., Wojtkiewicz, K., Szczerbicki, E. (eds.) ICCCI 2020. CCIS, vol. 1287, pp. 767–776. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-63119-2_62
21. Sohn, K., Lee, H., Yan, X.: Learning structured output representation using deep conditional generative models. Adv. Neural. Inf. Process. Syst. 28, 3483–3491 (2015)
22. Doersch, C.: Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016)
23. Tayebi, T., Zahrani, S.T., Mohammadpour, R.: Relationship between adequacy of prenatal care utilization index and pregnancy outcomes. Iran. J. Nurs. Midwifery Res. 18(5), 360 (2013)
24. Herrera, J., Chaudhuri, G., López-Jaramillo, P.: Is infection a major risk factor for preeclampsia? Med. Hypotheses 57(3), 393–397 (2001)
25. Nicolaides, K.H.: Turning the pyramid of prenatal care. Fetal Diagn. Ther. 29(3), 183–196 (2011)
Multi-Object On-Line Tracking as an Ill-Posed Problem: Ensemble Deep Learning at the Edge for Spatial Re-identification
Vasanth Iyer1(B) and Asif Mehmood2
1 Grambling State University, Grambling, LA 71245, USA [email protected]
2 JAIC, Leesburg, VA, USA [email protected]
Abstract. Online tracking is a feature of many state-of-the-art object trackers. When learning online, the data is limited, so the tracker learns only a sketch of the object's features. For a tracker to successfully re-identify the same object in future frames in many different contexts, including occlusions, the tracker has to keep meta-data over time. In multi-object inference, this can exponentially increase the costs and is an ill-posed problem. This paper introduces a model-based framework that combines an ensemble of offline pre-trained models cascaded with domain-specific context for spatial tracking. Our method is efficient in re-identifying objects detected by any camera detector, as there is minimal online computation. The second model uses a cosine similarity ranking of the label detected by the first model to find its corresponding set of raw images from the domain training set. A high score means model one has previously seen the object, and a low score amounts to a new detection. By using a two-stage AI-trained ensemble at the edge device, we show that the proposed tracker can perform 10 times faster with its precise detection, and that the re-identification at the second stage is accurate, avoiding ID flipping for longer durations on video streams.

Keywords: Vehicle re-identification · Real-time tracking · Triplet loss · Cosine similarity · Edge computing

1 Research Innovation and Objective(s)
Machine learning models become more accurate when they are trained with a large amount of data. The computationally expensive training can be done offline, and the final model weights can be deployed to an edge device, thus avoiding online learning. We combine a general-purpose single-shot object detector with a model that has been trained specifically for the multi-camera re-identification task. In this study, we track vehicles captured by multiple cameras. We then learn a new vehicle trajectory domain embedding. When a random test vehicle is used, it will produce a high score if it is part of any of the training trajectories learned from multiple cameras. A low score rules out close duplicates that may be alike in model and color but were not seen during training. To acquire datasets for re-identification domain research, a combination of pre-trained object detectors and re-identification models from the OpenVINO model zoo was used, which was then further optimized with the fast-reid framework.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 203–213, 2022. https://doi.org/10.1007/978-3-031-10464-0_13
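The scoring rule described in this section, where a high cosine similarity means a previously seen vehicle and a low score means a new one, can be sketched as follows. The embeddings, the vehicle labels, the threshold of 0.8, and the function names are all hypothetical; the paper's actual embeddings come from the trained re-identification model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, gallery, threshold=0.8):
    """Return (vehicle_id, score) for the top-ranked gallery entry,
    or (None, score) when the best score is below the threshold."""
    scored = [(cosine_similarity(query, emb), vid) for vid, emb in gallery.items()]
    score, vid = max(scored)
    return (vid, score) if score >= threshold else (None, score)


# Toy gallery of embeddings learned from the training trajectories (made up).
gallery = {"truck_1": [1.0, 0.0, 0.2], "car_7": [0.0, 1.0, 0.1]}

seen = best_match([0.9, 0.1, 0.2], gallery)    # close to truck_1
unseen = best_match([0.0, 0.0, 1.0], gallery)  # far from both entries
```

The threshold trades off ID flips against missed re-identifications and would need to be tuned on the target domain.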
2 Introduction
Because online learning relies on very small datasets, it poses a problem for near real-time trackers. Edge computing devices have cameras built into them, so they can operate at higher frame rates and track multiple cameras at once. Deep-learning detectors like MobileNet and YOLO achieve the highest detection accuracy thanks to multi-layer networks trained on massive datasets like ImageNet and COCO. Despite their accuracy, these detectors are not intended for real-time inference. Recent advances in vision processor (VCPU) architecture have enabled the deployment of trained models into real-time inference devices, allowing neural networks to be accelerated. With the ensemble model, edge VCPUs can track at high frame rates and take action based on AI models without connecting to the cloud. With neural-accelerated hardware architectures, AI-based models can infer their spatial, human-like worlds in real time.

This paper discusses a multi-camera tracking AI application that can be used to re-identify vehicles. Figure 1 shows CAM-A detecting a vehicle passing through a tunnel. The vehicle needs to be identified again when it emerges from the second tunnel. Figure 1(a) depicts different perspectives in a multi-camera system that may be difficult to track, with an entirely different visual marking as seen by CAM-B. The study was inspired by two datasets: (i) the XView satellite dataset [13], for developing techniques and aerial object detection research, and (ii) the ESCAPE multi-modal dataset [18], for tracking vehicles with the same visual cue but different acoustic signatures. The datasets were made available during the Summer Faculty Fellowship 2019–2021. Combining multiple cameras makes it possible to create a signature based on all the extracted information about the truck, such as its color and shape. The same truck can then be re-identified when it emerges from the tunnel using a different camera.

Figure 1(b) presents the block diagram architecture of the ensemble used for the re-identification task, with a baseline model that handles the localization and time-stamping of the truck without visual clues. The modules and camera connections for the baseline trackers are displayed in blue. Modules for ensemble re-identification are implemented using red and green connections with the edge-deployed camera. Currently, the baseline fast-tracking infrastructure is provided by open software libraries, shown in blue in Fig. 1(b). The baseline uses a well-known polynomial-time trajectory assignment algorithm called the centroid tracker. We will use the baseline tracker to compare performance with our ensemble implementation running on the edge device. Before describing the re-identification
On-Line Tracking
(a) Re-Identification of Vehicles Entering a Tunnel (CAM-A) and Leaving a Tunnel (CAM-B). (b) A Re-Identification Algorithm Deployed in an Edge Device. The Green Arrow Indicates a Cascaded Ensemble, while the Dotted Line Indicates a Tracker-only Ensemble.
Fig. 1. Multi-camera system for vehicle re-identification.
(a) Previous Frame: Initial Clustering of Centroid Tracks. (b) Current Frame: Online Clustering of Centroid Tracks.
Fig. 2. Centroid-based clustering applied to two frames (a) and (b).
Fig. 3. Tracking in (a) and (b) overlaps with the previous frame. (c) Shows no overlap and clustering fails.
V. Iyer and A. Mehmood
task, we need to define some of the building blocks and how they communicate with the open-source camera modules shown in Fig. 1(b). An object detector is trained offline. The model is deployed into the edge device using the open-source spatial camera kit to run natively on the camera. When the camera detects multiple objects in real time, the detector outputs their bounding box locations. During the first frame, all bounding boxes are assigned a unique ID. When the detector processes the next frame, new bounding box positions are generated. A centroid clustering algorithm is used to match the tracker's previous and current sets of bounding boxes. This gives two sets of centroid positions, one for the prior frame and one for the current frame. Figure 3 illustrates red and blue bounding boxes for online tracking [2,4,6,8,9,11]. The blue bounding boxes are not tracked, while the red bounding boxes have been tracked for one or more frames and carry a unique ID and a colored path showing the trajectory. The transition from a blue bounding box to a red bounding box is shown in Fig. 2(a) and Fig. 2(b).

K-means clustering is used to find closely located centroids from the two sets, and the set of actively tracked cars changes when the centroids of the two sets converge. If every centroid pair is part of a cluster assignment, the current tracking IDs are not changed, as they belong to the same vehicles. If a centroid is not part of any cluster assignment, it denotes a new object and is assigned the next available ID. Once objects have been assigned a tracking state, the active vehicle centroids are passed to a Hungarian [3] path assignment algorithm, which finds the best match between the current IDs and the current tracks so as to minimize errors in assigning new detections to existing tracks. These steps are repeated until an object has not been detected for some number of frames, at which point it is removed from the tracking list. In the next section, we will compare the basic use cases shown in Fig. 3, in which the tracker assigns unique IDs, with the final use case (c), in which the two sets do not overlap.
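The baseline tracking loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CentroidTracker`, the parameters `max_distance` and `max_missed`, and the box format `(x1, y1, x2, y2)` are all hypothetical. A simple distance gate stands in for the k-means step, and an exhaustive minimum-cost search stands in for the Hungarian algorithm (it reaches the same optimum but is only practical for a handful of objects).

```python
from itertools import permutations
from math import dist


def centroid(box):
    """Center (x, y) of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def best_assignment(track_points, det_points):
    """Exhaustive minimum-total-distance matching.

    Stand-in for the Hungarian algorithm: same optimum, factorial time.
    Returns a list of (track_index, detection_index) pairs.
    """
    n_t, n_d = len(track_points), len(det_points)
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d))) for p in permutations(range(n_t), n_d))
    best_pairs, best_cost = [], float("inf")
    for pairs in candidates:
        cost = sum(dist(track_points[t], det_points[d]) for t, d in pairs)
        if cost < best_cost:
            best_pairs, best_cost = pairs, cost
    return best_pairs


class CentroidTracker:
    def __init__(self, max_distance=50.0, max_missed=5):
        self.max_distance = max_distance  # matches farther than this are rejected
        self.max_missed = max_missed      # frames a track may go undetected
        self.next_id = 0
        self.tracks = {}                  # track id -> last known centroid
        self.missed = {}                  # track id -> consecutive misses

    def update(self, boxes):
        """Consume one frame of detections; return one track id per box."""
        points = [centroid(b) for b in boxes]
        ids = list(self.tracks)
        pairs = best_assignment([self.tracks[i] for i in ids], points)

        assigned = {}
        for t, d in pairs:
            if dist(self.tracks[ids[t]], points[d]) <= self.max_distance:
                assigned[d] = ids[t]

        # Detections without a nearby track are treated as new objects.
        for d in range(len(points)):
            if d not in assigned:
                assigned[d] = self.next_id
                self.next_id += 1

        for d, tid in assigned.items():
            self.tracks[tid] = points[d]
            self.missed[tid] = 0

        # Tracks without a detection age out after max_missed frames.
        matched = set(assigned.values())
        for tid in ids:
            if tid not in matched:
                self.missed[tid] += 1
                if self.missed[tid] > self.max_missed:
                    del self.tracks[tid], self.missed[tid]
        return [assigned[d] for d in range(len(points))]


tracker = CentroidTracker(max_distance=50.0, max_missed=1)
frame1 = tracker.update([(0, 0, 10, 10), (100, 100, 110, 110)])
frame2 = tracker.update([(102, 101, 112, 111), (2, 1, 12, 11)])
frame3 = tracker.update([(300, 300, 310, 310)])
print(frame1, frame2, frame3)
```

In this toy run the second frame keeps both IDs even though the detections arrive in a different order, while the third detection is far from every known centroid and therefore receives a fresh ID. That third frame mirrors the non-overlapping case of Fig. 3(c), where a purely position-based tracker has no way to recover the original ID.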
Fig. 4. ImageNet pre-trained model with 1000 classes trained with softmax loss.
(a) Cars Dataset with 196 Classes.
(b) Aerial Object Detection Task with XView2 80 Classes.
Fig. 5. Softmax loss
3 Method
In this article, we discuss how the training process differs in the ensemble, which has one model for the detector and another for re-identifying vehicles. The first model is trained to be a general-purpose detector using ImageNet, as shown in Fig. 4. We train the model the same way as the base model, which is pre-trained with many images from SPIE-2015, SPIE-2017, SPIE-2019, and SPIE-2020. Using negative samples as the final layer of batch training, a softmax [14] loss function increases the separation boundary between classes. As shown in Fig. 4, the trained model is then ready for deployment to the device.
3.1 Ensemble Training
In contrast to the general-purpose ImageNet classes, the second model for our tracker requires different training images. Domain-specific datasets are collected specifically for a re-identification task and have fine-grained, overlapping classes. Figures 5(a) and 5(b) show some examples. Using the same softmax loss to separate fine-grained vehicle datasets [7] results in poor boundary separation, so a different loss function is used to handle fine-grained classes. Because the goal of the re-id algorithm, unlike classification, is to find a close match among the new detections, it uses a triplet loss function [17]. The triplet loss uses three training samples: two of the same class and one negative example. During each training batch, the training process randomly selects images from the training set that meet these criteria and minimizes the training loss. If no such triplet combination is found, the loss is held at zero until the next batch yields a suitable triplet. Through ensemble design, it is possible to distinguish fine-grained
(a) Example Query for a Black Sedan.
(b) Example Query for a White Sedan.
(c) Example Query of a Truck Re-ID with Multiple Cameras.
Fig. 6. Triplet loss embedding ranking for the Re-Id query image: (red) boxes show true positives and (blue) boxes show false positives from the VeRi dataset.
classes and images from multiple cameras of the same vehicle without requiring online clustering. Because the loss function is tuned to separate fine-grained classes, such models can be used for re-identification tasks involving, for example, trucks and pedestrians. We therefore train on the vehicle-domain VeRi-Wild dataset using the triplet loss. IDs for multiple cameras are provided in [7]. Figure 1 illustrates some of the design goals of deploying multiple-camera systems for vehicle re-identification. Figure 3(c) illustrates one of the challenges: handling long delays between re-identifications of vehicles whose detections do not overlap. Currently, the baseline design uses online clustering between frames, which fails if there is no overlap. With offline-trained models, we use the triplet loss function to cluster similar vehicles from multiple cameras. Figure 6 displays test queries with offline clustering. Figures 6(a), 6(b), and 6(c) illustrate various clustering results, from a simple color-attribute query to a more complex multi-camera truck re-identification. Consequently, the triplet loss function can find and rank the images matching a query based on learning reinforced by multiple cameras.
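To make the triplet loss described in this section concrete, here is a minimal sketch in plain Python. It assumes Euclidean distance between embedding vectors and a hypothetical `margin` value; the embedding network, the sampling strategy, and the actual margin used with fast-reid are not specified in the text and are not reproduced here.

```python
from math import dist


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive are embeddings of the same vehicle;
    negative is an embedding of a different vehicle.
    """
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


def batch_triplet_loss(triplets, margin=1.0):
    """Mean loss over a batch; zero when the batch holds no triplets."""
    if not triplets:
        return 0.0
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)


anchor = (0.0, 0.0)
positive = (3.0, 4.0)        # distance 5 from the anchor
easy_negative = (6.0, 8.0)   # distance 10: already separated by the margin
hard_negative = (3.0, 0.0)   # distance 3: closer than the positive

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy triplet contributes no loss because the negative is already farther away than the positive by more than the margin. The hard triplet is penalized, which is what pushes look-alike vehicles of different identity apart in the embedding space.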
Fig. 7. Current model design extension to handle occlusion in video streams.
3.2 Re-ID Occlusion Use Case
As part of the design of a robust tracker, it is important not to rely on online clustering, because a tracked object may be occluded and unable to provide position updates to the tracker. Figure 7 shows the architecture of an offline-trained occlusion tracking model. In the previous frame, there is a successful detection with a vehicle ID of 1, and the re-id step saves the car's attributes by cropping the predicted bounding box for future queries. In the next frame, there is no detection due to occlusion, which is shown in red, and the re-ID step flags the actively tracked car as missed. Finally, the current frame successfully detects the same vehicle again, but its position history has been lost. The new detection is cropped and sent to the re-identification step, in which its visual attributes are compared with those of previously tracked objects. The re-identification algorithm therefore has to determine whether the current detection is new. For a previously seen vehicle, the query returns a higher cosine similarity without requiring positional information. Cosine similarity in re-id is not based on the position information used in the baseline algorithm, but rather on a similarity ranking over all ground-truth images available in the offline-trained model. In the event of a miss caused by occlusion, the existing tracking ID is reassigned once the similarity ranking is established. Without any fine-tuning, offline model-based trackers can re-identify vehicles over longer periods of time than other known trackers. Because the model is trained offline, it is able to track occluded objects in real time.
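The occlusion case above boils down to a position-free lookup: the embedding of the new crop is ranked against the stored embeddings of previously tracked vehicles. The sketch below illustrates that lookup under stated assumptions. The function names, the `gallery` dictionary, and the `threshold` of 0.8 are hypothetical, and the short vectors stand in for embeddings that a re-ID network would produce.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reidentify(query, gallery, threshold=0.8):
    """Return (track_id, score) for the best gallery match.

    gallery maps a track id to the embedding saved for that track.
    If the best score is below the threshold, the detection is
    treated as a new vehicle and the id is None.
    """
    best_id, best_score = None, -1.0
    for track_id, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_id, best_score = track_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


gallery = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}

# Vehicle 1 reappears after occlusion with a slightly different embedding.
print(reidentify([0.9, 0.1, 0.0], gallery))

# A vehicle unlike anything in the gallery is treated as new.
print(reidentify([0.0, 0.0, 1.0], gallery))
```

Because only appearance is compared, the lookup gives the same answer whether the vehicle reappears one frame or many frames later, which is the property the occlusion use case relies on.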
4 Results and Discussion
Section 3 outlines the tracker and Re-ID inference modules. The deployment setup for the tracker and Re-ID modules will be discussed separately since they
(a) Transfer Learning Loss with Less than 100 Images. (b) Triplet Training Loss for VeRi-Wild Dataset.
(c) Testing Drone Footage with Transfer Learning Detector.
Fig. 8. XView2 model tested with drone footage (generic) for vehicle re-identification.
are two-stage processes. Tests are performed using video with occlusion, and future versions will use synthetic data to add challenging use cases for occlusion testing.
4.1 Transfer Learning
The first-stage tracker can be trained on aerial images in order to automate drone tracking. YOLOv3 is trained using the COCO dataset and fine-tuned using the XView satellite dataset, which contains similar small vehicles. Table 1 shows the training error and average IOU. The optimal model was selected after 20,000 iterations. Figure 8 shows how generic drone footage [12] was used to identify trucks.
4.2 Baseline Tracker
We need to optimize the model for speed before deploying it. Offline models are typically built using TensorFlow, Caffe, and YOLO. Weights must first be converted into an open format called Open Neural Network Exchange (ONNX) and then optimized for the edge VCPU processor. On the edge device, the tracker code is tested for speed more than accuracy, so there are two ways to deploy
Table 1. Transfer learning error for the satellite dataset
Iterations/Epoch | Average IOU | mAP
10,000 | 25.66% | 14.28%
20,000 | 30.56% | 13.17%
50,000 | 29.53% | 13.77%
it to the OAK-D [15,16] spatial camera. In the first generation, the model is compiled into OAK-D, but the tracker is not optimized and runs separately. Second-generation devices run the tracker as a node at the edge. Both are then benchmarked for the tracker's frame rates using the drone dataset [12]. The results are in Table 2, which shows that edge devices can perform neural network inference approximately 10x faster than a camera connected to a host computer. The most suitable model for OAK-D cameras is the MobileNet model. The current baseline code is available at https://github.com/viyer-research/ReIDBaseModel.

Table 2. Benchmarking baseline pretrained models at the edge device (tracker MobileNet architecture).
Running at | Detector model | Camera fps | Inference fps
Intel OpenVINO | mobilenet-ssd.blob | 30 | 0.8–3.75
Host | mobilenet-ssd.blob | 30 | 0.8–3.75
Host | vehicle-detection-adas-0002 | 30 | 0.7–1.4
Baseline camera | Vehicle-adas-0002 | 30 | 14.2–17.6
Baseline camera | mobilenet-ssd | 30 | 34.56–36.4

4.3 Ensemble Re-identification
The ensemble [10], as shown in Fig. 1(b), uses two models. The re-identification model is used to enhance domain-specific tracking tasks at the edge. We use two different pre-trained models for vehicle re-identification, provided by fastreid [7] and Intel OpenVINO [5]. Both are trained using the triplet loss, and currently only the OpenVINO zoo models [1] can be translated to OAK-D hardware. We therefore first run the baseline tracker on the host to see how many IDs are not uniquely identified, which is also called ID flipping. As expected, the baseline relies on detection position information and has a high percentage of flips. When the MobileNet detector is used to identify vehicles from a video stream, there are 30% ID flips. We tested the ensemble version described in Sect. 3 with various re-id models available from the Model Zoo. As ID generation has real-time constraints, we show that the percentage of flipping is lowest when the ensemble is running on the OAK-D device. When
using the gen2-pipeline, the inference speeds come close to the camera's native speeds, as seen in our initial edge deployment tests. The best model in terms of both speed and accuracy is obtained with the gen-1 pipeline, using vehicle-reid-0001 [1] and MobileNet in the ensemble. However, the lowest flip rate, 2%, is achieved when using the VeRi-Wild dataset trained with multiple cameras. The results in Table 3 also show that when running the baseline tracker at the node in the gen2-pipeline without a re-id model, the flipping rate falls from 10% to 5%. As some percentage of flipping is due to frame-rate mismatch rather than model accuracy, it seems possible to run at higher fps with the gen2-pipeline. Some of the initial results suggest that running inference at the edge will decrease the flipping rate during re-identification in some cases. Given the rapid growth of edge-computing infrastructure for AI models, newer tracking algorithms can use offline training and minimal online tuning while maintaining high accuracy. Ensemble models optimized for the gen-1 pipeline can be found here: https://github.com/viyer-research/DepthAIReidentification-EnsembleModel.

Table 3. Natively supported Mobilenet-ssd on OAK-D performs with least flips when using re-identification ensemble (re-identification architecture).
Detector model | Re-ID model | Host hardware | OAK-D hardware
mobilenet-ssd | – | 30% | 5%
mobilenet-ssd | vehicle-reid-0001 | 10% | 5% (gen-1 pipeline)
Vehicle-adas-0002 | vehicle-reid-0001 | 12% | 5% (gen-1 pipeline)
pedestrian-and-vehicle-detector-adas-0001 | veriwild bot R50 | 2% | –
mobilenet-ssd | – | 5% | 5% (gen-2 pipeline)
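The text reports ID flips as percentages but does not spell out how they are counted. As a hedged illustration only, the sketch below uses one plausible definition: for every ground-truth vehicle, each pair of consecutive observations counts as a transition, and a transition is a flip when the tracker's ID changes. The function name `id_flip_rate` and the per-frame dictionary format are assumptions, and the rates in Table 3 may have been computed differently.

```python
def id_flip_rate(frames):
    """Fraction of consecutive observations in which the tracker id changed.

    frames is a list with one dict per video frame, mapping a
    ground-truth vehicle id to the id assigned by the tracker.
    """
    last_seen = {}
    flips = 0
    transitions = 0
    for frame in frames:
        for gt_id, track_id in frame.items():
            if gt_id in last_seen:
                transitions += 1
                if last_seen[gt_id] != track_id:
                    flips += 1
            last_seen[gt_id] = track_id
    return flips / transitions if transitions else 0.0


frames = [
    {"car_a": 1, "car_b": 2},
    {"car_a": 1, "car_b": 2},
    {"car_a": 3, "car_b": 2},  # car_a is re-labelled: one flip
    {"car_a": 3},
]
print(f"{100 * id_flip_rate(frames):.0f}% ID flips")
```

Under this definition the toy sequence has one flip in five transitions. A vehicle that is missing for several frames and then returns under a new ID also counts as a flip, which matches the occlusion scenario of Sect. 3.2.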
Acknowledgments. We wish to thank JD AI Research for their collaboration during the pandemic on multi-camera data with 776 vehicle types captured by 20 cameras, and we note that the spatial OAK-D helped accelerate the prototyping process for understanding spatial edge use cases. We are grateful for the grant from Google to develop our machine learning course using Google Colab at the undergraduate level. A grant from FIU DOD #3301959-1534-310403, prog20, funded the travel of the first author. The first author thanks the AFRL mentors for continuing to support object detection using aerial drone images.
References
1. OpenVINO™ toolkit - open model zoo repository. https://github.com/openvinotoolkit/open_model_zoo. Accessed 30 Aug 2021
2. Babenko, B., Yang, M.H., Belongie, S.: Visual tracking with online multiple instance learning. In: CVPR (2009)
3. Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468 (2016)
4. Bolme, D.S., Ross Beveridge, J., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2544–2550 (2010)
5. Demidovskij, A., et al.: OpenVINO deep learning workbench: towards analytical platform for neural networks inference optimization. In: Journal of Physics: Conference Series, vol. 1828, no. 1, p. 012012, February 2021
6. Grabner, H., Grabner, M., Bischof, H.: Real-time tracking via on-line boosting. In: Chantler, M.J., Fisher, R.B., Trucco, E. (eds.) BMVC, pp. 47–56. British Machine Vision Association (2006)
7. He, L., Liao, X., Liu, W., Liu, X., Cheng, P., Mei, T.: FastReID: a pytorch toolbox for general instance re-identification. arXiv preprint arXiv:2006.02631 (2020)
8. Held, D., Thrun, S., Savarese, S.: Learning to track at 100 FPS with deep regression networks. CoRR arXiv:1604.01802 (2016)
9. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015)
10. Iyer, V.: Ensemble stream model for data-cleaning in sensor networks. Ph.D. thesis, USA (2013)
11. Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409–1422 (2012)
12. Krajewski, R., Bock, J., Kloeker, L., Eckstein, L.: The highD dataset: a drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2118–2125 (2018)
13. Lam, D., et al.: xView: objects in context in overhead imagery (2018)
14. Liu, W., Wen, Y., Yu, Z., Yang, M.: Large-margin softmax loss for convolutional neural networks (2017)
15. Luxonis: DepthAI: embedded machine learning and computer vision API (2020). https://www.luxonis.com/
16. Luxonis: OAK-D: Stereo camera with edge AI. Stereo Camera with Edge AI capabilities from Luxonis and OpenCV (2020)
17. Qian, Q., Shang, L., Sun, B., Juhua, H., Li, H., Jin, R.: SoftTriple loss: deep metric learning without triplet sampling (2020)
18. Zulch, P., Distasio, M., Cushman, T., Wilson, B., Hart, B., Blasch, E.: Escape data collection for multi-modal data fusion research. In: 2019 IEEE Aerospace Conference, pp. 1–10 (2019)
An Ensemble-Based Machine Learning for Predicting Fraud of Credit Card Transactions

Tahani Baabdullah, Danda B. Rawat, Chunmei Liu, and Amani Alzahrani

Data Science and Cybersecurity Center (DSC2), Department of Electrical Engineering and Computer Science, Howard University, Washington, D.C. 20059, USA
{Tahani.baabdullah,Amani.Alzahrani}@bison.howard.edu, {Danda.Rawat,Chuliu}@howard.edu

Abstract. Credit cards have become an essential part of daily life because they are easy to use and offer flexible payment. A critical consequence of the growing use of credit cards is the occurrence of fraudulent transactions, which allow an illegitimate user to obtain money and goods through unauthorized use. Artificial Intelligence (AI) and Machine Learning (ML) have become effective techniques used in different applications to ensure cybersecurity. This paper proposes our fraud detection system, called Man-Ensemble CCFD, which uses an ensemble-learning model with two stages of classification and detection. Stage one, called ML-CCFD, utilizes ten machine learning (ML) algorithms to classify credit card transactions into class 1 (fraudulent transaction) or class 0 (legitimate transaction). We then compared their classification reports, specifically precision, recall (sensitivity), and F1-score, and selected the most accurate ML algorithms based on their classification performance and prediction accuracy. The second stage, known as Ensemble-learning CCFD, is an ensemble model that applies the Man-Ensemble method to the most effective ML algorithms from stage one. The second stage produces the final prediction with this method instead of using common types of ensemble learning, such as voting, stacking, boosting, and others. Our framework's results showed the effectiveness and efficiency of our fraud detection system compared to using ML algorithms individually, which suffer from weaknesses such as errors, overfitting, bias, limited prediction accuracy, and limited robustness.

Keywords: Fraud detection · Imbalanced datasets · Credit card fraud · Fraudulent transaction · Ensemble learning

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 214–229, 2022. https://doi.org/10.1007/978-3-031-10464-0_14

1 Introduction
1.1 Credit Card Fraud
Most people, across age groups, have credit cards and use them daily for most of their purchases. Using credit cards is considered an essential part of our lives because they are easy to use and flexible to
pay. Figure 1 shows the approval procedure for a credit card transaction, from the moment the cardholder swipes or uses the card until the transaction is completed or canceled. The transaction passes through the following points: the cardholder, the merchant's payment system (point-of-sale (POS) terminal/software or e-commerce website), the merchant's bank, the payment brand, and the cardholder's bank, which authorizes the transaction. It is then routed back through the same points with its authorization number/code until it reaches the merchant, who finalizes the transaction with the customer. Using credit cards frequently and everywhere, including for online shopping, therefore increases the risk of fraud and of use by an unauthorized party. A critical consequence of the growing use of credit cards is the occurrence of fraudulent transactions, which allow an illegitimate user to obtain money and goods through unauthorized use. Credit card fraud (CCF) has become a main issue for financial institutions, the credit card industry, the community, and cardholders. Thus, governments, businesses, and financial institutions are paying more attention to this security issue and applying different security systems to detect and suspend fraudulent transactions [1,13,17,21].
Fig. 1. Credit card transaction approval procedure
Many academics and researchers have proposed fraud detection systems and technical methods to address credit card fraud. Still, these detection systems and solutions face many issues and challenges, such as skewed datasets and imbalanced classification, a lack of public datasets, anomaly/outlier detection, and concept drift. Artificial Intelligence (AI) and Machine Learning (ML) have become effective techniques used in different applications to ensure cybersecurity and to build security prevention and detection systems. Credit card issuers should therefore improve their security systems by implementing advanced credit card fraud prevention and fraud detection methods. Machine learning-based techniques can continually improve the performance and prediction accuracy of fraud prevention and detection based on analysis of cardholder behavior. However, changes in cardholder behavior and profile, and in the fraud techniques used, negatively affect the classification performance and prediction accuracy of CCFD systems [3]. There are many challenges faced by CCFD systems, such
as a lack of public real-world datasets, skewed data and imbalanced classification, anomaly/outlier detection, support for real-time detection, concept drift, and the need to reduce large volumes of data [1,7,9,10].
1.2 Ensemble Learning
Each ML technique is known to have its own strengths and weaknesses with respect to errors, overfitting, bias, prediction accuracy, and robustness. The classification performance and prediction accuracy of each ML technique also vary depending on the dataset. Thus, we cannot identify a single ML-CCFD system or method that is best for every dataset. While searching for the single best-performing model to detect fraudulent transactions, we realized the benefit of combining several ML techniques to ensure the performance and accuracy of detection, as mentioned in many research papers, for example [14]. Ensemble learning combines many ML classifier models trained on the same datasets and obtains the final prediction from all of them instead of depending on the single best-performing model, as shown in Fig. 2.
Fig. 2. Ensemble learning
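As a generic illustration of the idea in Fig. 2, the sketch below combines the 0/1 predictions of several classifiers by hard majority voting. This is one of the common ensemble types, shown only to make the concept concrete; it is not the Man-Ensemble method proposed in this paper, and the function name `majority_vote` and the three prediction lists are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by hard voting.

    predictions holds one list of labels per model; all lists cover
    the same transactions in the same order.
    """
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Predictions of three hypothetical classifiers for four transactions
# (1 = fraudulent, 0 = legitimate).
model_a = [0, 1, 1, 0]
model_b = [0, 1, 0, 0]
model_c = [1, 1, 1, 0]

print(majority_vote([model_a, model_b, model_c]))
```

With an odd number of binary classifiers there are no ties, and a single model's mistake (here, model_c on the first transaction or model_b on the third) is outvoted by the other two.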
The main advantages of using ensemble learning algorithms are better robustness and better predictions. Many ML techniques show variance in their predictions or in model stability. Ensemble learning techniques therefore provide more stable predictions than an individual model, and they also offer better predictive skill than an individual model.1 Using ensemble learning to combine simple ML algorithms can be preferable to applying complex multi-layer algorithms.
1 https://machinelearningmastery.com/ensemble-learning-algorithms-with-python/.
This paper proposes our credit card fraud detection system, Man-Ensemble CCFD, which uses an ensemble-learning model with two stages of classification and detection. Stage one (ML-CCFD) utilizes ten machine learning (ML) algorithms and trains them individually to classify credit card transactions into either class 1 (fraudulent transaction) or class 0 (legitimate transaction). We then compared their classification reports, specifically precision, recall (sensitivity), and F1-score, and selected the most accurate ML algorithms based on their classification performance and prediction accuracy. The second stage is an ensemble-learning CCFD that applies our Man-Ensemble method to the most effective ML algorithms chosen from stage one. This stage produces the final prediction with our method instead of using common types of ensemble learning, such as voting, stacking, boosting, and others. The results of our framework showed the effectiveness and efficiency of our fraud detection system compared to using ML algorithms individually, which suffer from weaknesses such as errors, overfitting, bias, limited prediction accuracy, and limited robustness.

To summarize, the contributions of this paper are as follows:
– Applying our proposed CCFD system to two different datasets, one real-world and one synthetic, to ensure the accuracy of the final prediction of our framework.
– Training the most common ML algorithms used in credit card fraud detection individually.
– Using our detection method to find the final prediction of our ensemble-learning model.

The rest of this paper is organized as follows. Section 2 presents a literature review of related works that used ensemble learning in credit card fraud detection. In Sect. 3, we describe our methodology and proposed framework. Section 4 presents the results assessment and performance evaluation. Section 5 summarizes our conclusions and final results.
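The stage-one comparison described above relies on precision, recall (sensitivity), and F1-score for the fraud class. The sketch below shows how these three metrics follow from the counts of true positives, false positives, and false negatives. The function name `fraud_metrics` and the toy labels are hypothetical, and an actual experiment would more likely use a library routine than this hand-written version.

```python
def fraud_metrics(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the positive (fraud) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Eight transactions: three are truly fraudulent (label 1).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

for name, value in fraud_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data these class-specific metrics are more informative than plain accuracy: a model that labels every transaction as legitimate would be highly accurate overall yet have zero recall for fraud.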
2 Literature Review
Many academics and researchers have presented systems and methods for detecting fraudulent credit card transactions using ML techniques. Using ensemble methods (EMs) for classification is considered one of the interesting areas of research in ML, and much recent research mentions the importance of EMs for improving the classification performance of classifier models. Using ensemble strategies in unsupervised outlier detection is a common way to improve the estimation of outlier scores [22]. Supervised and unsupervised outlier detection algorithms have also been combined using sequential [15] and parallel [18] ensemble strategies. Sohony et al. [16] presented an ensemble machine learning approach on real-world datasets that combined a random forest and a neural network. It predicted the label of a new sample with high accuracy and confidence.
Carcillo et al. proposed a hybrid technique that combines supervised and unsupervised methods to improve fraud detection accuracy, using unsupervised learning techniques to help fraud detection systems find anomalies. They computed unsupervised outlier scores at various levels of granularity and used a real-life credit card dataset for fraud detection. The final results showed that the combination is effective and improves detection accuracy [6]. In [5], Carcillo et al. designed, implemented, and tested an open-source fraud detection platform (SCARFF) for detecting fraudulent credit card transactions. The framework used big data tools (Kafka, Spark, and Cassandra) with two ML classifiers (a Feedback Random Forest classifier and a Delayed classifier). The results demonstrated the framework's scalability, efficiency, and accuracy over a significant stream of transactions. Motwani et al. applied several ML techniques and evaluated them on actual credit card datasets; they then proposed a predictive ensemble model for credit risk detection. The proposed model was evaluated on different performance metrics, and the results showed improved prediction accuracy and correlation coefficient [11]. Polikar et al. defined ensemble learning as an ML paradigm that combines multiple base learners (individual ML algorithms) to solve the same problem [12]. Arya et al. proposed a predictive framework (DEAL) based on an extra-tree ensemble and a deep neural network that represents each transaction as a tensor to reveal the latent and inherent relations between spending patterns and fraudulent transactions [2]. Young et al. presented a deep super-learning approach and compared its log loss and accuracy with those of deep neural networks. Their results showed that the deep super-learner performed better than the individual base learners and, in some cases, better than deep neural networks [19]. Hamori et al. conducted a study comparing the effectiveness of ensemble learning and deep learning on default payment data and analyzed their prediction accuracy and classification ability. The study included three ensemble-learning methods (bagging, random forest, and boosting) and different neural network methods with varying activation functions. The results indicated that the classification ability of boosting is better than that of the other ML methods, including neural networks [8]. Zareapoor and Shamsolmoali trained various data mining techniques on a real-life credit card transaction dataset and evaluated each methodology based on specific design criteria. The observed results showed that the bagging ensemble classifier is the best classifier for constructing a fraudulent transaction detection model [20].
3 Proposed Method
3.1 CCFD System
ML algorithms are considered the most valuable cybersecurity and fraud detection techniques, but many have weaknesses, such as errors, overfitting, bias, prediction accuracy, and even their robustness level. Ensemble learning is an
assembly of multiple ML algorithms that are each trained and then combined, using one of the common types of ensemble learning, to get the final prediction. Hence, this paper uses many ML algorithms to ensure our model's classification performance, prediction accuracy, and robustness. We propose our credit card fraud detection system, called the Man-Ensemble CCFD system, based on an ensemble-learning model that has two stages of prediction, as illustrated in Fig. 3.
Fig. 3. Our proposed man-ensemble CCFD system
We utilized ten ML algorithms to classify transactions into fraud (class 1) and non-fraud (class 0) in stage one (ML-CCFD). The ten ML algorithms used in stage one are as follows: Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbor (KNN), Gaussian Naive Bayes (GNB), Support Vector Classifier (SVC), eXtreme Gradient Boosting (XGBoost), Stochastic Gradient Descent (SGD), Gradient Boosting Classifier (GBC), and Light Gradient Boosting (LGB). We used five of these ML techniques in our earlier work [4], comparing them to find the one that detects fraudulent transactions most accurately on our data. That earlier work showed that the RF classifier model was the best ML method due to its highest accuracy, sensitivity, and
AUPRC. We have therefore doubled the number of ML techniques used in the current experiment, and we also use an ensemble-learning technique to improve the performance and prediction accuracy.
3.2 Datasets and Feature Extraction
These classifier models are trained individually on two datasets available on Kaggle (CreditCard.csv, and FraudTrain.csv together with FraudTest.csv), as described in Table 1. We ran our experiment on two different datasets to check that our model's performance and prediction accuracy hold on more than one dataset. The first dataset, CreditCard.csv, is a real-world dataset with 31 features. Of these, 28 features (V1, V2, V3, ..., V28) were transformed with Principal Component Analysis (PCA) for confidentiality reasons. The features not transformed with PCA are 'Time', the seconds elapsed between each transaction and the first transaction in the dataset, and 'Amount', the transaction amount. The last feature is 'Class', which takes the value 1 for a fraudulent transaction and 0 for a legitimate one. This dataset contains credit card transactions made over two days in September 2013 by European cardholders, with 492 frauds out of 284,807 transactions. The dataset is therefore highly imbalanced: the fraudulent transactions (positive class) account for 0.172% of all transactions.²

Table 1. Our datasets
CreditCard dataset | Fraud dataset
Real-world credit card transaction dataset | Simulated/synthetic credit card transaction dataset
Transformed with PCA for confidentiality | Includes 1000 customers and 800 merchants
Made by credit cards in September 2013 by European cardholders, for research at Université Libre de Bruxelles (ULB) | Created by Brandon Harris, covering Jan 1st, 2019 to Dec 31st, 2020
492 frauds out of 284,807 transactions | 9,651 frauds out of 1,852,394 transactions
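The imbalance quoted for both datasets follows directly from the class counts in Table 1. A minimal sketch of the arithmetic is given below; the helper name `fraud_share` is illustrative and does not come from the paper.

```python
# Class counts are taken from Table 1; the helper name is illustrative.
def fraud_share(n_fraud: int, n_total: int) -> float:
    """Return the share of fraudulent transactions as a percentage."""
    return 100.0 * n_fraud / n_total


creditcard_pct = fraud_share(492, 284_807)    # CreditCard dataset
fraud_pct = fraud_share(9_651, 1_852_394)     # merged FraudTrain + FraudTest

print(f"CreditCard dataset: {creditcard_pct:.3f}% fraud")
print(f"Fraud dataset:      {fraud_pct:.3f}% fraud")
```

In both cases the positive class is well below 1% of the records, which is why the class distribution is rebalanced before training.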
The second dataset combines FraudTrain.csv and FraudTest.csv, which we merged into a single set of 1,852,394 credit card transactions with 24 features. It is a synthetic (simulated) credit card transaction dataset covering January 1, 2019, to December 31, 2020, generated with the Sparkov Data Generation tool by Brandon Harris.³ It contains 1,842,743 legitimate and 9,651 fraudulent transactions for 1000 customers dealing with a pool of 800 merchants. The features are as follows: transaction index, transaction date and time, credit card number, merchant name, merchant category, transaction amount, cardholder's name, cardholder's gender, cardholder's address, cardholder's latitude and longitude, cardholder's city population, cardholder's job, cardholder's date of birth, transaction number, UNIX time of the transaction, merchant's latitude and longitude, and target class.⁴ Both datasets are thus imbalanced, since the positive class is a small minority compared to the negative class in this binary classification problem. Figures 4 and 5 show the class ratio before and after undersampling, which we applied to correct the skewed distribution between fraudulent and legitimate transactions.
² https://www.kaggle.com/mlg-ulb/creditcardfraud
³ https://github.com/namebrandon/Sparkov_Data_Generation
Fig. 4. CreditCard dataset
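The undersampling step can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the target class ratio or the sampling procedure, so the sketch assumes random undersampling of the majority class down to a 1:1 ratio, and the function name `undersample` and the toy data are hypothetical.

```python
import random


def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    Assumption: a 1:1 target ratio; the paper does not state the ratio used.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if len(minority) > len(majority):
        minority, majority = majority, minority
    # Keep every minority row and an equally sized random majority sample.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Toy data (hypothetical): 3 fraud rows among 1000 transactions.
labels = [1 if i in (10, 500, 900) else 0 for i in range(1000)]
rows = [[float(i)] for i in range(1000)]

X_bal, y_bal = undersample(rows, labels)
print(len(y_bal), sum(y_bal))
```

Undersampling discards most legitimate transactions, so the balanced set is small; this keeps training fast but means the models see only a fraction of the majority class.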
3.3 Machine Learning and Ensemble Techniques
As shown in Fig. 6, stage one (ML-CCFD) trains ten ML algorithms to classify transactions as fraud (class 1) or non-fraud (class 0) and compares their classification performance to choose the most accurate models. The ten ML algorithms are Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbor (KNN), Gaussian Naive Bayes (GNB), Support Vector Classifier (SVC), eXtreme Gradient Boosting (XGBoost), Stochastic Gradient Descent (SGD), Gradient Boosting Classifier (GBC), and Light Gradient Boosting (LGB). The stage proceeds as follows: importing the datasets, preparing the features, undersampling the data, training the models, evaluating the classifiers, measuring their performance, and finally selecting the most effective learners.
⁴ https://www.kaggle.com/kartik2112/fraud-detection?select=fraudTrain.csv
Fig. 5. Fraud dataset
Fig. 6. Stage1 ML-CCFD
We compare the models' classification reports, namely precision, recall (sensitivity), F1-score, and the number of false negatives. We then choose the most accurate classifier models, based on their performance and prediction accuracy, to proceed to the next stage. The second stage, illustrated in Fig. 7, is the ensemble-learning CCFD model, which receives the effective learners from the previous stage.
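For reference, the four quantities compared in stage one can be computed from predicted and actual labels as sketched below. The sketch scores the fraud class (1) only; the paper does not state which averaging its classification reports use, so this choice, the function name, and the toy labels are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1 and false-negative count for the fraud class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "f1": f1, "false_negatives": fn}


# Toy labels (hypothetical): one missed fraud and one false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

A false negative here is a fraudulent transaction predicted as legitimate, which is the costliest error for a fraud detection system.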
Fig. 7. Stage2 Ensemble-learning CCFD
Fig. 8. Prediction accuracy for all ML algorithms for fraud datasets
A transaction is classified as fraud or non-fraud based on the final prediction of our method, rather than on a common ensemble type such as voting, stacking, or boosting. Our method derives the final prediction from the prediction probabilities of the effective classifier models, as explained in Algorithm 1. In our experiment, stage one identified RF, XGBoost, GBC, and LGB as the most efficient and accurate classifier models; these are the input of stage two, whose output is the final prediction for each transaction. In the second stage, the prediction probability of each efficient learner is computed and averaged across learners; the average is rounded to obtain the predicted value, which is then compared to the actual value to evaluate our method's accuracy.
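The averaging step described above can be sketched in a few lines. The sketch follows the logic of Algorithm 1 (average the class-0 probabilities of the selected learners, threshold at 0.5, and invert to get the fraud label), but the function name and the probability values are hypothetical; in practice the probabilities would come from the trained stage-one models.

```python
def man_ensemble_predict(class0_probas, threshold=0.5):
    """Average P(class 0) over the models for each record, then derive the label.

    class0_probas[m][i] is model m's probability that record i is legitimate.
    Returns 1 (fraud) or 0 (legitimate) for each record.
    """
    k = len(class0_probas)
    n = len(class0_probas[0])
    predictions = []
    for i in range(n):
        avg = sum(model[i] for model in class0_probas) / k
        is_legitimate = 1 if avg >= threshold else 0
        predictions.append(1 - is_legitimate)
    return predictions


# Hypothetical P(class 0) outputs of four selected learners on three records.
probas = [
    [0.90, 0.20, 0.55],
    [0.80, 0.10, 0.40],
    [0.95, 0.30, 0.45],
    [0.85, 0.25, 0.50],
]
y_pred = man_ensemble_predict(probas)
y_true = [0, 1, 1]
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print(y_pred, accuracy)
```

On the third record the learners disagree, and the average (0.475) falls just below the threshold, so the record is flagged as fraud; a majority vote over hard labels could decide such borderline cases differently.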
Fig. 9. False negative for all ML algorithms for fraud datasets
Fig. 10. Prediction accuracy for all ML algorithms for CreditCard datasets
Fig. 11. False negative for All ML algorithms for CreditCard datasets
Algorithm 1. Man-Ensemble Method
Input: most accurate ML learners
Output: final prediction
BestModels = [M1, M2, ..., Mk]
n = number of data records
k = number of best models
threshold = 0.5
for i = 1 → n do
    totalPrediction = 0
    for j = 1 → k do
        find predict_proba_Mj(X_Test[i]) for class 0
        totalPrediction += predict_proba_Mj(X_Test[i])
    end for
    avgPrediction = totalPrediction / k
    if avgPrediction ≥ threshold then
        avgPrediction = 1
    else
        avgPrediction = 0
    end if
    PredictedValue = 1 − avgPrediction
    allPredictedValue[i] = PredictedValue
end for
compare allPredictedValue == actualValue
find accuracy and other metrics
4 Results Assessment and Performance Evaluation
Tables 2 and 3 display the results of the first stage on our datasets, comparing the classification reports and prediction accuracy of the ten ML algorithms: Random Forest (RF), Logistic Regression (LR), Decision Tree (DT), K-Nearest Neighbor (KNN), Gaussian Naive Bayes (GNB), Support Vector Classifier (SVC), eXtreme Gradient Boosting (XGBoost), Stochastic Gradient Descent (SGD), Gradient Boosting Classifier (GBC), and Light Gradient Boosting (LGB). The stage one results show that the most effective ML algorithms and most accurate classification models are RF, XGBoost, GBC, and LGB. Tables 4 and 5 then display the final prediction results of the second stage, which combines these most effective models from stage one.
Table 2. Stage1 results for fraud datasets
Model | Precision | Recall | F1-score
RF | 0.95 | 0.87 | 0.90
LR | 0.40 | 0.50 | 0.44
DT | 0.92 | 0.86 | 0.89
KNN | 0.74 | 0.75 | 0.74
GNB | 0.74 | 0.50 | 0.45
SVC | 0.10 | 0.50 | 0.17
XGB | 0.97 | 0.95 | 0.96
SGD | 0.10 | 0.50 | 0.17
GBC | 0.98 | 0.97 | 0.97
LGB | 0.97 | 0.97 | 0.98
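As an illustration of how the Table 2 scores could drive the selection of learners for stage two, the sketch below ranks the models by F1-score and keeps the top four. The selection rule is an assumption, since the paper does not state an explicit rule or threshold; on the Table 2 values it happens to yield the same four models the paper reports (RF, XGBoost, GBC, and LGB).

```python
# F1-scores copied from Table 2 (fraud dataset).
f1_scores = {
    "RF": 0.90, "LR": 0.44, "DT": 0.89, "KNN": 0.74, "GNB": 0.45,
    "SVC": 0.17, "XGB": 0.96, "SGD": 0.17, "GBC": 0.97, "LGB": 0.98,
}


def select_top(scores, k=4):
    """Return the k model names with the highest score (assumed rule)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


selected = select_top(f1_scores)
print(selected)
```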
Table 3. Stage1 results for CreditCard datasets
Model | Precision | Recall | F1-score
RF | 0.99 | 0.94 | 0.96
LR | 0.40 | 0.50 | 0.45
DT | 0.97 | 0.93 | 0.95
KNN | 0.67 | 0.63 | 0.64
GNB | 0.94 | 0.84 | 0.88
SVC | 0.53 | 0.52 | 0.27
XGB | 0.95 | 0.94 | 0.95
SGD | 0.40 | 0.50 | 0.45
GBC | 0.96 | 0.94 | 0.95
LGB | 0.95 | 0.96 | 0.96
Table 4. Stage2 results for fraud datasets
Precision | Recall | F1-score
0.98 | 0.97 | 0.97

Table 5. Stage2 results for CreditCard datasets
Precision | Recall | F1-score
0.96 | 0.94 | 0.95
The results of our ensemble model show an improvement in prediction accuracy and model performance, as shown in Figs. 8, 9, 10 and 11.
Our method reduces the number of false negatives (fraudulent transactions classified as legitimate), which is very important for reducing cost and detecting more fraud instances. Our framework's results therefore underline the effectiveness and efficiency of our fraud detection system compared to ML algorithms used individually, which suffer from errors, overfitting, bias, limited prediction accuracy, and limited robustness.
5 Conclusion
The main consequence of the growing use of credit cards is the occurrence of fraudulent transactions, which allow an illegitimate user to obtain money and goods through unauthorized usage. Credit card fraud (CCF) has become a major issue for financial institutions, the credit card industry, the community, and cardholders. Governments, businesses, and financial institutions therefore pay more attention to this security issue and apply different detection systems, such as those based on Artificial Intelligence (AI) and Machine Learning (ML), to detect and suspend fraudulent transactions. This paper proposes a credit card fraud detection system (Man-Ensemble CCFD) based on an ensemble-learning model with two prediction stages. Stage one (ML-CCFD) uses ten machine learning (ML) algorithms to classify credit card transactions as fraudulent (class 1) or legitimate (class 0). Their classification reports, specifically precision, recall (sensitivity), and F1-score, are compared, and the most accurate models, selected on their performance and prediction accuracy, proceed to the second stage. The second stage is an ensemble-learning CCFD model that assembles the most effective ML algorithms from stage one to obtain the final prediction, rather than using a common ensemble type such as voting, stacking, or boosting. The results showed the effectiveness and efficiency of our fraud detection system compared to ML algorithms used individually, which suffer from weaknesses such as errors, overfitting, bias, limited prediction accuracy, and limited robustness. The results of our ensemble method on two different datasets show an improvement in prediction accuracy and classification performance. The method also yields the smallest number of false negatives (missed fraudulent transactions) compared to single ML learners, which is essential for reducing errors and costs and for detecting more fraud instances.
6 Future Work
Our method proved accurate and effective in detecting fraudulent transactions on a PCA-transformed real-world dataset and on a synthetic credit card transaction dataset. In the future, we recommend combining it with neural network and deep learning techniques to check their accuracy and efficiency, and applying it to untransformed real-world credit card transaction datasets and in a real-time detection system. As is known, most research studies and projects so far have been offline fraud detection systems that work on transformed data or on private datasets, for confidentiality reasons.
Acknowledgment. This work was supported by NSF under grant agreement DMS2022448, and the Center for Science of Information (CSoI), an NSF Science and Technology Center, under Grant Agreement CCF-0939370.
References
1. Abdallah, A., Maarof, M.A., Zainal, A.: Fraud detection system: a survey. J. Netw. Comput. Appl. 68, 90–113 (2016)
2. Arya, M., Sastry, G.H.: DEAL-'deep ensemble algorithm' framework for credit card fraud detection in real-time data stream with google TensorFlow. Smart Sci. 8(2), 71–83 (2020)
3. Awoyemi, J.O., Adetunmbi, A.O., Oluwadare, S.A.: Credit card fraud detection using machine learning techniques: a comparative analysis. In: 2017 International Conference on Computing Networking and Informatics (ICCNI), pp. 1–9 (2017)
4. Baabdullah, T., Alzahrani, A., Rawat, D.B.: On the comparative study of prediction accuracy for credit card fraud detection with imbalanced classifications. In: 2020 Spring Simulation Conference (SpringSim), pp. 1–12 (2020)
5. Carcillo, F., Dal Pozzolo, A., Le Borgne, Y.-A., Caelen, O., Mazzer, Y., Bontempi, G.: SCARFF: a scalable framework for streaming credit card fraud detection with spark. Inf. Fusion 41, 182–194 (2018)
6. Carcillo, F., Le Borgne, Y.-A., Caelen, O., Kessaci, Y., Oblé, F., Bontempi, G.: Combining unsupervised and supervised learning in credit card fraud detection. Inf. Sci. 557, 317–331 (2021)
7. Dighe, D., Patil, S., Kokate, S.: Detection of credit card fraud transactions using machine learning algorithms and neural networks: a comparative study. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), pp. 1–6. IEEE (2018)
8. Hamori, S., Kawai, M., Kume, T., Murakami, Y., Watanabe, C.: Ensemble learning or deep learning? Application to default risk analysis. J. Risk Financ. Manag. 11(1), 12 (2018)
9. Jurgovsky, J., et al.: Sequence classification for credit-card fraud detection. Expert Syst. Appl. 100, 234–245 (2018)
10. Modi, K., Dayma, R.: Review on fraud detection methods in credit card transactions. In: 2017 International Conference on Intelligent Computing and Control (I2C2), pp. 1–5 (2017)
11. Motwani, A., Bajaj, G., Mohane, S.: Predictive modelling for credit risk detection using ensemble method. Int. J. Comput. Sci. Eng. 6(6), 863–867 (2018)
12. Polikar, R.: Ensemble based systems in decision making. IEEE Circ. Syst. Mag. 6(3), 21–45 (2006)
13. Popat, R.R., Chaudhary, J.: A survey on credit card fraud detection using machine learning. In: 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), pp. 1120–1125 (2018)
14. Randhawa, K., Loo, C.K., Seera, M., Lim, C.P., Nandi, A.K.: Credit card fraud detection using AdaBoost and majority voting. IEEE Access 6, 14277–14284 (2018)
15. Rayana, S., Zhong, W., Akoglu, L.: Sequential ensemble learning for outlier detection: a bias-variance perspective. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1167–1172 (2016)
16. Sohony, I., Pratap, R., Nambiar, U.: Ensemble learning for credit card fraud detection. In: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, pp. 289–294 (2018)
17. Tiwari, P., Mehta, S., Sakhuja, N., Kumar, J., Singh, A.K.: Credit card fraud detection using machine learning: a study. arXiv preprint arXiv:2108.10005 (2021)
18. Veeramachaneni, K., Arnaldo, I., Korrapati, V., Bassias, C., Li, K.: AI²: training a big data machine to defend. In: 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity), IEEE International Conference on High Performance and Smart Computing (HPSC), and IEEE International Conference on Intelligent Data and Security (IDS), pp. 49–54 (2016)
19. Young, S., Abdou, T., Bener, A.: Deep super learner: a deep ensemble for classification problems. In: Bagheri, E., Cheung, J.C.K. (eds.) Canadian AI 2018. LNCS (LNAI), vol. 10832, pp. 84–95. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-89656-4_7
20. Zareapoor, M., Shamsolmoali, P., et al.: Application of credit card fraud detection: based on bagging ensemble classifier. Procedia Comput. Sci. 48(2015), 679–685 (2015)
21. Zhang, X., Han, Y., Wei, X., Wang, Q.: HOBA: a novel feature engineering methodology for credit card fraud detection with a deep learning architecture. Inf. Sci. 557, 302–316 (2021)
22. Zimek, A., Campello, R.J.G.B., Sander, J.: Ensembles for unsupervised outlier detection: challenges and research questions a position paper. ACM SIGKDD Explor. Newsl. 15(1), 11–22 (2014)
Unsupervised Machine Learning Methods for City Vitality Index
Jean-Sébastien Dessureault1(B), Jonathan Simard2, and Daniel Massicotte1
1 Université du Québec à Trois-Rivières, Trois-Rivières, Québec, Canada
{sebastien.dessureault,daniel.massicotte}@uqtr.ca
2 Cellule d'expertise en robotique et I.A., Cégep de Trois-Rivières, Trois-Rivières, Québec, Canada
[email protected]
Abstract. This paper addresses the challenge of evaluating and predicting a district vitality index (VI) over the years. There is no standard method to do so, and it is even more complicated to do it retroactively for past decades. Nevertheless, it is essential to evaluate and learn features of the past in order to predict a VI in the future. This paper proposes a method to evaluate such a VI, based on a k-means clustering algorithm. The meta-parameters of this unsupervised machine learning technique are optimized by a genetic algorithm. Based on the resulting clusters and VI, a linear regression is applied to predict the VI of each district of a city. The weight of each feature used in the clustering is calculated using a random forest regressor. The method is applied to the city of Trois-Rivières. Each VI is defined by a magnitude of vitality and a cluster that can be used to compare districts. The consistency of the clusters is assessed using a Silhouette index (SI). The results show the VI and the cluster membership of each district. Tables and graphics display different analyses of the data, leading to the conclusion that this method can provide powerful insight for urbanists and inform the drafting of a city plan in the smart city context.
Keywords: Smart city · Intelligent urbanism · District vitality index · k-Means algorithm · Random forest algorithm · Genetic algorithm
1 Introduction
Cities are constantly evolving. Too often, districts in a city are forgotten for years and become devitalized without warning, and by then it is often too late to act. People and businesses leave such a district because of factors such as the disuse of houses and buildings, poor economic activity, the crime rate, and so on. Although it is easy to notice when a district is already devitalized, urbanists lack good tools to predict which district will become devitalized, and when.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 230–246, 2022. https://doi.org/10.1007/978-3-031-10464-0_15
Intelligent urbanism [5–8] is one part of the large spectrum of smart cities. It aims to help urbanists better read the features of their city and predict how the urban territory will change. Several papers study different city district indexes. They all study some geographical region and geolocalized features from the past to predict what will happen in the future. The case of Attica (Athens area), in which urban growth is predicted, was proposed in [9]. That work presents an artificial intelligence approach integrated with a geographic information system (GIS) for modeling urban evolution. It uses a fuzzy logic system with a c-means clustering algorithm to divide the territory using fuzzy frontiers. In this system, each geographical position has a level of membership defined by a membership function [10]. Each cluster represents a specific level of urban growth. The system also uses a multilayer neural network (MNN) to learn and predict urban growth in the Attica area, by analyzing population changes over time and by building patterns. All the geographical data are managed by a GIS. Among many features, nine were selected to feed the system: population, population growth in the last decade, number of buildings, growth in the number of buildings in the last decade, use of the residential sector, use of the commercial sector, use of the industrial sector, use of the public sector, and other uses. The results showed a clear profile for each district. Other papers do similar work but model the territory using a cellular automaton [3,4]. This is a good approach when features can be precisely geolocalized. In this case, it is possible to extract information and place it into a two-dimensional grid. From this grid, simulations can be run according to previously learned rules. The results of these simulations can give hints of what the territory will look like in the future.
The cellular automaton is typically combined with other machine learning techniques to give more complete results [1,2]. There is no standard for the evaluation and prediction of a vitality index (VI). Some papers [11–14] refer to a VI, each defining its own set of features (both qualitative and quantitative) and methods. Since cities do not archive the same data over the years, it is difficult to establish a standard set of features for the evaluation of a VI. Each city must use the consistent data available from the last decades. The main contributions of this paper are:
i) We propose a method based on ML algorithms to define, evaluate, and predict a VI in a city. Validation is done on the city of Trois-Rivières. We describe the data pre-processing methodology, such as normalizing and representing features and filling gaps. The proposed city vitality index has two parts: a letter representing its membership in a district profile (found in the clustering process), and a number representing the level of vitality (from 0 to 100). For instance, a VI of C67 means "a district having a C cluster profile and a normalized vitality level of 0.67". This notation is important because it gives not only a level of vitality but also indicates which districts are similar. With this information, districts can easily be compared.
ii) We define how unsupervised learning is used to calculate the VI and to make predictions over the years. We propose the use of a k-means algorithm (unsupervised learning) that partially defines the index. We also explain the use of feature-weighted inputs with a stochastic gradient descent technique.
iii) Finally, we show how a genetic algorithm (GA) is used to optimize the clustering parameters.
The proposed method based on ML algorithms provides good insights for urbanists. This study aims to answer questions from city urbanists in support of their urbanism plan. They wanted a clear view of the vitality trend for each of the first belt districts. They wanted to group those districts by similarity of feature values. Finally, they wanted a better presentation of the available data. This method has the following advantages:
1) It can be generalized: it can be reused for several types of indexes, such as criminality indexes, health indexes, or economic indexes.
2) It represents the VI through decades and allows predictions about the future.
3) It compares each district with the other districts using a clustering technique and a Silhouette index, which evaluates the clustering consistency.
This work is limited by its crisp clustering technique: the VI is not adapted near the borders of the districts, as it would be if a fuzzy technique were used. The method does not include a dimensionality reduction technique; this aspect was not studied, since urbanists wanted all available features of the city of Trois-Rivières. Finally, the method has not been tested on large-scale data. The rest of this paper is organized as follows: Sect. 2 describes the proposed methodology, Sect. 3 presents the results, Sect. 4 discusses the results and their meaning, and Sect. 5 concludes this research.
2 The Proposed Method for the Vitality Index
2.1 Selected Features
According to the Trois-Rivières urbanists, the "vitality" of a district can be defined by the strength of its economy and the health and social status of its citizens. Unlike the urban growth index, which tells whether a territory is occupied by urban space, the VI refers to the economic health of an urban territory and to the social condition of its citizens. This paper aims to define the VI, to evaluate it according to the features of the city of Trois-Rivières, and to predict this index for each targeted district in the future. In the urbanism context, we have access to massive amounts of data. The first step was to prioritize and select the features needed to calculate a VI. This was done in collaboration with urbanism experts, who selected the features they thought could have a significant influence on a VI. Table 1 presents the eight selected features and the pre-processing applied to them. The features were collected by dissemination area (DA) and later converted to districts. In Canada, a DA is formally defined as a small area composed of one or more neighbouring dissemination blocks, with a population of 400 to 700 persons. All of Canada is divided into dissemination areas. As shown in Table 1, pre-processing was applied to each feature. First, every feature was normalized using a MinMax function, based on the assumption that each feature has the same importance in the VI. Equation (1) shows the MinMax normalization formula. It rescales a value to the range 0 to 1, mapping the smallest value to 0 and the highest to 1. We hypothesize that each feature has equal importance in the computation and prediction of the vitality index.
$$z = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{1}$$
Table 1. Vitality index features and their pre-processing
Features | MinMax | Log 10 | Inv. | Repl.avg.
1. Major renovation permit | Yes | Yes | No | No
2. Prop. dwelling major repairs | Yes | Yes | No | No
3. Prop. dwelling minor repairs | Yes | Yes | No | No
4. Material deprivation index | Yes | Yes | Yes | Yes
5. Social deprivation index | Yes | Yes | Yes | Yes
6. Average single-family homes | Yes | Yes | No | No
7. Median value per dwelling | Yes | Yes | No | Yes
8. Housing vacancy rate | Yes | Yes | Yes | Yes
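The four pre-processing steps listed in Table 1 (MinMax, Log 10, Inv., Repl.avg.) can be sketched for a single feature column as follows. Several details are assumptions rather than statements from the paper: the order of the steps (fill, log, MinMax, invert), the offset of 1 inside the logarithm to avoid log10(0), the use of `None` to mark missing data, and the function name and sample values.

```python
import math


def preprocess(values, use_log=True, invert=False, fill_average=False):
    """Apply the Table 1 steps to one feature column.

    None marks a missing value; it must be filled (fill_average=True)
    before the other steps can run.
    """
    if fill_average:
        known = [v for v in values if v is not None]
        mean = sum(known) / len(known)
        values = [mean if v is None else v for v in values]
    if use_log:
        # Assumed offset of 1 so that a raw value of 0 stays defined.
        values = [math.log10(1.0 + v) for v in values]
    lo, hi = min(values), max(values)
    # MinMax normalization, Eq. (1); a constant column maps to 0.
    scaled = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    # Inversion keeps the convention "0 is worst, 1 is best".
    return [1.0 - z for z in scaled] if invert else scaled


# Hypothetical housing vacancy rates for four areas, one value missing.
vacancy = [0.0, 9.0, None, 99.0]
vacancy_scaled = preprocess(vacancy, invert=True, fill_average=True)
print(vacancy_scaled)
```

With inversion, the area with the highest vacancy rate maps to 0 and the one with the lowest to 1, so every pre-processed feature points in the same direction.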
Normalization makes it easier to calculate the index and to present features on the same scale in different graphics. The presentation of the features is especially important, since it must be interpreted by urbanists. To obtain a better feature distribution, a base-10 logarithm is applied to put the features on a logarithmic scale. Some features were inverted to keep the scale consistent: 0 is always the worst feature value and 1 is always the best. Finally, average values were used where no data were available.
2.2 Framework Design to Predict Vitality Index
The proposed framework includes several parts that together predict a VI. First, the model must learn the VI from the districts' features. Since the past outputs are unknown, an unsupervised learning technique (the k-means algorithm) had to be used. This algorithm's parameters are optimized with a GA. Afterward, with the VI available for the three years 2006, 2011 and 2016, spanning a 10-year range, the method can predict the evolution of each district in the future. This prediction is made using a linear regression.
Figure 1 shows the block diagram of the dataflow and the ML techniques used, organized in three parts: (1) GA and clustering, (2) random forest, and (3) linear regression.
Fig. 1. Proposed framework design in three parts (1) genetic algorithm and clustering, (2) random forest, and (3) linear regression.
Evaluating and predicting a VI as a whole system requires several types of ML algorithms.
2.3 Unsupervised Learning – k-Means Clustering
To determine the VI, it is necessary to use an unsupervised learning technique, since the input data carry no labels. The algorithm assigns each district a cluster reference letter according to the similarity of its features. A k-means algorithm was used to determine the clusters. Equation (2) defines the k-means clustering objective, where J is the clustering function, k is the number of clusters, n is the number of features, $x_i^{(j)}$ is the input (feature i in cluster j), and $c_j$ is the centroid of cluster j. Centroids are obtained by randomly trying some values and selecting the best.
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \tag{2}$$
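To make the objective in (2) concrete, the sketch below implements a plain Lloyd-style k-means loop and evaluates J as the sum of squared distances between each point and its assigned centroid. It is an illustration only: the initialization (centroids drawn at random from the data), the function names, and the toy points are assumptions, not the configuration used in the paper.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def objective(points, labels, centroids):
    """J: sum of squared distances between each point and its centroid."""
    return sum(sq_dist(p, centroids[j]) for p, j in zip(points, labels))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed init: random data points
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Toy data (hypothetical): two well-separated groups of three points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(points, k=2)
J = objective(points, labels, centroids)
print(labels, J)
```

Because the result depends on the initial centroids, implementations usually repeat the run several times and keep the solution with the lowest J.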
Several metrics can measure clustering performance, although not every metric is compatible with every algorithm. Since a genetic algorithm was used to optimize the clustering (using several techniques), we had to choose a metric suited to the chosen clustering technique. Since the k-means algorithm was selected, the clustering performance was measured with the "Silhouette" metric, documented by Kaufman and Rousseeuw [15] and [16]. This metric involves two intermediate quantities. The mean distance between a point i and the other points of its own cluster $C_i$ is given in (3). The smallest mean distance between point i and the points of any other cluster $C_k$ is given in (4). Finally, (5) combines (3) and (4) into the Silhouette score, which indicates the quality (the consistency) of the clustering. The silhouette ranges from −1 to +1. Values from −1 to 0 indicate that the point is assigned to the wrong cluster, and values from 0 to 1 indicate that it is assigned to a good cluster. The higher the value, the better the cluster consistency [15].
$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \neq i} d(i, j) \tag{3}$$
$$b(i) = \min_{k \neq i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j) \tag{4}$$
$$s(i) = \frac{b(i) - a(i)}{\max\left(a(i), b(i)\right)}, \quad \text{if } |C_i| > 1 \tag{5}$$
If most elements have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. In our case, we had to create clusters over 5, 6, 7 or 8 dimensions, where a high silhouette score is much harder to obtain than with 2- or 3-dimensional features.
2.4 Genetic Algorithms
Several features are relevant to calculating a VI, but no label can be assigned to each set of features, so a supervised learning algorithm cannot be used. Unsupervised learning algorithms, however, allow a machine to learn without labels. There are several such techniques, each with different parameters, and consequently a large number of possible configurations. To optimize the results of the clustering, a GA (Cedeno, 1995) is used. Four genes are used in the evolution process: the k-means maximum-iteration parameter, the k-means n-centroid parameter, the number of clusters to find, and a list of features. Table 2 shows the configuration of the GA. The fitness function is the silhouette score of the clustering, the cluster-consistency metric defined by (5). The GA parameters are the following:
1. Number of generations: the number of iterations of the fitness/breeding/mutation process.
2. Chromosomes: the number of individual configurations tested by the process (the population).
3. Initial chromosome initialisation: the method used to initialize the chromosomes at generation 0.
4. Mutation rate: at breeding time, the percentage of chromosomes that do not inherit from parents but are randomly reinitialized.
5. Percentage of chromosomes fit enough to breed: a threshold on the fitness function. The chromosomes ranking better than this value are bred into the next generation.
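The evolution loop described by these five parameters can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the gene ranges, the uniform crossover, and all names are hypothetical, and `toy_fitness` is a stand-in so that the example runs on its own, whereas the paper's fitness is the silhouette score of the resulting clustering.

```python
import random

FEATURES = list(range(1, 9))  # the eight features of Table 1


def random_chromosome(rng):
    """One candidate configuration with the four genes (assumed ranges)."""
    return {
        "max_iter": rng.randint(10, 300),
        "n_init": rng.randint(1, 20),
        "n_clusters": rng.randint(2, 10),
        "features": sorted(rng.sample(FEATURES, rng.randint(2, 8))),
    }


def toy_fitness(c):
    """Stand-in fitness; NOT the silhouette score used in the paper."""
    return -abs(c["n_clusters"] - 5) - abs(len(c["features"]) - 4)


def evolve(fitness, generations=30, population=20, breed_fraction=0.5,
           mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [random_chromosome(rng) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        # Only the best-ranked share of the population is allowed to breed.
        parents = pop[:max(2, int(population * breed_fraction))]
        children = []
        while len(parents) + len(children) < population:
            if rng.random() < mutation_rate:
                # Mutation: the child is randomly reinitialized.
                children.append(random_chromosome(rng))
            else:
                # Uniform crossover: each gene comes from one of two parents.
                a, b = rng.sample(parents, 2)
                children.append({g: rng.choice((a, b))[g] for g in a})
        pop = parents + children
    return max(pop, key=fitness)


best = evolve(toy_fitness)
print(best, toy_fitness(best))
```

Since the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next.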
The configuration maximizing the silhouette score is displayed in Table 2, in the column "Best-Score"; the GA maximized the Silhouette score over different numbers of clusters and different subsets of the 8 features (Table 1). We reach a Silhouette score of 0.26 with Ninit = 14, Maxiter = 196, and 10 clusters, using features 2, 5, 4, and 3. In some application cases, however, the number of clusters is fixed by the urbanism experts, considering all features.
2.5 Feature-Weighted Inputs
Initially, we assumed that each feature has equal importance in the computation and prediction of the vitality index. However, one important question was the actual importance of each feature in the clustering process. To answer it, a loop evaluated every combination of features from the feature list. The clustering process returned a silhouette score for each feature combination. Given this list of feature configurations and silhouette scores, a random forest regressor was used to determine the importance of each feature. A random forest is an ensemble of n decision trees (n = 250, in this case). The result was a list of features ordered by importance, with their weights.
2.6 Linear Regression to Predict VI
The last stage of the proposed framework concerns the prediction of the future VI for each district of the first surrounding area. The VIs were available by DA, and there can be many DAs in each district. We had to regroup them by district and compute the linear regression line through the available years to predict the VI 10 years later. In some cases, however, there are few points in the cloud, and it is hazardous to draw a reliable prediction.
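The regression step can be illustrated with a short sketch. The VI values below are hypothetical placeholders for the DAs of one district, and `fit_line` is an illustrative helper (ordinary least squares), not the authors' code.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical VI values (numeric part) of three DAs in one district, per census year.
district_vi = {
    2006: [0.30, 0.34, 0.28],
    2011: [0.32, 0.35, 0.31],
    2016: [0.35, 0.37, 0.33],
}
# Flatten to one (year, VI) point per DA and per year.
years = [year for year, values in district_vi.items() for _ in values]
vis = [v for values in district_vi.values() for v in values]

slope, intercept = fit_line(years, vis)
prediction_2026 = slope * 2026 + intercept  # 10 years after the last available year
print(f"slope per year: {slope:.4f}, predicted VI in 2026: {prediction_2026:.3f}")
```

With only three years of history, the slope is poorly constrained, which is why such a prediction should be read with caution.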
3 Results Applied on the First Belt Districts in Trois-Rivières City
The Trois-Rivières territory as we know it has existed since a major merger of six cities and municipalities in 2002. Data from before this merger is therefore considered irrelevant. Trois-Rivières covers 334 km² and has 136 000 inhabitants. 40% of its territory is agricultural, 20% rural and 40% urban. It is situated at the junction of the St. Lawrence River and the St. Maurice River, about midway between Montreal and Quebec City. Trois-Rivières is also known for its major infrastructures for planes, trains and ships. Figure 2 shows a map of Trois-Rivières. The greyed part is the first belt districts (the important part for this study). The urbanisation of its area happened in three steps. The first one was prior to 1950. This area is called “first districts” or “central districts”. The “first belt” or “first agglomeration” was built between 1950 and 1980. Since then, the
ML for City Vitality Index
Fig. 2. The City of Trois-Rivières, Quebec, Canada. The grey color defines the first belt area.
new areas are known as the “second belt” area. Since the merger of 2002, residential development has been greater than foreseen. The city of Trois-Rivières needed insights to write its urbanism plan. In particular, there was a need to better foresee the vitality of the first belt districts, because of some demographic issues (weak growth, aging of the population, etc.). The city needed more information about the short-term vitality (5–10 years), medium-term vitality (10–20 years), and long-term vitality (20–30 years) of the first belt districts of Trois-Rivières. This study focuses on this vitality aspect. Many more aspects may be studied in future work.

3.1 Features Distribution and Representation
First, let us see an example of the distribution of each of the 8 features for DA 24370200 in year 2016. There is a similar distribution for each DA, for each available year. Section 2.1 (Table 1) shows the methodology used to obtain these values. For each of the 135 DAs, and for every year, all the features are represented on “radar” graphics. Since all features are normalized, they can be displayed on the same scale. Figure 3 shows an example of this radar graphic (DA 24370200 in year 2016). This type of graphic is useful to see at a glance how high an index is. The more surface covered, the higher the vitality index. Each axis can then be read separately. There is one axis per feature.

3.2 Vitality Index
From the graphic of Fig. 3, we can extract the average of the sum of the features. The result is also a normalized value, where a lower value means less vitality and a higher value means more vitality.
Fig. 3. Typical “radar” graphic used to represent the 8-dimensional features.
However, this information is incomplete. Some very different district profiles have the same average of the sum of the features. The best way to visualise the district profiles is to regroup them. This was done using the clustering technique described in Sect. 3.3. For this reason, this research defines the vitality index in two parts, as follows: a letter representing the profile (cluster), and a number representing the average of the sum of the features. For instance, “C38” means a “C” cluster with an average of the sum of the features of 0.38. The value 38 is the average of the sum of the features (0.38) multiplied by 100. This classification system is inspired by the work on star classification by Annie Jump Cannon [17]. In that two-dimensional notation system, a star could have a G5 type.

3.3 Clustering
As described earlier, the number of clusters and the number of features used were determined by a GA, including feedback analysis from urbanists. We were asked to use all 8 features and to divide the 135 DAs into 10 clusters. Figure 4 shows on the Y-axis the average of the sum of the features (year 2016) for each DA (X-axis). Vertical red lines divide the clusters, and horizontal dotted lines show the average of each cluster. The best way to visualize and interpret a cluster is to superpose the radar graphics of all its DAs. In Fig. 4, we can see that clusters B and D have about the same average of the sum of the features (around 0.28). Without their cluster profiles, it would be impossible to see the difference. Figure 5 shows the profile of cluster C. With that type of graphic, it is easier to compare the cluster profiles.
Figure 6 shows the distribution of the silhouette index across the 10 clusters using 8 features. We can note that there are very few clustering errors and that the silhouette score is 0.19.
Fig. 4. Feature average for 2016 for each DA, cluster division (separated by vertical red lines) and cluster average (horizontal dotted lines) for 10 clusters.
Fig. 5. Cluster C (2016) and its 19 stacked radar graphics representing DAs.
As mentioned earlier, the clustering process was optimized by a GA. The results shown in Table 2 for 10 clusters give a silhouette score of 0.19. The Ninit parameter is the number of times the k-means algorithm is run with different centroid seeds. Maxiter is the maximum number of iterations of the k-means algorithm for a single run.
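As an illustration of how these two parameters enter a clustering run, the sketch below uses scikit-learn, whose `KMeans` exposes parameters named `n_init` and `max_iter`; that the authors used this library is an assumption. The data is a random placeholder, so the printed score will not match the 0.19 reported in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Placeholder data standing in for the 135 DAs x 8 normalized features.
X = np.random.default_rng(0).random((135, 8))

# Parameters of the "Using all features" column of Table 2.
model = KMeans(n_clusters=10, n_init=14, max_iter=196, random_state=0)
labels = model.fit_predict(X)

per_sample = silhouette_samples(X, labels)  # one coefficient per DA, in [-1, 1]
overall = silhouette_score(X, labels)       # mean of the per-sample coefficients
# A negative coefficient means the DA is, on average, closer to another cluster.
negative = int((per_sample < 0).sum())
print(f"silhouette score: {overall:.3f}, DAs with negative coefficient: {negative}")
```

The per-sample coefficients are the quantities plotted in a silhouette diagram such as Fig. 6, and their mean is the overall score.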
Fig. 6. Silhouette metrics for 10 clusters using 8 features (2016 data). The red dotted line represents the mean of the clustering process.

Table 2. Chromosome clustering configuration maximizing the silhouette score (best-cluster) and specifying the number of clusters (fixed-cluster).

Genes               Best-score   Using all features
Ninit               14           14
Maxiter             196          196
Number of clusters  10           10
Features list       2, 5, 4, 3   All
Silhouette score    0.26         0.19

3.4 Weighting the Features
Urbanists wanted to know which features are the most relevant in the clustering process. Section 2.5 presented a methodology based on a random forest algorithm to weight the 8 features proposed by urbanists. The weights of the features in 2016 are given in Fig. 7. In order of importance, from the most important to the least important (the number in parentheses is the level of importance):
1. Feature 2 (0.3249): Proportion of dwellings requiring major repairs
2. Feature 5 (0.1895): Social deprivation index
3. Feature 4 (0.1228): Material deprivation index
4. Feature 3 (0.1033): Proportion of dwellings requiring minor repairs
5. Feature 7 (0.1014): Median value per dwelling
6. Feature 8 (0.0697): Housing vacancy rate
7. Feature 6 (0.0536): Average value of single-family homes
8. Feature 1 (0.0348): Major renovation permit
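The weighting procedure of Sect. 2.5 can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes scikit-learn, random placeholder data, a fixed hypothetical number of clusters, and an encoding in which each feature combination is represented by a binary inclusion mask that the random forest regresses against the silhouette score.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).random((135, 8))  # placeholder for the real features
n_features = X.shape[1]

masks, scores = [], []
for size in range(2, n_features + 1):
    for combo in combinations(range(n_features), size):
        data = X[:, list(combo)]
        # Hypothetical clustering settings, kept small so the loop stays fast.
        labels = KMeans(n_clusters=5, n_init=1, random_state=0).fit_predict(data)
        mask = np.zeros(n_features)
        mask[list(combo)] = 1.0  # 1 if the feature is part of the combination
        masks.append(mask)
        scores.append(silhouette_score(data, labels))

# 250 trees, as stated in Sect. 2.5.
forest = RandomForestRegressor(n_estimators=250, random_state=0)
forest.fit(np.array(masks), np.array(scores))
weights = forest.feature_importances_

ranking = sorted(enumerate(weights, start=1), key=lambda fw: fw[1], reverse=True)
for feature, weight in ranking:
    print(f"Feature {feature}: {weight:.4f}")
```

As with the weights listed above, the importances returned by the forest are normalized to sum to 1.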
Fig. 7. Level of importance of the features.
3.5 Predicting Vitality Indexes
This research can predict the numeric part of the VI, that is, the average of the sum of the features. For this numeric part, at least, a regression can learn from the past to estimate the future. Since the goal is to predict vitality in the first surrounding area, we first have to regroup the DAs. Figure 8 shows a district that includes four DAs. There are usually from 1 to 7 DAs per district. To make a prediction, we plot the VI (numeric part) of each DA included in the district. Then the regression line is added and used to predict the future. In this case, results must be interpreted with caution since there are only
Fig. 8. Map of TR-3 district of Trois-Rivières and its DAs.
three years of history to predict the year 2021. Figure 9 shows an example of such a prediction using past VIs (numeric part) computed on 7 DAs (for the 3 years 2006, 2011 and 2016). Each DA is represented by a blue dot. In this case, the 7 DAs are those included in the TRO-3 district. The districts of the city of Trois-Rivières are grouped into three sectors: TR stands for Trois-Rivières, TRO stands for Trois-Rivières Ouest and CAP stands for Cap-de-la-Madeleine. The map of Fig. 8 shows TR-3 (third sector of Trois-Rivières district), whereas Fig. 9 presents TRO-3 (third sector of Trois-Rivières Ouest district).
Fig. 9. Regression line and vitality prediction (numeric part) based on 7 DAs of the past (2006, 2011 and 2016), for the TRO-3 district of Trois-Rivières city, to predict 2021. Small symbols represent data of the past, and the big symbols (2021) are the predictions.
A summary of the entire process is presented in Table 3. For instance, based on Fig. 4, cluster A contains 17 DAs. For this cluster, the mean of the average of its DAs' 8 features is 0.3057. The vitality index, including the cluster part and the numeric part, is A31, where 31 is the rounded product 0.3057 × 100. A vitality index can be assigned to a single DA or to a cluster of multiple DAs.

Table 3. A summary of the 10 clusters, the number of DAs included in each one, the mean of the average of the 8 features for each DA in the cluster and the VI of the cluster.

Cluster  Number of DAs  Mean of feature average  Vitality index
A        17             0.3057                   A31
B        28             0.2658                   B27
C        19             0.3608                   C36
D        26             0.3031                   D30
E        12             0.3810                   E38
F        4              0.4854                   F49
G        6              0.5384                   G54
H        5              0.5155                   H52
I        5              0.5030                   I50
J        13             0.4064                   J41
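The two-part notation can be reproduced in a few lines. The function name `vitality_index` is illustrative; the cluster means are those of Table 3.

```python
def vitality_index(cluster: str, mean_feature_average: float) -> str:
    """Cluster letter followed by the mean feature average times 100, rounded."""
    return f"{cluster}{round(mean_feature_average * 100)}"


# Mean of feature average per cluster, copied from Table 3.
table3 = {
    "A": 0.3057, "B": 0.2658, "C": 0.3608, "D": 0.3031, "E": 0.3810,
    "F": 0.4854, "G": 0.5384, "H": 0.5155, "I": 0.5030, "J": 0.4064,
}
labels = [vitality_index(cluster, mean) for cluster, mean in table3.items()]
print(labels)
```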
Figure 10 shows a clear correlation between the number of DAs in a cluster and the VI (numeric part). Fewer DAs in a cluster means a higher VI; conversely, more DAs in a cluster means a lower VI. This observation means that more DAs have a low VI and are similar to one another. Those are easier to regroup in a consistent clustering process. There are also fewer high-VI DAs, so they cannot be clustered with many others.
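The strength of this relation can be checked directly from the values of Table 3. The sketch below computes a Pearson correlation coefficient; the choice of this particular statistic is an assumption, since the paper does not name the one used for Fig. 10.

```python
from math import sqrt

# Number of DAs and mean of feature average for clusters A to J (Table 3).
n_das = [17, 28, 19, 26, 12, 4, 6, 5, 5, 13]
mean_vi = [0.3057, 0.2658, 0.3608, 0.3031, 0.3810,
           0.4854, 0.5384, 0.5155, 0.5030, 0.4064]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(n_das, mean_vi)
print(f"Pearson r = {r:.3f}")  # negative: larger clusters have lower VIs
```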
4 Discussions
A method for calculating and predicting a VI has been developed in this paper. The main objective was to help urbanists better understand the raw data available from different sources, including their own. The data was collected, then pre-processed to optimize distribution and readability. Four ML algorithms were used to process the data: the k-means clustering technique, a GA, feature-weighted inputs and linear regression. There is no simple way to use a GA on clustering techniques: different clustering techniques require different types of parameters. There are also some issues with the evaluation of the clustering results. The silhouette score ultimately provides a good metric to evaluate cluster consistency. Obviously, clustering 8-dimensional indexes is a greater challenge than clustering the 2-dimensional point clouds usually presented. Due to this higher dimensionality, there were also some issues with the graphical representation of the features. The “radar” graphic type was very useful. The superposed radar graphic was also helpful to visualize cluster consistency, perhaps even better than the silhouette metric. There are some tables for interpreting silhouette scores, but they are based on 2-dimensional features. It is very difficult to know whether 8-dimensional feature clusters (like the ones in this research) are consistent or not. That is why
Fig. 10. Correlation between the number of DAs in the clusters and the VI (numeric part).
the superposition of the radar graphics was so important to confirm cluster consistency. This method remains valid when adding or removing features. Adding features would decrease cluster consistency; conversely, removing features would increase it. As mentioned in the introduction, many of the comparable papers use a cellular automaton. This method can be used only when the data is geolocalized and not bound to a district. In this study, a cellular automaton cannot be used, since this method is incompatible with the data. A better comparison can be made with the paper [9]. To see the differences, we must refer to Table 3 and Table 4.

Table 4. Membership values for the municipality by cluster for each decade [9].

Decade   A         B         C         D         E
61–71    0.114858  0.046508  0.126751  0.640658  0.071224
71–81    0.110278  0.070643  0.12308   0.3293    0.366699
81–91    0.087252  0.059034  0.097683  0.251538  0.504493
91–2001  0.067178  0.041567  0.077724  0.222563  0.590990
There are important differences between the two methods: i) theirs evaluates urban growth and ours evaluates the VI; ii) theirs evaluates the data over different and longer time lapses; iii) the city of Athens is clearly bigger than the city of Trois-Rivières. With that in mind, let us compare the methodologies. The limit of their contribution is that it regroups similar districts without clearly assigning and representing an urban growth value for each district. In Table 4, we can see a level of membership for each decade, for each cluster. This level of membership is based on a fuzzy c-means clustering algorithm. There is no clear representation of the urban growth level, unlike in Table 3, where we can clearly read a clustering part (such as “A”) and a VI level (such as “31”). Their prediction is entirely based on the clustering trend. The contribution of our method is that the VI is composed of two parts: 1) the cluster (a letter representing the similarity with others), and 2) a VI level (a number). The higher this VI, the more vitalized the district is. Both parts make it possible to better represent each district's VI, using stacked radar graphics like Fig. 5 (to illustrate the clustering) and a radar graphic like Fig. 3 (to illustrate the VI level). Stacked radar graphics are good for evaluating cluster consistency. This paper also includes the silhouette metric to better evaluate and represent the clustering consistency, as in Fig. 6. This evaluation and representation of the clustering consistency is also an improvement over the method of [9]. In the end, this research succeeds in converting scattered raw data into valuable knowledge, well presented and useful for writing the urban plan of the city of Trois-Rivières.
5 Conclusion
All the code written in this research has strong generalization prospects. In the near future, it could be converted into a more general urban tool to make predictions about a broader range of urban indexes, such as criminality, health or economic indexes. There is also room for improvement in the clustering part. The clustering method and the metrics could be studied and improved. One way of improving a next version would be to replace the k-means clustering algorithm with a c-means algorithm. This would have the benefit of fuzzifying the district limits. A model based on fuzzy clustering would reflect reality better than a model based on a crisp clustering algorithm. Improvement is also possible by making predictions on each available feature, instead of only on the numeric part of the VI (which is the average of those features). Since this system must deal with an input that includes multiple features, algorithms based on dimensionality reduction should be explored. This has not been done in this research since urbanists insisted on keeping the 8 original features. Adding this process to the pre-processing phase would likely improve cluster consistency. Finally, in a next version, it should also be possible to project the profile part of the VI, that is, to predict the shapes of the multidimensional VI of the future.

Acknowledgment. This work has been supported by the City of Trois-Rivières, the “Cellule d’expertise en robotique et intelligence artificielle” of the Cégep de Trois-Rivières and IDE Trois-Rivières.
References
1. Soltani, A., Karimzadeh, D.: The spatio-temporal modeling of urban growth using remote sensing and intelligent algorithms, case of Mahabad, Iran. TEMA J. Land Use Mobil. Environ. 6(2), 189–200 (2013). https://doi.org/10.6092/1970-9870/1547
2. Li, X., Gar-On Yeh, A.: Neural-network-based cellular automata for simulating multiple land use changes using GIS. Int. J. Geogr. Inf. Sci. 6(4), 323–343 (2002). https://doi.org/10.1080/13658810210137004
3. Tayyebi, A., Pijanowski, B., Tayyebi, A.H.: An urban growth boundary model using neural networks, GIS and radial parameterization: an application to Tehran, Iran. Landsc. Urban Plan. 100, 35–44 (2011). https://doi.org/10.1016/j.landurbplan.2010.10.007
4. Guan, Q., Wang, L., Clarke, K.C.: An artificial-neural-network-based, constrained CA model for simulating urban growth. Cartogr. Geogr. Inf. Sci. 32(4), 369–380 (2005). https://doi.org/10.1559/152304005775194746
5. Benninger, C.C.: Principles of intelligent urbanism: the case of the new capital plan for Bhutan. Ekistics 69(412), 60–80 (2002)
6. Santoso, S., Kuehn, A.: Intelligent urbanism: convivial living in smart cities, p. 5 (2013)
7. Murgante, B., Borruso, G., Lapucci, A.: Geocomputation and urban planning, pp. 1–17 (2009). https://doi.org/10.1007/978-3-540-89930-3_1
8. Wu, N., Silva, E.A.: Artificial intelligence solutions for urban land dynamics: a review. J. Plan. Lit. 24(3), 246–265 (2010)
9. Grekousis, G., Manetos, P., Photis, Y.N.: Modeling urban evolution using neural networks, fuzzy logic and GIS: the case of the Athens metropolitan area. Cities 30, 193–203 (2013)
10. Wang, L.X., Mendel, J.M.: Fuzzy basis functions, universal approximation, and orthogonal least-squares learning. IEEE Trans. Neural Netw. 3(5), 807–814 (1992)
11. White, G., Zink, A., Codecá, L., Clarke, S.: A digital twin smart city for citizen feedback. Cities 110, 103064 (2021)
12. Kaur, M.J., Mishra, V.P., Maheshwari, P.: The convergence of digital twin, IoT, and machine learning: transforming data into action. In: Farsi, M., Daneshkhah, A., Hosseinian-Far, A., Jahankhani, H. (eds.) Digital Twin Technologies and Smart Cities. IT, pp. 3–17. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-18732-3_1
13. Ida vitality index
14. Drewes, J.E., van Aswegen, M.: Determining the vitality of urban centres, pp. 15–25 (2011)
15. Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis (2009)
16. Gueorguieva, N., Valova, I., Georgiev, G.: M&MFCM: fuzzy C-means clustering with Mahalanobis and Minkowski distance metrics. Procedia Comput. Sci. 114, 224–233 (2017)
17. Cannon, A.J.: Classification of 1477 stars by means of their photographic spectra. Ann. Harvard Coll. Observ. 56, 65–114 (1912)
Machine Learning of a Pair of Charged Electrically Particles Inside a Closed Volume: Electrical Oscillations as Memory and Learning of System

Huber Nieto-Chaupis

Universidad Autónoma del Perú, Panamericana Sur Km 16.3 Villa el Salvador, Lima, Peru
[email protected], [email protected]
Abstract. In this paper, the problem of two charged particles inside a frustum is approached through the principles of Machine Learning, as condensed in the criteria of Tom Mitchell. In essence, the relevant equations from classical electrodynamics are presented. Once the power is derived, the systematic errors that might be intrinsic are introduced. These errors then drive the evolution of the system inside the frustum in both experience and learning. In the scenario of electrical oscillations caused by the repulsion forces, the errors would have an oscillatory behavior, a fact that is favourable to the system in the sense that it acquires memory and improves its learning from measurements done in the past.
Keywords: Machine Learning · Physics · Nonlinear systems

1 Introduction
Electricity is the branch of physics that studies the interactions of charged bodies. Coulomb's law, which dictates the dynamics of charged bodies, depends directly on the charge content of the pair of interacting bodies, namely on the separation $|\mathbf{r} - \mathbf{r}'|^2$ and the electric charges $Q_1 Q_2$. Here κ is a universal constant [1,2]. Because in general one has charge densities, the mathematical formulation is written as Eq. 1:

$$F = \kappa \frac{\int dV_1\, \rho_1 \times \int dV_2\, \rho_2}{|\mathbf{r} - \mathbf{r}'|^2}. \tag{1}$$

In cases where the densities ρ change in time, the spatial separation is not fixed either, and one expects variations along the space-time where the electric force applies. Thus one has a more accurate expression:

$$F = \kappa \frac{\int dV_1\, \rho_1(t) \times \int dV_2\, \rho_2(t)}{|\mathbf{r}(t) - \mathbf{r}'(t)|^2}. \tag{2}$$

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 247–256, 2022. https://doi.org/10.1007/978-3-031-10464-0_16
Because the right-hand side depends on time, it is directly linked to the well-known Newton's second law, which establishes a kind of kinematics as well as dynamics. Since $dv/dt = (dv/ds)(ds/dt) = v\, dv/ds$, one has:

$$F = \kappa \frac{\int dV_1\, \rho_1(t) \times \int dV_2\, \rho_2(t)}{|\mathbf{r}(t) - \mathbf{r}'(t)|^2} = M v(t) \frac{dv(t)}{ds}. \tag{3}$$

As is known, Eq. 3 establishes that electric forces produce a well-defined dynamics [3,4] on a body of mass M through a velocity v(t) whose vector varies in time according to its acceleration. Clearly one has a well-defined dynamics, so that one can make predictions. This follows from the inherent character of Newton-like physics, which is fully deterministic. In this manner, given a charged body under electrical interaction, one can wonder: do classical systems exhibit any kind of learning? [5]. Since the time variation explicitly defines a certain trajectory, one can attribute a set of physical parameters that allows one to estimate the past and future of the charged body. During the space-time propagation, one would expect the charged body to expend energy in the sense of:

$$W = \int \mathbf{F}\, d\mathbf{s} = \kappa \int \frac{\int dV_1\, \rho_1 \times \int dV_2\, \rho_2}{|\mathbf{r}(t) - \mathbf{r}'(t)|^3} \left(\mathbf{r}(t) - \mathbf{r}'(t)\right) d\mathbf{s}. \tag{4}$$

From this emerges the question: how should the charges and positions of the system be handled to find the best value of force and energy? In this paper, the criteria of Tom Mitchell [6–8] are used, given by:
– Task: all systems have tasks that justify their existence;
– Performance: in order to solve the task, the system has to apply a well-established performance;
– Experience: when the task is solved, the system can claim that it underwent a kind of learning process [9], a fact that should be demonstrated in subsequent tasks [10].
Clearly, one can apply these principles as a solution strategy in electricity problems. Although one can find more physics equations, one should identify the best one in order to improve the usage of the system's resources, in either theoretical or computational processes, which must converge on a concrete solution to the problem [11].
In this manner one finds feasible the implementation of a computational algorithm that allows either an approximate or a closed-form solution without loss of generality. Thus one can see the feasibility of passing from a mathematical solution to one based on algorithms. This has advantages from the point of view of computation theory, in the sense that one can impose the condition that the energy cannot reach values that would work against the efficiency of the system. In the second section, the electricity theory and the main equations are presented [12]. In the third section, the Machine Learning implementation is carried out. Finally, the conclusion of the paper is drawn.
Hamiltonian Mechanics
2 The Physics Model
Consider a well-defined electrically charged compound whose diffusion in a space of radial symmetry is dictated by the following equation:

$$\frac{d\rho(r,t)}{dt} = D\nabla^2 \rho(r,t) = D \frac{d^2\rho(r,t)}{dr^2}, \tag{5}$$

also known as the diffusion equation, with D the diffusion constant. With this equation one might search for:
– the best space-time propagation,
– the optimal trajectory that minimizes the system energy,
– the best algorithm that improves the efficiency of the system in terms of energy.
A trivial solution for the charge density is given by the integral equation:

$$\rho(r,t) = \int D\nabla^2\rho(r,t)\,dt = D\int \frac{d^2\rho(r,t)}{dr^2}\,dt, \tag{6}$$

so that the charge is obtained by volume integration of both sides of Eq. 6, which can be written in a crude form as:

$$Q = \int \rho(r,t)\,dV = D \int\!\!\int \frac{d^2\rho(r,t)}{dr^2}\,dt\,dV, \tag{7}$$

therefore one can form the electrical force equation in a straightforward manner, given by:

$$\mathbf{F} = \kappa D^2 \frac{\left[\int\!\!\int \frac{d^2\rho_1(r,t)}{dr^2}\,dt\,dV\right]\left[\int\!\!\int \frac{d^2\rho_2(r,t)}{dr^2}\,dt\,dV\right](\mathbf{r}-\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|^3}, \tag{8}$$

under the assumption that both ρ1 and ρ2 are under diffusion inside the same medium.

2.1 The Electric Field
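From Eq. 8 one can derive the electric field, given by:

$$\mathbf{E} = \frac{\kappa D^2 \left[\int\!\!\int \frac{d^2\rho_2(r,t)}{dr^2}\,dt\,dV\right](\mathbf{r}-\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|^3}. \tag{9}$$

One should note the identity (a mathematical relation that links two different expressions) given by:

$$\nabla \frac{1}{|\mathbf{r}-\mathbf{r}'|} = -\frac{(\mathbf{r}-\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|^3}. \tag{10}$$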
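Thus by introducing Eq. (10) into Eq. (9) one arrives at:

$$\mathbf{E} = -\nabla\left[\frac{\kappa D^2 \int\!\!\int \frac{d^2\rho_2(r,t)}{dr^2}\,dt\,dV}{|\mathbf{r}-\mathbf{r}'|}\right], \tag{11}$$

a step that allows one to write down the electric potential in the form:

$$\Phi = \left[\frac{\kappa D^2 \int\!\!\int \frac{d^2\rho_2(r,t)}{dr^2}\,dt\,dV}{|\mathbf{r}-\mathbf{r}'|}\right], \tag{12}$$

which constitutes the electric potential generated by a charge distribution given by $\int\!\!\int \frac{d^2\rho_2(r,t)}{dr^2}\,dt\,dV$. From Eq. 7 one can write down the simplest form of an electrodynamics energy, given by:

$$\mathcal{E} = \frac{\left[D\int\!\!\int \frac{d^2\rho_1(r,t)}{dr^2}\,dt\,dV_1\right]\left[D\int\!\!\int \frac{d^2\rho_2(r,t)}{dr^2}\,dt\,dV_2\right]}{|\mathbf{r}-\mathbf{r}'|}. \tag{13}$$

With this one can see that there is a kind of energy density, which can be written as:

$$\frac{\mathcal{E}}{V_1} = \frac{\kappa^2 D^2 \left[\int \frac{d^2\rho_1(r,t)}{dr^2}\,dt\right]\left[\int\!\!\int \frac{d^2\rho_2(r,t)}{dr^2}\,dt\,dV_2\right]}{|\mathbf{r}-\mathbf{r}'|}. \tag{14}$$

In this way one can write down the energy per unit of volume and charge as:

$$\frac{\mathcal{E}}{Q_2 V_1} = \frac{\kappa^2 D^2 \int \frac{d^2\rho_1(r,t)}{dr^2}\,dt}{|\mathbf{r}-\mathbf{r}'|}. \tag{15}$$

2.2 The Usage of Divergence Theorem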
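One can write down the diffusion equation as:

$$\int \frac{d\rho(r,t)}{dt}\,dV = \frac{dQ}{dt} = D\int \nabla^2\rho(r,t)\,dV. \tag{16}$$

One can assume that there exists a well-defined geometry that allows one to write down the charge density as:

$$\rho(r,t) \equiv \frac{q}{V} = \frac{q}{Ar}, \tag{17}$$

so that inserting this into Eq. 16 gives:

$$\frac{dQ}{dt} = D\int \nabla\cdot\nabla\frac{q}{Ar}\,dV = \frac{D}{A}\int \nabla\cdot\nabla\frac{q}{r}\,dV = \frac{D}{A}\int \nabla\cdot\nabla\Phi(r)\,dV = -\frac{D}{A}\int \nabla\cdot\mathbf{E}\,dV, \tag{18}$$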
where Eq. 18 is an exact integral of a divergence according to the divergence theorem, so that a straightforward evaluation yields:

$$\frac{dQ}{dt} = -\frac{D}{A}Q,$$

which is a simple differential equation for the electric charge. One can see that the left-hand side corresponds to an electric current. The solution for the electric charge is given by:

$$Q(t) = Q_0\,\mathrm{Exp}\left[-\frac{D}{A}(t-t_0)\right], \tag{19}$$

with Q0 the charge at t = t0. It should be noted that here one is dealing with a charge Q(t). The Coulomb force requires an additional charge q(t), such that at the same time t both exert a force on each other. For this charge the solution is given by:

$$q(t) = q_0\,\mathrm{Exp}\left[-\frac{D}{a}(t-t_0)\right], \tag{20}$$

which passes a certain area a, with a ≠ A. Taking t0 = 0, the Coulomb force is then:

$$F(t) = \frac{\kappa Q_0 q_0\,\mathrm{Exp}\left[-\frac{D}{A}t\right]\mathrm{Exp}\left[-\frac{D}{a}t\right]}{z^2}, \tag{21}$$

$$F(t) = \frac{\kappa Q_0 q_0}{z^2}\,\mathrm{Exp}\left[-Dt\left(\frac{1}{a}+\frac{1}{A}\right)\right]. \tag{22}$$

With this one can write down the time derivative of this force:

$$\frac{dF(t)}{dt} = -\frac{\kappa Q_0 q_0 D}{z^2}\left(\frac{1}{a}+\frac{1}{A}\right)\mathrm{Exp}\left[-Dt\left(\frac{1}{a}+\frac{1}{A}\right)\right]. \tag{23}$$

With the change:

$$\gamma = \kappa Q_0 q_0 D\left(\frac{1}{a}+\frac{1}{A}\right), \tag{24}$$

Eq. 23 can be written as:

$$\frac{dF(t)}{dt} = \gamma\left[-\frac{1}{z^2}\right]\mathrm{Exp}\left[-Dt\left(\frac{1}{a}+\frac{1}{A}\right)\right]. \tag{25}$$

Equation 25 is actually of the form γ g(z) h(t). When both sides are integrated over z one gets:

$$\frac{d}{dt}\int F(t)\,dz = \gamma\int\left[-\frac{dz}{z^2}\right]\mathrm{Exp}\left[-Dt\left(\frac{1}{a}+\frac{1}{A}\right)\right]. \tag{26}$$
One can see that Eq. 26 constitutes the instantaneous power of the system, commonly defined as the change of work done per unit time. One gets $P = \frac{d}{dt}\int F(t)\,dz$ because $W = \int F(t)\,dz$. Therefore Eq. 26 is written as:

$$P = \gamma\int\left[-\frac{dz}{z^2}\right]\mathrm{Exp}\left[-Dt\left(\frac{1}{a}+\frac{1}{A}\right)\right], \tag{27}$$

which is solved in a straightforward manner when the system carries out work between the spatial points ℓ and L. One then arrives at:

$$P = \gamma\left[\frac{1}{L}-\frac{1}{\ell}\right]\mathrm{Exp}\left[-Dt\left(\frac{1}{a}+\frac{1}{A}\right)\right]. \tag{28}$$
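The step from Eq. 27 to Eq. 28 can be checked numerically. The sketch below reads Eq. 28 as P = γ (1/L − 1/ℓ) Exp[−Dt(1/a + 1/A)], with ℓ and L the integration limits in z; this reading, the function names and all numerical values are assumptions made for illustration.

```python
import math


def power(t, L, ell, a, A, D, gamma):
    """Eq. 28 as read here: P = gamma * (1/L - 1/ell) * exp(-D t (1/a + 1/A))."""
    return gamma * (1.0 / L - 1.0 / ell) * math.exp(-D * t * (1.0 / a + 1.0 / A))


def geometric_factor_numeric(ell, L, steps=100_000):
    """Midpoint-rule estimate of the integral of -dz / z**2 from ell to L (Eq. 27)."""
    h = (L - ell) / steps
    return sum(-h / (ell + (i + 0.5) * h) ** 2 for i in range(steps))


# Hypothetical values in arbitrary units, chosen only for illustration.
ell, L = 1.0, 2.0
a, A, D, gamma = 1.0, 4.0, 0.1, 1.0

closed_form = 1.0 / L - 1.0 / ell
numeric = geometric_factor_numeric(ell, L)
print(f"closed form: {closed_form:.6f}, numeric: {numeric:.6f}")
for t in (0.0, 1.0, 10.0):
    print(f"t = {t:5.1f}  P = {power(t, L, ell, a, A, D, gamma):.6f}")
```

The magnitude of P decays with the rate D(1/a + 1/A), which is the quantity constrained in the Task criterion of the next section.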
Fig. 1. Sketch of a finite electrical system showing the diffusion of two charges experiencing repulsion forces inside a frustum of well-defined geometry in an instantaneous time t exhibiting the values of areas a and A.
3 The Machine Learning Implementation
From Eq. 23 one can see that the entire system is repulsive, as noted by the sign “−”. The system is sketched in Fig. 1. There one sees that the charges q and Q are under diffusion, passing through the areas a and A. At time t both experience the repulsion at the distance z that separates them [13]. Commonly, electric systems aim to minimize their energy in search of an acceptable efficiency. In this manner the Mitchell criteria for this system can be defined as follows:

3.1 Task
The system might optimize its power in time, which requires that the power falls slowly in time. Thus one needs:

$$D\left(\frac{A+a}{aA}\right) \ll 1, \tag{29}$$

$$\frac{1}{L} - \frac{1}{\ell} \ll 1. \tag{30}$$
In other words, the geometrical parameters are required to be large. For example, from Eq. 29, aA ≫ D, while L ≫ 1.

3.2 Performance
With the conditions given above, one can search for the best strategy as long as the power remains minimal [14]. One logical way out of this problem is to impose the condition that L . (31)

An interesting scenario is the inclusion of a small displacement ΔL of the charge Q, which can be perceived as an error. Thus one gets:

$$P = \gamma\left[\frac{1}{L+\Delta L}-\frac{1}{\ell+\Delta\ell}\right]\mathrm{Exp}\left[-Dt\left(\frac{1}{a}+\frac{1}{A}\right)\right]. \tag{32}$$

The implementation of errors also shifts the values of the areas of the frustum, so that one has:

$$P = \gamma\left[\frac{1}{L+\Delta L}-\frac{1}{\ell+\Delta\ell}\right] \times \mathrm{Exp}\left[-Dt\left(\frac{1}{a+\Delta a}+\frac{1}{A+\Delta A}\right)\right]. \tag{33}$$

3.3 Experience
For large times one expects that Eq. 33 behaves as:

$$P = \gamma\left[\frac{\ell+\Delta\ell-L-\Delta L}{(\ell+\Delta\ell)(L+\Delta L)}\right], \tag{34}$$

resulting in a physical system that now depends entirely on the geometrical parameters. The case of small errors that acquire the same value Δ leads one to write:

$$P = \gamma\left[\frac{\ell-L}{(\ell+\Delta)(L+\Delta)}\right], \tag{35}$$

3.4 Electrical Oscillations
Consider the case where the charges q and Q are subject to oscillations because of the repulsive character of the system (see for example [15]). Then Eq. 35 can be expressed as:

$$P = \gamma\left[\frac{\ell-L}{\ell L+(\ell+L)\Delta+\Delta^2}\right], \tag{36}$$

so that one can impose sinusoidal functions that depend on the longitudinal variable z:

$$P_S = \gamma\left[\frac{\ell-L}{\ell L+(\ell+L)\,\mathrm{Sin}(z)+\mathrm{Sin}^2(z)}\right], \tag{37}$$

$$P_C = \gamma\left[\frac{\ell-L}{\ell L+(\ell+L)\,\mathrm{Cos}(z)+\mathrm{Cos}^2(z)}\right]. \tag{38}$$
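A short numerical sketch of Eqs. 36 to 38 follows, reading the lost symbol in the extracted formulas as ℓ. The function names and the values of ℓ, L and γ are hypothetical and serve only to show how the two schemes can be evaluated and compared.

```python
import math


def power_with_error(delta, ell, L, gamma=1.0):
    """Eq. 36: power when both lengths carry the same error delta."""
    return gamma * (ell - L) / (ell * L + (ell + L) * delta + delta ** 2)


def power_sin(z, ell, L, gamma=1.0):
    """Eq. 37: the error is replaced by sin(z)."""
    return power_with_error(math.sin(z), ell, L, gamma)


def power_cos(z, ell, L, gamma=1.0):
    """Eq. 38: the error is replaced by cos(z)."""
    return power_with_error(math.cos(z), ell, L, gamma)


ell, L = 2.0, 3.0  # hypothetical lengths, arbitrary units
for z in (0.0, math.pi / 4, math.pi / 2, math.pi):
    print(f"z = {z:.3f}  P_S = {power_sin(z, ell, L):+.4f}  P_C = {power_cos(z, ell, L):+.4f}")
```

The two schemes coincide wherever sin(z) = cos(z) and are shifted copies of each other elsewhere, which is what makes switching between them possible.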
In Fig. 2, the power of the electrical oscillations from Eqs. 37 and 38 can now be analyzed with the Mitchell criteria. In the upper panel, the “sin” case exhibits a minimum at π/2, while in the lower panel the “cos” case acquires its maximum values. Thus while
Fig. 2. The electric power, in arbitrary units, of two charges under diffusion inside a frustum, for 3 different values of ℓ + L, as a function of sinusoidal errors of the charge q. (Up) the “sin” case and (Down) the “cos” case. Plots were done with the package [20].
the system is going to maximize its power, switching to the alternative option can be advantageous, in the sense that the system does not expend unnecessary power (see [16]) but has the choice of tuning the best values of z inside the frustum. In this manner, while one has control over the position of charge Q, the problem of the best power strategy may be solved through the variation of errors, which might be large as in nonlinear systems [17], and through the efficient choice of scheme, either “sin” or “cos”. The sinusoidal functions are relevant in the sense that, because of their periodicity, the system can “remember” previous successful events, so that the learning is meaningful for future measurements of instantaneous power. The choice is dictated by the Mitchell criteria in a Machine Learning scenario, so that Machine Learning can also be seen as a tool to explore anomalies as well as new physics, as studied recently in High Energy Physics [18,19].
4 Conclusion
Throughout this paper, the classical electrodynamics equations were revisited in order to define the problem of two charges inside a frustum that expend electric power. The theoretical approach has shown that the efficiency of the power usage might be well dictated by the criteria of Mitchell within the territory of Machine Learning. The simulations have shown that the usage of sinusoidal errors can be translated into success, because the algorithm has the chance to switch in those cases of high power expenditure. In future work, the application of this theory to Atomic Physics, dealing with diatomic molecules, shall be studied.
References
1. Feynman, R., Leighton, R., Sands, M.: The Feynman Lectures on Physics (1964)
2. Feynman, R.P., Brown, L.M. (ed.): Feynman’s Thesis: A New Approach to Quantum Theory 1942/1948 (2005)
3. Schueler, J.: Green’s Functions and Their Applications to Quantum Mechanics. https://sites.math.washington.edumorrow/33611/papers/jeff.pdf
4. Huang, K.: Quantum Field Theory: From Operators to Path Integrals. Wiley, New York (1998)
5. Chao-Hua, Yu., Gao, F., Chenghuan Liu, D., Huynh, M.R., Wang, J.: Quantum algorithm for visual tracking. Phys. Rev. A 99, 022301 (2019)
6. Bishop, C.M. (ed.): Pattern Recognition and Machine Learning. ISS, Springer, New York (2006). https://doi.org/10.1007/978-0-387-45528-0
7. Mitchell, T.: Machine Learning. McGraw Hill, New York (1997). ISBN 0-07-0428077. OCLC 36417892
8. Wright, K.: Bringing quantum to machine learning. Physics 13, 179 (2020)
9. Haug, T., Dumke, R., Kwek, L.-C., Miniatura, C., Amico, L.: Machine-learning engineering of quantum currents. Phys. Rev. Res. 3, 013034 (2021)
10. Hsu, B.C., Berrondo, M., Van Huele, J.-F.S.: Stern-Gerlach dynamics with quantum propagators. Phys. Rev. A 83, 012109 (2011)
11. Bonato, C.A., Thomaz, M.T., Malbouisson, A.P.C.: Equivalence of the propagator of quasistatical solutions and the quantum harmonic oscillator. Phys. Rev. A 39, 2225 (1989)
12. Alageshan, J.K., Verma, A.K., Bec, J., Pandit, R.: Machine learning strategies for path-planning microswimmers in turbulent flows. Phys. Rev. E 101, 043110 (2020)
13. Hartle, J.B., Laflamme, R., Marolf, D.: Conservation laws in the quantum mechanics of closed systems. Phys. Rev. D 51, 7007 (1995)
14. Hioe, F.T., Eberly, J.H.: N-level coherence vector and higher conservation laws in quantum optics and quantum mechanics. Phys. Rev. Lett. 47, 838 (1981)
15. Mackey, H.J., Sybert, J.R., Hight, R.D.: Magnetomorphic oscillations in the electrical conductivity of cadmium cylinders. Phys. Rev. B 1, 2385 (1970)
16. Fredkin, D.R., Wilson, A.: Collective oscillations in a simple metal. II. Electrical conductivity. Phys. Rev. B 18, 6676 (1978)
17. Teitsworth, S.W., Westervelt, R.M., Haller, E.E.: Nonlinear oscillations and chaos in electrical breakdown in Ge. Phys. Rev. Lett. 51, 825 (1983)
18. Collins, J., Howe, K., Nachman, B.: Anomaly detection for resonant new physics with machine learning. Phys. Rev. Lett. 121, 241803 (2018)
19. Ghosh, A., Nachman, B., Whiteson, D.: Uncertainty-aware machine learning for high energy physics. Phys. Rev. D 104, 056026 (2021)
20. https://www.wolframalpha.com/
Marlo’s Networks of Expectations
Marcos Bautista López Aznar1(B), Guillermo Címbora Acosta1, and Walter Federico Gadea2
1 Department of Philosophy and Logic and Philosophy of Science, University of Seville, Seville, Spain
2 University of Huelva, Huelva, Spain
Abstract. Marlo’s networks of expectation have been developed from a heterogeneous reasoning perspective, allowing a perfect integration of visual, linguistic, chromatic, and numerical information. These networks are logical tree diagrams whose structure consists of nested sets formed by three types of logical nodes: Objects, Or, And. Any node can receive a numerical value that goes from −1 to +1 passing through zero, which can be translated to a chromatic value between red and blue, passing through yellow. These values correspond to natural language expressions: false, probably false, uncertain, probably true, true, or even absurd. That is, the network can be interpreted both through natural language propositions and through rigorous and precise mathematical formulas. In any case, we must bear in mind that cognitive systems, to generate adaptive expectations about resources and threats, have to encode both the qualities of the stimuli and the correlations of their presences (+1) and absences (−1). As a result, we have built a system in which any first-order logical inference can be represented using a very limited number of logic gates. Deductive, inductive, abductive, and even statistical inferences can be explained by the activation and inhibition relationships between the nodes of the sets that we propose here.
Keywords: Logic diagrams · Heterogeneous reasoning · Tree diagrams
1 Presence and Absence of Resources and Threats as the Basis for Decisions
The survival of organisms endowed with a cognitive system depends to a large extent on two fundamental decisions: the search for resources and the avoidance of threats. For this reason, the behavior of these complex organisms has been linked for millions of years to two basic emotions: fear and desire. Thus, necessarily, the first communication systems had to have signals that reported the presence of threats and resources, capable of causing the group members to activate in the right direction: approach or flee. It is easy to assume that effective communication systems must have the ability to identify stimuli as dangerous or attractive and, at the same time, the ability to report the degree of presence or absence of such stimuli. Only then can individuals of a species make their decisions considering the information transmitted by their relatives.
The original version of this chapter was revised: Author provided belated correction has been incorporated. The correction to this chapter is available at https://doi.org/10.1007/978-3-031-10464-0_64
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022
K. Arai (Ed.): SAI 2022, LNNS 507, pp. 257–277, 2022. https://doi.org/10.1007/978-3-031-10464-0_17
Based on the above, we have organized Fig. 1 around two continuums. The horizontal axis offers a scale that classifies a stimulus as a resource or threat. In it, the midpoint corresponds to neutral stimuli. Of course, the same stimulus can be judged in both categories, generating a conflict in the organism. For example, a large lizard can be considered by an eagle as a threat and a source of protein at the same time. The vertical axis offers a scale that goes from absolute absence to absolute presence, passing through an intermediate point in which an object is not considered by the organism as either present or absent. On our scale, this midpoint means that the stimulus is not related to the current situation and that therefore it will not be considered in decisions now. That is, the value of its influence on the decisions of the organism during the action in progress will be zero. Imagine that we are at home thinking about lions. At this moment lions are just theoretical objects that do not prevent us from going for a walk. However, if we see a lion through the window, then fear will make us decide to stay home safely. In the same way, when a mouse smells the air and does not detect the presence of any cat, the absence of such a predator makes it feel safe and therefore it can get out of the mousetrap. The presence of resources and the absence of threats cause security in the organism, that is, calm, which is represented with the blue color in the figure. However, the absence of resources or the presence of threats causes the cognitive system to start searching or trying to flee. In this case, we have represented the activation with red.
Fig. 1. Emotional activation due to the absence of resources and presence of threats [3].
In Fig. 1 we can see that the more present a resource is and the greater its value, the safer the organism feels. We also see that the more absent this resource is and the greater its value, the more fear this organism has. In the same way, the more present and the greater a threat is, the more fear the organism has, while the greater and more absent a threat is, the safer the organism feels. To create a simple version of the expectation networks we must forget the sentimental value of things, so we will only be left with the numerical value of their presence or absence. The continuum related to the certainty
that something is present or absent could be expressed in a heterogeneous way using colors ordered by their wavelength, which, in turn, could be associated with positive and negative numerical values. Figure 2 shows the correspondence of the chromatic continuum with the expressions of natural, formal, and mathematical language. It should be noted that on this scale, the primitive sense of truth corresponds to the presence of the stimulus, while falsehood corresponds to the absence; uncertainty is the intermediate value between the certainty of the “yes” and the “no”. This scale offers a heterogeneous logic [21] that allows us to solve inferences by operating indistinctly with numbers or colors. The differences between expressions of natural, chromatic, and formal language (mathematical or logical) are related on this scale to the complexity of the grades that can be transmitted [14]. Thus, chromatic and numerical language allow a greater number of nuances, which correspond to the degree of reliability of the affirmations. In this sense, the use of colors for the expression of different grades of certainty in logic is not something new [8, 11, 18–20], but our proposal has an internal logic that allows us to operate with these colors with the efficiency, precision, and rigor of numerical scales.
Fig. 2. RYB traditional color theory as a basis for chromatic inferences. Chromatic, numerical, linguistic, and formal heterogeneous correspondence scale [3].
We see in Fig. 2 that by combining blue and red we obtain violet, which has an assigned value of zero. Therefore, if someone tells us that object A is present and another person claims that A is absent, then we will not know whether A is present or not. The perplexity caused by a contradiction (violet) is not the uncertainty that accompanied us before asking for A (yellow), but mathematically the certainty value seems the same for both colors (zero). Also, we give the intersection of all colors a non-existence value. We must realize that it is not the same to grant a possibility a value of provisional falsehood (red = −1) as to grant it a value of nonexistence (black = ).
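The correspondence between numbers, colors, and natural language can be sketched in a few lines of Python. This is only an illustrative reading of the scale: the function names are hypothetical, green for +0.5 is taken from the later exercises, and orange for −0.5 is an assumption based on RYB mixing of red and yellow.

```python
# Anchor points of the chromatic scale. Red, yellow and blue come from the
# text; green (+0.5) appears in the later exercises; orange (-0.5) is an
# assumption based on RYB mixing of red and yellow.
SCALE = [
    (-1.0, "red", "false"),
    (-0.5, "orange", "probably false"),
    (0.0, "yellow", "uncertain"),
    (0.5, "green", "probably true"),
    (1.0, "blue", "true"),
]

def describe(value):
    """Return the (colour, label) of the anchor closest to a certainty value."""
    if not -1.0 <= value <= 1.0:
        raise ValueError("certainty values must lie between -1 and +1")
    _, colour, label = min(SCALE, key=lambda anchor: abs(anchor[0] - value))
    return colour, label

def combine_testimonies(first, second):
    """Combine two certain reports (+1 or -1) about the same object.

    Contradictory reports yield violet: numerically zero, like yellow, but
    recording perplexity rather than prior uncertainty.
    """
    if {first, second} == {1, -1}:
        return 0.0, "violet"
    if first == second and first in (1, -1):
        return float(first), describe(first)[0]
    raise ValueError("this sketch only handles reports of +1 or -1")

print(describe(1))                   # blue, true
print(combine_testimonies(1, -1))    # zero, but violet rather than yellow
```

Keeping the color next to the number is what lets the sketch tell violet from yellow even though both carry the value zero.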
We consider our proposal heir to the Barwise and Etchemendy Heterogeneous Reasoning Project. These authors have emphasized that reasoning research must include non-symbolic forms of representation and they have designed a computer program called “Tarski’s World” that combines visual and verbal information to solve logical problems [5]. For their part, Marlo’s networks of expectations have been developed as tools for the didactics of logic since 2015, as an extension of Marlo’s diagrams [1]. In its development, we combined, among other things, first-order logic, probability, and the way in which our students generated their expectations about the presence or absence of an object from the presence or absence of another object with similar qualities [2–4].
2 Combinatorics as the Basis of Tree Diagrams: Representation of Simple and Complex Propositions
Logic is for us the a priori formal structure that allows us to organize and share the experience that we acquire in our daily interactions with the world. We start from a combinatorial structure that allows us to generate and share sets of possibilities with other cognitive systems. These sets of possibilities make up the subjective world in which we live. As long as we can identify and discriminate different types of stimuli, we can learn through association mechanisms. And if we also have a communication system, we can share this information with other cognitive systems. To explain the universe of possibilities that cognitive systems employ when reasoning we distinguish the following components in our networks: (1) criteria; (2) variables or qualities; (3) Object nodes (individuals or specific elements); (4) Or nodes; (5) And nodes. We describe variable or quality as any difference between stimuli that can be thought or perceived considering a criterion “a”. For example, from the criterion of beauty “b”, we can distinguish beautiful things (variable = b) from ugly things (variable = ¬b). Variables work like propositional letters, but we find it useful to set the criteria as the first element of the logical trees, and by distinguishing criteria from variables we can affirm propositions such as There is something beautiful (B), There is something that is not beautiful (¬B) and There is something that may or may not be beautiful (B). By combining qualities such as a, ¬b, c, etc., cognitive systems try to generate adaptive representations of the world. For example, an animal that feeds on other animals may be interested in the criteria “be dangerous” and “be nutritious”. For this animal, there would only be four types of organisms to deal with: dangerous and nutritious; dangerous but not nutritious; non-dangerous but nutritious, and, lastly, neither dangerous nor nutritious.
However, between not being at all dangerous and being totally dangerous, we can establish multiple degrees of being. And the same happens with the criterion “n” be nutritious (see Fig. 3). An element, individual, specific object, or Object node of the system is a unique and distinctive set of qualities (or variables), which is formed by combining all the criteria that make up the universe of discourse [7]. So, with the criteria a, b, and c considered dichotomously we can generate eight individuals made up of three qualities each: abc, ab¬c, a¬bc, a¬b¬c, ¬abc, ¬ab¬c, ¬a¬bc, ¬a¬b¬c. The more nuances or degrees of being a world possesses, the more complex it is and the higher the cognitive cost of processing information in it [3]. In any case, what we are interested in highlighting
now is that any universe of discourse, regardless of its complexity, is generated through the combination of qualities and that this combination results in Object nodes with a unique and distinctive identity. The total number of theoretical elements that make up the universe of discourse will depend on the number of criteria “c” and the number of divisions “d” we make in them: d^c. We call each of the divisions of a criterion a variable when solving propositional logic, and a term when we solve syllogisms. However, there is no difference between the representation of variables and terms in our diagrams: they are all something like thinking bricks that allow the construction, conversion, and transformation of language propositions [1].
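The count d^c can be checked by generating the Object nodes directly. The following Python sketch is only an illustration; the function names and the encoding of degrees as numbers are assumptions, not part of the diagrams themselves.

```python
from itertools import product

def object_nodes(criteria, degrees):
    """List every Object node obtained by giving each criterion one degree."""
    return [tuple(zip(criteria, combo))
            for combo in product(degrees, repeat=len(criteria))]

def label(node):
    """Render a dichotomous node such as (('a', 1), ('b', 0)) as 'a¬b'."""
    return "".join(name if degree == 1 else "¬" + name for name, degree in node)

# Three criteria with two degrees each: 2^3 = 8 individuals.
dichotomous = object_nodes("abc", degrees=(1, 0))
# Two criteria with three degrees each (1, 0.5, 0): 3^2 = 9 individuals.
trichotomous = object_nodes("ab", degrees=(1, 0.5, 0))

print([label(node) for node in dichotomous])
print(len(dichotomous), len(trichotomous))
```

The first list reproduces the eight individuals abc, ab¬c, a¬bc, a¬b¬c, ¬abc, ¬ab¬c, ¬a¬bc, ¬a¬b¬c in the order given in the text.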
Fig. 3. Measurement scales. Combination of qualities in tree diagrams [3].
In Fig. 3 we can see two tree diagrams that combine the same criteria “a” and “b”. The first establishes two degrees of being, that is, two qualities or variables for each criterion: a, ¬a. The second tree establishes three degrees of being: 1a, 0,5a, and 0a. Hence, the first diagram contains four elements and the second nine. The simplest version of these networks is dichotomous and lacks the Or and And nodes. Based on it, we have designed infographics (see Fig. 4) to teach logic that allow us to solve syllogisms and propositional calculus exercises that only contain simple propositions of the type A → B or conjunctions of simple propositions of the type (A → B) ∧ (C ∨ B). In all these
cases, it is enough to eliminate certain regions, depending on the premises, to reach the conclusions by discarding.
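Reaching conclusions by discarding can be sketched as a filter over the Object nodes. The example below uses the formula (A → B) ∧ (C ∨ B) mentioned in the text; the variable and function names are illustrative assumptions.

```python
from itertools import product

VARIABLES = ("a", "b", "c")

def implies(p, q):
    return (not p) or q

# Each premise is a test that an Object node must pass to stay in the tree.
premises = [
    lambda n: implies(n["a"], n["b"]),   # A → B removes the a¬b branch
    lambda n: n["c"] or n["b"],          # C ∨ B removes the ¬c¬b branch
]

def label(node):
    return "".join(v if node[v] else "¬" + v for v in VARIABLES)

nodes = [dict(zip(VARIABLES, values))
         for values in product((True, False), repeat=3)]
survivors = [label(n) for n in nodes
             if all(premise(n) for premise in premises)]
print(survivors)
```

Whatever holds in every surviving node can be asserted as a conclusion; here no single quality is shared by all five survivors, so the premises alone decide nothing about a, b, or c taken separately.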
3 Solved Exercises in a Simple Version of the Network
In Fig. 4 we can see a propositional calculus exercise with three premises. The third premise affirms the existence of “P” and therefore we have activated the node P with a value equal to +1 = blue. Since the path of p¬q has been eliminated in premise one and the possibility of ¬r¬q has been eliminated in premise two, it follows that the activation asserted in premise three is transmitted from P to PQ and PQR nodes. Therefore, we affirm in the conclusion that it is true QR (PQR = blue = +1).
Fig. 4. Exercise of propositional calculus with simple propositions [3].
If we want to work with what Boole and the tradition called secondary or complex propositions of the kind (A → B) → (C → D) or (A B) → (C D), then we must represent connections that are not established naturally by combinatorics as occurs in simple propositions. These complex propositions are not limited to establishing what things can be, but they lead us to the question of correlations between the presences and absences of elements. That is, we consider the combinatorial structure of simple propositions as the a priori foundation on which connections based on learning can be added. These a posteriori connections would be the basis of the complex propositions. If the reader looks at the nodes of the network in Fig. 5, he will observe that each one of them has a tab at the top and another at the bottom. The upper tab of a node allows us to express a condition (IF). The bottom tab allows us to express a consequence (THEN).
Fig. 5. Exercise of propositional calculus with complex propositions [3].
In this way, we can relate nodes not located in the same branch of the network. These tabs serve us to represent the connections that establish complex propositions such as the one that appears in the first premise (p ∨ r) → (q ∨ p). This premise states that if ¬p¬r is false, then ¬p¬q is false. This complex premise does not say that ¬p¬r is false now and therefore we have to add the condition tab IF to the nodes ¬p¬r. The consequence (THEN) of ¬p¬r is false will be that the node ¬p¬q is also false. That is, IF ¬p¬r = −1 (red) THEN ¬p¬q = −1 (red). As the second premise establishes that we do not have ¬r, the IF of the first premise is activated. And when an IF is fulfilled, the value attributed to THEN is transmitted to the node that carries said tab. This leads to the falseness of the node ¬p¬q. The third premise establishes the truth of ¬p, then ¬P = +1 = blue. This truth necessarily propagates towards ¬PQ and from there towards ¬PQR, since the rest of the possible alternatives are false nodes based on premises 1 and 2. We can add conditions to the network such as “it is enough to know that Peter is probably going to the party to infer that it is certain that his ex-wife will not go”. We must accept that the logical conjunction does not eliminate possibilities but rather affirms the existence of a combination of objects or qualities. As we have explained on other occasions [1, 3], the laws of propositional calculus allow us to find out what a thing is, and a thing cannot be A and ¬A at the same time (A ¬A). So, if the conjunction AB is true, then A¬B, ¬AB, and ¬A¬B are false. And because of this, it happens in truth tables that if (A ∧ B) is true, then it is true that (A → B). However, expectation networks allow us to discover what kinds of things are present or absent here and now. And in this case, the conjunction cannot be interpreted as eliminating the rest of the possibilities. 
On the contrary, it can be true at the same time that A and ¬A are present here and now (A ∨ ¬A). If we want to respect the laws of propositional calculus, we need only establish that the number of elements present here and now is equal to one. Nevertheless, proceeding
in our way, we can translate the propositions into color and numerical codes consistent with common sense.
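The IF and THEN tabs of the exercise in Fig. 5 can be sketched as a rule that fires once its condition is met. This Python fragment is a reading of the prose only: the helper names are hypothetical, and the second premise is assumed to mean that every branch containing ¬r is false.

```python
from itertools import product

VARIABLES = ("p", "q", "r")
TRUE, UNCERTAIN, FALSE = 1, 0, -1

def label(node):
    return "".join(v if node[v] else "¬" + v for v in VARIABLES)

nodes = [dict(zip(VARIABLES, vals)) for vals in product((True, False), repeat=3)]
value = {label(n): UNCERTAIN for n in nodes}

def set_false(matches):
    for n in nodes:
        if matches(n):
            value[label(n)] = FALSE

def all_false(matches):
    return all(value[label(n)] == FALSE for n in nodes if matches(n))

# Premise 2 (assumed reading): there is no ¬r, so every ¬r branch is false.
set_false(lambda n: not n["r"])

# Premise 1, (p ∨ r) → (q ∨ p): IF the ¬p¬r branches are false,
# THEN the ¬p¬q branches become false as well.
if all_false(lambda n: not n["p"] and not n["r"]):
    set_false(lambda n: not n["p"] and not n["q"])

# Premise 3: ¬p is present, so its truth flows to the ¬p branches still open.
candidates = [label(n) for n in nodes
              if not n["p"] and value[label(n)] != FALSE]
if len(candidates) == 1:
    value[candidates[0]] = TRUE

print(candidates)
```

Only one ¬p branch remains open after the first two premises, so the activation of ¬P ends at ¬PQR, as in the exercise.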
4 The Alpha System
Before discussing the Alpha system, we must clarify that we do not use the term “set” in this article with the precise meaning that it has in axiomatized set theory, but in a pre-theoretical way. Our “naive set theory” pretends to be akin to common sense and the way we construct propositions in natural language. Let us remember that we have called the compound of all the qualities that define an entity as unique and distinct within the universe of discourse object or individual. So, in Alpha networks, we have object nodes that are sets of qualities (Object nodes) and we have sets of objects (And nodes). If we look at Image 1 in Fig. 6, we see the elementary set formed from a criterion, when we consider its qualities dichotomously as “being a” and “not being a”. In this way, we have two elements in the center of the set “a”: the “a element” and the “¬a element”. To the left of the elements, we have the “a Or node”. It is called Or node because it will be true as long as at least one of the object nodes is true and it will be false when both objects are false. The Or node represents any element of the set, but none in particular. To the right of the elements, we have the [a] And node, which will be true if and only if all the elements are true. It will be false when at least one element is false. The meaning of the Or and And nodes of Images 1, 2, and 3 in Fig. 6 is the same. The difference between these sets formed from the “a criterion” is only the number of divisions that we have established between not meeting the criterion at all and fully meeting it.
Fig. 6. Basic components of a set. 1. Elementary dichotomous set. 2. Elementary trichotomous set. 3. Set formed by decimal division [3].
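The truth conditions of the Or and And nodes described above can be written compactly if certainty is restricted to the three values +1, 0, and −1. Using the maximum and the minimum is an assumption of this sketch: it matches the stated conditions and leaves every other case uncertain.

```python
TRUE, UNCERTAIN, FALSE = 1, 0, -1

def or_node(objects):
    """True if at least one object is true; false only when all are false."""
    return max(objects)

def and_node(objects):
    """True only if every object is true; false as soon as one is false."""
    return min(objects)

# Elementary dichotomous set: the "a element" is present, the "¬a element" absent.
elements = [TRUE, FALSE]
print(or_node(elements), and_node(elements))

# With one element still uncertain, neither node can be settled as true.
print(or_node([UNCERTAIN, FALSE]), and_node([UNCERTAIN, TRUE]))
```

The same two functions apply unchanged to trichotomous or decimal sets, since only the number of central elements changes.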
We observe in Fig. 7 the steps we need to build the Alpha system. In this case, we start from three criteria a, b, and c considered dichotomously. For this reason, in the first
Fig. 7. Steps to construct subsets of sets of the alpha system as a universal set [3].
step of Fig. 7, we have eight unique and distinct individuals, each one defined by three qualities. The second step groups the objects two by two, attending to the qualities that they share and that give the Or node its name. In the third step, we have generated two large sets: “a” and “¬a”, which in turn contain two subsets each. Steps four and five show Alpha from the perspective of “a criterion” and “b criterion”, respectively. Alpha contains all the elements that we can think of respecting the principle of noncontradiction. These elements can be affirmed or denied with a greater or lesser degree of certainty. Figure 8 represents elementary inferences. If the reader takes the time to review it, he will see that our proposal fits with the Aristotelian Opposition Square in which the relationships between the four basic categorical propositions are represented. We can see that particular affirmative (I) and negative (O) propositions can be true at the same time, but not a universal negative (E) and a particular affirmative (I), or a universal affirmative (A) and a particular negative (O). Of course, (A) and (E) can be false at the same time, but not true. It is true in our networks that (I) and (O) can be uncertain at the same time, but once the net is activated by a stimulus, it must be true that there is something present (I) or something absent (O). Perhaps now is the time to recall the distinction between the Boolean and Jevons principles of duality, which is difficult to appreciate in dichotomous systems. When Boole established that every proposition must be true
Fig. 8. Elementary inferences in fundamental sets. What we know [3].
or false, the values 1 and 0 represented the certainty of truth and falsehood respectively [7], similar to how our values between −1 and +1 do. However, the Jevons duality states that anything with the quality A must be B or ¬B [12, 13]. That is not the same. We will see the differences better in a trichotomic system. Starting at 0.5A, Jevons would say that this should be 0.5A_1.B, 0.5A_0.5B, or 0.5A_0.B. However, Boole would still say that 0.5A must be true (1) or false (0). That is why we distinguish between degree of adequacy or not to a criterion (Jevons) and degree of certainty (Boole). To compare Fig. 8 with the Aristotelian Opposition Square, we must forget the Jevons duality and the truth tables, in which being (a or ¬a) and being present/absent are not distinguished. The multiple degrees of being and the multiple degrees of certainty that we accept in our networks bring our proposal close to fuzzy logic. But we think it is closer to probabilistic logic and epistemic modal logic. However, it is difficult to label the logic we do without controversy.
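Returning to the construction steps of Fig. 7, the grouping of the eight individuals into subsets named after their shared qualities can be sketched as follows. The function names are hypothetical, and grouping on a and b in the second step is an assumption about which pairing the figure shows.

```python
from itertools import product

CRITERIA = ("a", "b", "c")

def label(qualities):
    return "".join(name if present else "¬" + name for name, present in qualities)

# Step 1: the eight Object nodes, each defined by three qualities.
objects = [tuple(zip(CRITERIA, vals)) for vals in product((True, False), repeat=3)]

def subsets(shared):
    """Group the objects by the qualities they share on the `shared` criteria."""
    groups = {}
    for obj in objects:
        key = label((name, present) for name, present in obj if name in shared)
        groups.setdefault(key, []).append(label(obj))
    return groups

pairs = subsets(("a", "b"))   # step 2: four sets of two objects each
halves = subsets(("a",))      # step 3: the two large sets "a" and "¬a"

print(pairs)
print({name: len(members) for name, members in halves.items()})
```

Changing the argument of `subsets` to `("b",)` gives the same universe seen from the perspective of the “b criterion”, which is what steps four and five of the figure contrast.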
5 Linguistic Ensembles of Cognitive Cells
As we all know, organisms have to make decisions in conditions of uncertainty in their daily life. That is why we have also tried to simulate in the expectation networks how the cognitive system could fill the gaps in its knowledge when making decisions. Consequently, the nodes of our networks distinguish between two types of expectations: based on beliefs and based on knowledge. We are not interested in deep epistemological disputes now: we are just trying to distinguish the things we really know from the thoughts we assume. For example, if I flip a coin, you can bet heads or tails with a 50/50 chance of winning. This is Knowledge, an expectation based on current evidence (starting point = activation of the And or Or nodes). On the other hand, if after I ask out nine girls who reject me, I think the tenth will reject me as well, this is a Belief: an expectation based only on previous or similar experiences, but without sufficient reasons to sustain it now (starting point = activation of Object nodes). Beliefs make us suffer from irrational phobias and the like, but they also give us a sense of security and help us make decisions. All information related to beliefs and knowledge is encoded in what we will call cognitive cells. Each cognitive cell is a node of the network that communicates with other nodes forming sets. In fact, the network itself is an assemblage of sets that can be reconfigured to find regularities in the phenomena. The relationships that are established between nodes are the basis of inference, both of knowledge and beliefs. We have established three types of cognitive cells to simulate inferences that occur in natural language and that are fundamental to first-order logic: Or nodes, Object nodes, and And nodes. These connections of nodes, which form linguistic ensembles, allow us to simulate reasonings of cognitive systems, some of which could be considered fallacies.
But many times, a fallacy is better than absolute uncertainty. In Fig. 9 we show an example of a basic linguistic ensemble. We can see in the image in which regions of the different types of nodes different types of propositions are encoded. To understand the basis of the network, we must consider now just one aspect of natural language: its ability to name generally, specifically, and collectively. It is not equivalent to saying Some students are absent, Peter is absent, All students are
absent, although, as we said above, there are relationships between the truth of these propositions that have interested logicians since the time of Aristotle [6]. Or nodes contain expectation values about the presence or absence of things considered generically. These propositions usually begin with terms such as “at least one”, “some”, “something”, and the like: at least one of the twins is at home; some angry people are in the pub, etc. In propositional calculus, an Or node corresponds to the inclusive disjunction (a ∨ b), which is true when at least one of its elements is true. The outline of a cognitive cell that functions as an Or node contains beliefs of this type, while the nucleus contains knowledge related to disjunctions (see Fig. 9). Object nodes contain expected values about the presence or absence of specifically considered things. For example: Peter is at home, Your cat is missing, There are onions in the fridge, etc. Of course, the consideration of a term as a particular individual or a generic name is relative and depends on the need for further analysis or not. Generic names work as Or nodes. Finally, the And nodes contain expected values about the presence here and now of all the elements that form sets, wholes, aggregates, and the like.
Fig. 9. Linguistic ensembles of cognitive cells [3].
6 Principles of the Propagation of Truth and Falsehood
An element can exist, not exist, or remain uncertain. Uncertain elements may or may not exist, but only elements that exist here and now may be present or absent in the system. Let’s think that both the lack of fuel and the presence of a tree on the road force me to stop the car in a similar way. Each node of the network (Or, Object, And) has a positive activation or true charge >0 and ≤1, and a negative inhibition or false charge 0). In case I we start from knowing that there is not everything (And node = −1 = red) and we conclude that perhaps there is nothing (Or node = −0,5). Cases E and J in Fig. 11 show second inference cases with assumed values. The positive inference from objects to And nodes, as well as the negative inference from objects to Or nodes, should be considered beliefs based on the particular experience of the cognitive system.
Fig. 11. Deductive, inductive, probabilistic, and statistical reasoning [3].
7 Some Exercises Solved Using the Expectations Network
To understand the notation of Figs. 12 and 13, we must bear in mind that our notation is a renewed version of the doctrine of the Quantification of the Predicate followed by authors like Stanhope and Jevons, who were the first to build logic machines as early
as the 19th century. In our system, a proposition expresses the association between a part or the whole of a “term” with a part or the whole of another. We add the subscript x when a term is taken universally. If it is taken in a particular way, we do not add any subscript. For instance, when we say If you are a mammal then you are a vertebrate we are speaking about all mammals (mₓ), but only about part of the vertebrates (v). So, we write mₓ v. Any formula can be converted and transformed. When we convert, we only change the place of the variables. When we transform, we change the quality of both variables and permute the subscript x (see Table 1).
Table 1. Propositions, conversion, and transformation. Elemental notation and Marlo diagrams notation.
Elemental notation | Proposition | Conversion | Transformation | Conversion
A⇾B | mₓ v | v mₓ | ¬vₓ ¬m | ¬m ¬vₓ
 | If you are a mammal, you are a vertebrate | Only vertebrates can be mammals | If you are not a vertebrate, you are not a mammal | Only what is not a mammal cannot be a vertebrate
(A B) | ¬aₓ b | b ¬aₓ | ¬bₓ a | a ¬bₓ
(A B) | ¬aₓ bₓ | bₓ ¬aₓ | ¬bₓ aₓ | aₓ ¬bₓ
A⇿B | aₓ bₓ | bₓ aₓ | ¬bₓ ¬aₓ | ¬aₓ ¬bₓ
A NAND B | aₓ ¬b | ¬b aₓ | bₓ ¬a | ¬a bₓ
A∧B | AB | BA | – | –
∧ | “;” | – | – | –
A is absent | A | All A is absent | [A] | –
A is present | A | All A is present | [A] | –
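Conversion and transformation are mechanical enough to be sketched in Python. The classes and function names below are illustrative assumptions; the subscript x is stored as a flag and printed as ₓ. The sketch reads the Transformation column of Table 1 as the transformation of the already converted formula, which is consistent with every row of the table.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Term:
    name: str
    negated: bool = False
    universal: bool = False   # True when the term carries the subscript x

    def __str__(self):
        return ("¬" if self.negated else "") + self.name + ("ₓ" if self.universal else "")

@dataclass(frozen=True)
class Proposition:
    first: Term
    second: Term

    def __str__(self):
        return f"{self.first} {self.second}"

def convert(p):
    """Change only the place of the two terms."""
    return Proposition(p.second, p.first)

def transform(p):
    """Change the quality of both terms and let them exchange their subscripts."""
    first = replace(p.first, negated=not p.first.negated,
                    universal=p.second.universal)
    second = replace(p.second, negated=not p.second.negated,
                     universal=p.first.universal)
    return Proposition(first, second)

row = Proposition(Term("m", universal=True), Term("v"))
steps = [row, convert(row), transform(convert(row)),
         convert(transform(convert(row)))]
print(" | ".join(str(step) for step in steps))
```

Starting from “all mammals are vertebrates”, the four steps reproduce the first row of the table; the other rows follow by changing the initial `Proposition`.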
We represent the logical connectives of the propositional calculus as elimination of possibilities, that is, as elimination of certain nodes in the network [2, 3]. However, we consider the AB and AB conjunction as an affirmation of existence, that is, a confirmed presence or absence that entails the activation of its qualities (+1 or −1). This activation can be spread across the network following principles very similar to Bayes’ laws. It should be noted that PQ is equal to the expression (p ∧ q) of the propositional calculus. That is, we follow De Morgan’s distinction between two different types of conjunctions. In our networks, an Object node expresses a composition of qualities, while the And nodes express an aggregation of individuals [9]. There are some restrictions in the way we can create compositions. First, we cannot put together A and ¬A (it is impossible to have A¬A), but we can create an aggregate (A; ¬A). We cannot combine two different ways to exist within the same object (it is impossible A B), but we can create an aggregate (A;B). It is also important to note that the exact value of the Or nodes and the And nodes always depends on the value of the central elements of the set, but not on other Or or And nodes. For example, in case 13, the assumed value of node [¬cb] is −0,2e since it contains only one element (¬cb¬a), while
Fig. 12. First-order logic exercises solved in a heterogeneous way (1) [3].
the assumed value of node [¬c¬b] is −0,4e because it contains two elements (¬c¬ba; ¬c¬b¬a). The more elements a set has, the easier it is to expect the presence of at least one of them (Or node → true) and the more difficult it is to expect the presence of all of
them (And node → false). It should be easy to understand why ¬b¬a has more weight than ¬ba in case 8. Each of the thirteen cases in Figs. 12 and 13 shows a logic exercise solved in a heterogeneous way using the expectation networks. In cases one to three, the same syllogism with two negative premises is solved, but changing the perspective from a to b and c respectively: No A is B; No C is A. We have interpreted these premises as All A is ¬B and All C is ¬A. Furthermore, we have represented them as if they implied at least the expectation of the existence of beings with the qualities A¬B and C¬A.
Fig. 13. First-order logic exercises solved in a heterogeneous way (2) [3].
Therefore, we have drawn a blue line (+1e) around the nodes that remain after considering the premises when they have these qualities, although we have kept their interior uncertain (yellow = 0). In the case of cb¬a and c¬b¬a, the outline is green (+0,5e = probable) because both elements are candidates to be the object with the qualities c¬a. We have used capital letters to denote the presence or absence of A and ¬A. That is, we write in capital letters the variables that have an absolute value of “1” or “−1”. In case number 4 we affirm with certainty the presence of “a”: (a → ¬b) ∧ (¬c ∨ ¬a) ∧ A. In case 5 we can see that the falsehood of “a” (A) does not mean that “¬a” is true (¬A). For example, the fact that it is false that there are Asians on the beach now does not imply that now there are non-Asians. Therefore, not having A does not mean the same as having ¬A in our networks.
8 Limits of the Work and Conclusions

Our networks simply allow us to deal with inferences that, according to the distinction made by cognitive psychology, only refer to declarative knowledge, which is opposed to procedural knowledge. In some way, it is possible to translate procedural knowledge into declarative knowledge by making action schemes explicit through conditional structures that associate a stimulus with a response. But if we want to turn our networks into authentic models of thought with learning capacity, then we should improve them with many complex mechanisms. First, we must include needs and threats with an emotional value related to the survival of the system: What should I look for and what should I avoid now? What is urgent here and now? At this level, we would have to integrate values related to the survival of the system itself (egotism) or its community (altruism). Only in this way could we recreate the dilemmas and conflicts that happen to ordinary people. And to understand how people solve these problems, we should also integrate game theory into our system. Second, we should integrate the weight of the experience in the nodes of our network, as it happens in the learning networks: What can I expect now? Furthermore, we should include the ability to reorganize information to find regularities in the situations faced by the system. To that, we must also add that the system is not yet backed by experimental research on how people actually reason and some of the principles that we have established can be questioned from different theoretical perspectives. Remember that the paradigm of our networks is not easy to label. It does not conform exactly to the paradigm of logic in the Principia Mathematica written by Whitehead and Russell, neither to the logic of probability nor to fuzzy logic.
In conclusion, we began our investigation because all the logical systems that we knew seemed to us essentially incomplete or too complicated for the task of teaching logic. The same principles of inference should support Aristotelian logic, propositional calculus, predicate logic, modal logic, and even probabilistic and statistical reasoning [16, 17]. Therefore, we were looking for a system capable of dealing not only with the necessary conclusions but also with the probable and uncertain ones. As a result, we built a system in which any first-order logical inference can be represented using a very limited number of logic gates. Deductive, inductive, abductive, and even statistical inferences can be explained by the activation and inhibition relationships between the nodes
of the sets that we propose here. And we did it from a heterogeneous reasoning perspective, allowing a perfect integration of linguistic, chromatic, and numerical information. Marlo’s networks of expectations represent nothing more than elementary processes of reasoning. However, if we have successfully identified these processes, then we have reasons to believe that our networks can become the basis for new neural network models. In fact, Marlo’s expectation networks are heirs to the McCulloch and Pitts (1943) perceptron, as well as other connectionist models with which they share many of their limitations [10, 15, 20]. Combinatorics would be the a priori foundation of an architecture modeled a posteriori through individual and social learning. The information encoded in their cells would serve as the basis for the organism’s decisions and could be communicated with great precision through mathematical propositions or in a more imprecise but simpler way through natural language.
References

1. Aznar, M.B.L.: Visual reasoning in the Marlo diagram. In: Sato, Y., Shams, Z. (eds.) Ceur-ws.org, vol. 2116 (2018). http://ceur-ws.org/Vol-2116
2. Aznar, M.B.L.: Marlo’s Networks of Expectations in the Classroom: A Tool for Heterogeneous Reasoning. In: Pietarinen, A.-V., Chapman, P., Bosveld-de Smet, L., Giardino, V., Corter, J., Linker, S. (eds.) Diagrams 2020. LNCS (LNAI), vol. 12169, pp. 503–506. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-54249-8_44
3. Aznar, M.B.L.: Diagramas lógicos de Marlo para el razonamiento visual y heterogéneo: válidos en lógica matemática y aristotélica. Doctoral dissertation. University of Huelva, Spain (2020). http://rabida.uhu.es/dspace/handle/10272/19769
4. Aznar, M.B.L.: Logic diagrams for heterogeneous reasoning: expectation networks [video file] (2020). Available online https://www.youtube.com/watch?v=rm8sHbKD6U0
5. Barwise, J., Etchemendy, J.: The Language of First Order Logic: Including the Program Tarski’s World 3.0. CSLI Lecture Notes, vol. 23 (1991)
6. Bocheński, I.M.: A History of Formal Logic. University of Notre Dame Press, Notre Dame, Indiana (1961)
7. Boole, G.: An Investigation of the Laws of Thought, on which are Founded the Mathematical Theories of Logic and Probabilities. Dover, New York (1854)
8. Carroll, L.: The game of logic. Macmillan, London (1886)
9. De Morgan, A.: On the Syllogism, No. III, and on Logic in General. Printed by C.J. Clay at the University Press, Cambridge (1858)
10. Feldman, J., Ballard, D.: Connectionist Models and Their Properties. Cogn. Sci. 6, 205–254 (1982). https://doi.org/10.1207/s15516709cog0603_1
11. Gardner, M.: Logic machines and diagrams. McGraw-Hill, New York (1958)
12. Jevons, W.: Pure logic or the logic of quality. Stanford, London (1864)
13. Jevons, W.: The Substitution of Similars, The True Principle of Reasoning. Derived from a Modification of Aristotle’s Dictum. Macmillan and Co., London (1869)
14. Łukasiewicz, J.: Elements of mathematical logic. Pergamon Press, Oxford (1991)
15. McCulloch, W., Pitts, W.: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biol. 52(1–2), 99–115 (1990). https://doi.org/10.1007/BF02459570
16. Over, D.E.: New paradigm psychology of reasoning. Think. Reason. 15(4), 431–438 (2009). https://doi.org/10.1080/13546780903266188
17. Oaksford, M., Chater, N.: Bayesian rationality: The probabilistic approach to human reasoning. Oxford University Press, Oxford (2007)
18. Peirce, C.S.: Prolegomena to an apology for pragmaticism. Monist 16(4), 492–546 (1906)
19. Peirce, C., Library, H.: MS The Charles S. Peirce Papers. Harvard University Library, Cambridge, MA. Photographic Service. MF.66. (1966)
20. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386–408 (1958). https://doi.org/10.1037/h0042519
21. Shin, S.-J.: Heterogeneous reasoning and its logic. Bull. Symb. Log. 10(1), 86–106 (2004). https://doi.org/10.2178/bsl/1080330275
Complete Blood Analysis: An Android OCR-Based Interpretation

Malik Almaliki1 and Elsayed Atlam1,2(B)

1 College of Computer Science and Engineering, Taibah University, Yanbu, Saudi Arabia
[email protected], [email protected]
2 Faculty of Science, Tanta University, Tanta, Egypt
Abstract. The Complete Blood Count (CBC) test is part of routine medical care for many people. It can uncover serious health problems such as anemia, infection, and even blood cancer. However, CBC results are normally presented in English and contain medical abbreviations. This makes CBC results hard to understand, especially for patients who do not speak English or do not know the meanings of medical abbreviations. This paper aims at developing an Android application that helps patients view, interpret and understand their CBC results in a user-friendly manner. The application employs Optical Character Recognition (OCR) technology that allows patients to scan their CBC results and to extract, interpret and translate into Arabic (if needed) the medical information contained in these results. It also gives patients the ability to store records of their CBC results for future retrieval and comparison analysis. This study is meant to maximize patients’ awareness about their health conditions based on their CBC results and suggest the measures to be taken in that regard. Experimental results show the developed system can gain 92.48% accuracy for counting the CBC.

Keywords: Complete Blood Count · Optical Character Recognition · Medical information
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 278–293, 2022. https://doi.org/10.1007/978-3-031-10464-0_18

1 Introduction

Many applications have been developed to serve patients and others who care about their health, in the belief that technology can improve the way they deal with routine medical processes such as blood test records. One such record, the Complete Blood Count (CBC) [1], lists a number of important values. Typically, it includes the following: White blood cell count (WBC), WBC differential count, Red blood cell count (RBC or erythrocyte count), Hematocrit (HCT), Hemoglobin (HGB), Mean corpuscular volume (MCV), Mean corpuscular hemoglobin (MCH), Mean corpuscular hemoglobin concentration (MCHC), Red cell distribution width (RDW), Platelet count and Mean Platelet Volume (MPV). A CBC says a lot about a patient's health, but it might not make sense to the patient at first glance, and further searching and explanation from a specialist are usually required. A blood test report is part of routine medical care, and for some people it is an annual process. People who examine their own CBC either ignore some important factors or look for extra information and explanation, and may face problems due to their lack of knowledge. This test is very serious as it can uncover anemia, infection, and even blood cancer [2], but people face difficulties in understanding it and may even put their own health at risk by not taking the actions that need to be taken.

Optical Character Recognition (OCR) technology [3] is used to scan printed text and convert it to a digital form on a computer. It has been widely used in recent years because it makes it easier to enter data and store it in digital form. Mobile applications are used extensively, and many of them have started to offer health monitoring features that help users track their own health on a daily basis. Our investigation found many applications and websites that offer services related to blood tests, but after testing them personally we found that most of them do not fulfill user needs in a convenient way: some applications are hard to understand, some only give information, and some were useful. We also found some limitations in those applications while investigating them. Most applications do not support user interaction, as they only display information that is hard for a regular user to understand. We aim to develop an Android application where users can enter the results of their CBC report and get useful information regarding those results. It focuses on delivering information regardless of the user's personal health details. The Android application gives the user a unique service that puts an end to the difficulties related to interpreting CBC test results. This research aims to deliver a useful Android application that helps users understand blood test records and track health conditions based on personal CBC results.
We also aim to develop the Android application with an easy and friendly interface, as these are main aspects in the success of any application. It provides the user with accurate and reliable information, as users see their real results captured in the application. Regarding the results, we meet with doctors, lab technicians and other people during development to get all the information needed; the Android application will be tested by doctors, and their opinions will be included in the appendix of the documentation. Our goal is to develop an Android application that helps users understand and get the best value out of their blood test records. This leads to the need for an easy application that obtains CBC information from the user in a simple way, either by using a mobile camera to upload the lab result or by adding test results to the mobile application manually. This study develops an Android application that serves patients' needs in this matter. The application's main function is to extract the user's blood test record data and analyze it. Then, it provides users with information clarifying their CBC results and the consequences of these results. It also integrates the camera to capture or upload the blood test records and a recognition system to read and analyze information. Experimental results show the developed system can gain 92.48% accuracy for counting the CBC.

This paper is organized as follows: Sect. 2 provides the reader with an overview of the literature review. Section 3 provides the system methodology. Section 4 shows the architectural models to demonstrate the overall structure of the system design, together with the experimental results and evaluations. Section 5 gives the conclusion and the proposed future work.
2 Related Works

This section presents related literature in the same area of interest, reviewing the available features and showing the strengths and weaknesses of each reviewed work.

2.1 Blood Test Guide

Blood Test Guide [4] is an application for iOS tablets developed by O Clock Software Pvt Ltd, a full-service web and mobile app development agency. The app, shown in Fig. 1, was created on Jan 17, 2012. The application helps users understand their blood test report. It includes 47 blood elements that are represented as categories. For each element, it shows the clinical, optimal and red-flag adult ranges and the common cause of its decrease. For some elements, it also gives two kinds of information related to the chosen element: clinical and nutritional.
Fig. 1. Blood Test Guide Application Interface and Features
2.2 Smart Blood Pressure (SmartBP) BP Tracker

SmartBP is a smart application [5] developed by Evolve Medical Systems, LLC on Jul. 24, 2012. The application offers a smarter way of managing blood pressure measurements and tracking progress, as shown in Fig. 2. It allows users to record, track, analyze and share their information. In addition, it connects with Apple HealthKit and Microsoft HealthVault. The main purpose of the application is to help users track and improve their blood pressure, as shown in Fig. 3.
Fig. 2. Smart blood pressure application interface and features
Fig. 3. Blood pressure graph in the application
2.3 iCare Health Monitor

iCare Health Monitor [6] can measure a patient's blood pressure, heart rate, vision, hearing, SpO2, breath rate, psychology and other physical data using the phone. It also allows users to record the measurement results in the “Health” application so they can monitor their health data. The more they test, the more precise the result will be, as shown in Fig. 4.
Fig. 4. iCare health monitor interface and features
2.4 Medical Lab Tests Samp Lab Values Application [7] by iMedical application. It gives users the differentials for high and low lab values, and has a vast amount of high yield information. Medical Lab Tests covers the most common laboratory tests and their interpretation as shown in Fig. 5. All reference values in US and SI units. Health care professionals often find it difficult to interpret lab tests and remember lab values. This application will help them!
Fig. 5. Medical lab test application features and graph
2.5 Lab Tests Online Website

Lab Tests Online is a free website [8–11] designed to help users understand several areas of laboratory medicine, CBC being one of them, as shown in Fig. 6. It is produced by AACC, a global scientific and medical professional organization dedicated to clinical laboratory science, in collaboration with 16 other laboratory professional societies in the United States and Canada, with volunteer efforts provided by members of its Editorial Review Board and financial support from its sponsors. It also has a mobile application that users can download to their smartphone or tablet for only $0.99.
Fig. 6. Lab test online website
In conclusion, although the Complete Blood Count (CBC) test is part of routine medical care for many people and many approaches have been proposed to study it, only a few contributions have been made toward an Android application that helps patients view, interpret and understand their CBC results in a user-friendly manner. In addition, this is the first study to use an Android application to provide patients with the ability to store records of their CBC results for future retrieval and comparison analysis. This study is meant to maximize patients’ awareness about their health conditions based on their CBC results and suggest the measures to be taken in that regard.
3 Methodology

3.1 Agile Model

Agile methodology is considered one of the best methodologies for developing software. Software developed using agile methodology is more efficient than software produced by other methodologies, according to the article “8 benefits of Agile software development” by Segue Technologies [9]. The Agile method splits the project into iterations. Each phase or iteration is developed and tested as a single task [10]. Figure 7, taken from [11], describes the main steps of this model.
Fig. 7. Agile model
3.2 Conceptual System Components

Figure 8 shows the system components. CBC users request services from the CBC application, such as scanning or uploading a CBC report to be analyzed, and the application responds by accepting client requests. Users must first register in the application to log in to their accounts.
Fig. 8. System components
This application works with a local database, meaning users are created locally, so a user who no longer wants to use the application can simply uninstall it.
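The scan-then-interpret flow described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's actual code: it starts from text that an OCR step has already produced, the function names are hypothetical, and the numeric ranges in `EXAMPLE_RANGES` are placeholders for illustration only, not clinical reference values.

```python
import re

# Placeholder ranges for illustration only; NOT clinical reference values.
EXAMPLE_RANGES = {
    "WBC": (4.0, 11.0),
    "RBC": (4.0, 6.0),
    "HGB": (12.0, 17.0),
}

# A line is expected to look like "WBC 12.5 ..." or "HGB: 13,2 g/dL".
LINE_PATTERN = re.compile(r"^\s*([A-Za-z]+)\s*[:=]?\s*(\d+(?:[.,]\d+)?)")


def parse_report(ocr_text):
    """Extract known abbreviations and their numeric values from OCR text."""
    values = {}
    for line in ocr_text.splitlines():
        match = LINE_PATTERN.match(line)
        if not match:
            continue
        name = match.group(1).upper()
        if name in EXAMPLE_RANGES:
            # Accept a decimal comma as well as a decimal point.
            values[name] = float(match.group(2).replace(",", "."))
    return values


def classify(name, value, ranges=EXAMPLE_RANGES):
    """Label a value as low, normal or high against the given range."""
    low, high = ranges[name]
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "normal"


def interpret(ocr_text):
    """Map each recognized abbreviation to (value, category)."""
    return {name: (value, classify(name, value))
            for name, value in parse_report(ocr_text).items()}


sample = "WBC 12.5 10^9/L\nHGB: 13,2 g/dL\nRBC 3.1\nNotes: fasting"
for name, (value, label) in interpret(sample).items():
    print(name, value, label)
```

Keeping the ranges in a separate table means they could be replaced by values confirmed by physicians without touching the parsing logic.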
4 Experimental Evaluation

4.1 Techniques Used to Collect Requirements

First, we conducted 1:1 online interviews with two physicians: an endocrinologist and a physician working at the College of Nursing. The interviews gave us information that helps in analyzing the application data; it was essential to understand more about CBC tests and the advice that should be displayed to users in the application in case of low or high CBC readings. Although we could have obtained this information online, we needed information from credible sources. The interviews were done with the permission of both doctors; they were aware that we would publish the interviews in this document, and we had their agreement before moving forward.

4.2 Interviews

In the first interview, the physician recommended an annual CBC test for healthy people, and he highlighted that CBC tests are helpful in checking for anemia, leukemia, fever, weakness, and blood sugar. When we asked for advice on staying healthy, he advised a healthy, balanced diet to keep the CBC in the normal range, as well as avoiding smoking and junk food. An active lifestyle is vital as well. Furthermore, he added that age is a very important factor in CBC results, as are family medical history and medications taken by patients.

In the second interview, the other physician's opinion differed somewhat from the first regarding CBC checkups, as she recommended performing a CBC test every 6 months. Her answers were the same as the first physician's regarding the factors affecting CBC results, what CBC results are helpful in checking for, and how to maintain a healthy life. She explained the symptoms for which people should take a CBC test: for example, if a person feels fatigue and has a loss of concentration, paleness, and easy weakness, this may be an anemia condition and it is really recommended to perform a CBC test. She added that a severe case may lead to an increase in heart rate. WBC may be affected if the person is infected; bleeding is reflected in the platelet result, which is a strong indicator of whether bleeding is high or low.

4.3 Survey

A survey was conducted using Google Docs. Its aim was to gather people's opinions about the CBC Analyzer application and whether it is worth building. The survey was distributed to all our friends on WhatsApp, Telegram and Instagram to collect a wide variety of user types and ages and to get the maximum information from different perspectives and views. The survey helps us gather information for modeling the application's requirements. Furthermore, the survey was analyzed by Google Docs, and the analysis results are explained in detail in this section.
There were 1005 survey respondents. Around 90% of respondents were female, and the most dominant age group was 21 to 30, at around 44% of all respondents, followed by 22.7% for the age range 31 to 40, as shown in Fig. 9; blue indicates male participants and orange indicates female participants.
Fig. 9. Age and gender of responders
Around 53% of our respondents had performed a CBC test, as Fig. 10 shows, versus around 47% who had not. This highlights the importance of performing CBC tests for healthy people. Blue indicates yes and orange indicates no in Fig. 10.
Fig. 10. Performed CBC test
Figure 11 shows that most respondents, around 84.5%, did not understand the CBC results and had to see a doctor to explain them; this indicates that an application for understanding the results is really important. Blue indicates yes and orange indicates no in Fig. 11.
Fig. 11. Can you understand CBC results? (Yes: 15.50%, No: 84.50%)
More than 65% of the respondents replied that they do not have enough information about their CBC test analysis, as shown in Fig. 12. Blue indicates yes and orange indicates no in the figure.
Fig. 12. Do you have enough information about CBC test? (Yes: 34.50%, No: 65.50%)
As we expected, a majority of people do not understand the medical abbreviations, as more than 92% replied either yes or sometimes, as shown in Fig. 13. Blue indicates yes, orange indicates no, and gray indicates sometimes in the figure.
Fig. 13. Medical abbreviations is hard? (legend: Yes, No, Sometimes; values shown: 44.10%, 49.60%, 5.30%)
When asked for their opinion about an application that translates the results in a way that is easy for the patient to understand, more than 92% agreed with the idea of the application, as shown in Fig. 14. On another note, 97% said that they prefer a credible application that gives credible information. Blue indicates yes, orange indicates no, and gray indicates sometimes in the figure.
Fig. 14. Opinion about application (legend: Yes, No, Sometimes; values shown: 92.60%, 5%, 2.00%)
4.4 Experimental Results

Main Screen

As shown in Fig. 15, the program starts by showing the application logo. After opening the application, users just have to press Start to start the application in either English or Arabic. The application then asks for permission to take pictures.
Fig. 15. Main screen
Scan Camera

As shown in Fig. 16(a) and (b), after the scan test option is selected, a window appears for selecting a picture from the device's storage or taking a new picture using the camera.
Fig. 16. Scanning/album importing
Analysis

After several test results have been taken, they can be gathered together, and the application displays a graph showing the patient's history, including the dates and times of inspection. Figure 17 illustrates the analysis.
Fig. 17. Results analysis
Deleting a Record

Users can delete old tests stored in the application. The user selects the test to delete and clicks the delete button; a pop-up message appears, and when the user confirms, the test is deleted from the application, as shown in Fig. 18.
Fig. 18. Deleting a record
Accuracy Evaluation

The performance of the system was evaluated on 15 photos captured from an educational-type blood smear provided by the supplier. For the assessment, the quantitative measurement is performed using the confusion-matrix parameters: True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN). The accuracy, precision and recall are then calculated using Eq. (1). Accuracy evaluates how well the overall system performs with respect to the ground truth data. Precision shows how many of the detected CBC are correct, whereas recall gives how many CBC are correctly counted from the whole photo. These three parameters are analyzed on CBC as in Eq. (1) [12–16].

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}; \quad
\text{Precision} = \frac{TP}{TP + FP}; \quad
\text{Recall} = \frac{TP}{TP + FN} \tag{1}
\]
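As a small illustration of Eq. (1), the three measures can be computed from the four confusion-matrix counts as sketched below. The function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) as fractions, following Eq. (1)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one count must be non-zero")
    accuracy = (tp + tn) / total
    # Guard against empty denominators when nothing was detected or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts chosen only to show the arithmetic.
accuracy, precision, recall = classification_metrics(tp=90, tn=5, fp=3, fn=2)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} recall={recall:.2%}")
```

With these example counts the accuracy is 95/100, the precision is 90/93 and the recall is 90/92.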
Table 1. Result analysis of CBC counting

Photo   | TP  | TN | FP | FN | Accuracy% | Precision% | Recall%
1       | 105 | 1  | 6  | 2  | 91.22     | 92.71      | 97.25
2       | 79  | 1  | 0  | 4  | 94.27     | 99.00      | 94.35
3       | 80  | 1  | 15 | 6  | 78.15     | 82.74      | 92.37
4       | 117 | 1  | 6  | 0  | 94.22     | 94.31      | 99
5       | 81  | 1  | 6  | 2  | 90.12     | 91.23      | 96.88
6       | 99  | 1  | 3  | 2  | 94.22     | 96.16      | 97.45
7       | 126 | 1  | 3  | 5  | 91.13     | 96.56      | 95.78
8       | 102 | 1  | 3  | 7  | 96.83     | 96.55      | 92.95
9       | 131 | 1  | 2  | 1  | 94.14     | 97.85      | 98.60
10      | 109 | 1  | 5  | 2  | 91.17     | 94.78      | 97.70
11      | 103 | 0  | 8  | 2  | 90.78     | 91.98      | 97.45
12      | 118 | 0  | 8  | 3  | 91.12     | 92.87      | 96.33
13      | 93  | 0  | 5  | 3  | 96.56     | 93.88      | 95.60
14      | 108 | 0  | 2  | 1  | 96.69     | 97.88      | 98.25
15      | 99  | 0  | 3  | 0  | 96.55     | 96.19      | 99.25
Average |     |    |    |    | 92.48     | 94.33      | 96.62
Table 1 summarizes the CBC counting results. It illustrates that the average accuracy, precision and recall of the analysis system reach 92.48%, 94.33% and 96.62% respectively in counting the CBC from blood smear photos. This shows that the system is capable of detecting the CBC with few errors. In conclusion, this smart application helps the user understand the CBC blood test and provides a large benefit by analyzing and storing the reports; however, no matter how much service the application provides, the doctor's role cannot be eliminated, as it is an essential part of critical health evaluation.
5 Conclusion

The Smart CBC analyzer is an Android application developed to help people analyze and understand their CBC blood tests in an easy way, by uploading their results (either manually or using OCR technology) to the application and viewing the analysis of the results. The application is proposed to overcome the limitations found while investigating related work and to help users understand their CBC test and get visual results listed in an easy, categorized way, with a color for each category: normal, high or low. It is designed to give the user a simple way to store test results for future retrieval and analysis, using a curve diagram to help the user track personal health conditions and improve their lifestyle. There are some constraints in this application due to the short amount of time available for developing it. Future work can be summarized in the following points: 1) Support the iOS platform. 2) Support medical tests other than CBC. 3) Provide the user with advice and possible diagnoses. 4) Multiuser support, allowing more than one person to be added to the application so that it can be used by CBC labs.
References

1. MedicineNet.com: [Online]. Available: https://www.medicinenet.com/complete_blood_count/article.htm. Accessed October 2018
2. Web MD: [Online]. Available: https://www.webmd.com/cancer/lymphoma/symptomswatchfor#1. Accessed October 2018
3. Wikipedia: [Online]. Available: https://en.wikipedia.org/wiki/Optical_character_recognition#Character_recognition. Accessed October 2018
4. O.C.S.P. Ltd.: App Store Preview. [Online]. Available: https://itunes.apple.com/us/app/bloodtest-guide/id491681195?mt=8. Accessed October 2018
5. L. Evolve Medical Systems: App Store Preview. [Online]. Available: https://itunes.apple.com/us/app/blood-pressure-smart-blood/id519076558?mt=8. Accessed October 2018
6. I.F. Studio: iCare Health Monitor. [Online]. Available: http://www.icarefit.com/. Accessed October 2018
7. M. Apps: App Store Preview. [Online]. Available: https://itunes.apple.com/us/app/medicallab-tests/id307829594?mt=8. Accessed October 2018
8. A.A.f.C. Chemistry: Lab Tests Online. [Online]. Available: https://labtestsonline.org/. Accessed October 2018
9. Segue Technologies: [Online]. Available: https://www.seguetech.com/8-benefits-ofagile-software-development/. Accessed October 2018
10. L. SEO: “LINCHPINSEO” [Online]. Available: https://linchpinseo.com/the-agilemethod/. Accessed October 2018
11. Antaes: [Online]. Available: https://www.antaes.ch/en/news/antaes-asia-is-trained-inagile/. Accessed October 2018
12. Google: Google Developers. [Online]. Accessed November 2018
13. Hamouda, S.K.M., Wahed, M.E., Abo Alez, R.H., Riad, K.: Robust breast cancer prediction system based on rough set theory at National Cancer Institute of Egypt. Comp. Methods and Programs in Biomedicine 153, 259–268 (2018)
14. Hamouda, S.K.M., Abo El-Ezz, R.H., Wahed, M.E.: Enhancement accuracy of breast tumor diagnosis in digital mammograms. J. Biomed. Sci. 6(4), 28 (2017). ISSN 2254-609X
15. Hamouda, S.K.M., Abo El-Ezz, H.R., Wahed, M.E.: Intelligent system for predicting, diagnosis and treatment of breast cancer. Int. J. Biomed. Data Mining 6, 2 (2017). https://doi.org/10.4172/2090-4924.1000128
16. Atlam, E.-S., Fuketa, M., Morita, K., Aoe, J.: Document similarity measurement using field association term. Info. Proce. Manage. J. 39(6), 809–824 (2003)
Refined Optimal Control Problem and Its Solution Using Symbolic Regression

Askhat Diveev

Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences, Moscow 119333, Russia
Abstract. A refined statement of the optimal control problem is presented whose solution can be implemented directly on the control object. The solution of the classical optimal control problem cannot be implemented on the object directly, because it yields an open-loop control system. The refined statement adds a condition: the optimal trajectory in the state space must have an attraction property in some neighborhood. To solve the refined problem, the synthesized control method is proposed. In this method the control synthesis problem is solved first, so that the control object becomes stable about some equilibrium point in the state space. The optimal control problem is then solved by moving the position of this stable equilibrium point. An example of solving the refined optimal control problem with complex phase constraints for a mobile robot is presented. For comparison, the same problem is solved by a direct method without the additional condition. Both solutions are simulated with perturbations of the model. The experiments show that the solution of the refined problem is less sensitive to perturbations than the solution of the classical one.
Keywords: Optimal control · Attractor · Symbolic regression · Control synthesis

1 Introduction
The authors of the optimal control problem stated more than once in their monograph [1] that the optimal control problem, in contrast to the calculus of variations, has a clear applied orientation. However, further studies of the problem have shown that a solution obtained in the form of control as a function of time cannot be applied directly to a real object. To apply the optimal control found, it is first necessary to determine the optimal trajectory that it generates, and then to build a system that stabilizes the motion of the object along this trajectory. The motion stabilization system [2] must take into account not only the position of the object relative to the optimal trajectory but also link this position to time, which requires additional control resources. It should also be borne in mind that the mathematical model of the control object with a system for stabilizing motion along a given trajectory differs from the mathematical model of the object without the stabilization system. Therefore, the optimal control found for an object without a stabilization system will not be optimal for the object with a stabilization system.

In this paper, the classical statement of the optimal control problem is presented first. Then a definition of the realizability of the mathematical model of the control object is introduced and a theorem about it is proved. Next, the refined statement of the optimal control problem with an additional requirement is presented. To solve the refined problem, the synthesized control method is proposed, and the paper includes a description of this method. According to the method, the control synthesis problem is solved first. As a result, the control object becomes stable about some equilibrium point in the state space. After that, the optimal control problem is solved by moving the position of this stable equilibrium point. For the control synthesis problem, machine learning of control by symbolic regression is proposed. For the optimal control problems, evolutionary algorithms are used. In the experimental part, an example of solving the refined optimal control problem with complex phase constraints in the form of bottlenecks for a mobile robot is presented. The problem is solved by the synthesized control method, with the synthesis problem solved by the network operator method. For comparison, the same problem in the classical statement is solved by a direct method. Both solutions are simulated with perturbations of the mathematical model of the control object. The computational experiments show that the solution of the refined optimal control problem is less sensitive to perturbations than the solution of the classical problem.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 294–305, 2022. https://doi.org/10.1007/978-3-031-10464-0_19
2 A Refined Optimal Control Problem
We first consider the classical optimal control problem and then add one condition, to show how the refined statement differs from the classical one.

2.1 Classical Formulation of the Optimal Control Problem
The mathematical model of the control object is given in the form of a system of ordinary differential equations

$$\dot{x} = f(x, u), \tag{1}$$

where $x$ is the state vector, $x \in \mathbb{R}^n$, $u$ is the control vector, $u \in U \subseteq \mathbb{R}^m$, $U$ is a compact set, and $m \le n$.

For system (1) the initial conditions are given

$$x(0) = x^0. \tag{2}$$

The terminal conditions are given

$$x(t_f) = x^f, \tag{3}$$

where $t_f$ is defined by the achievement of the terminal condition and is bounded:

$$t_f = \begin{cases} t, & \text{if } t < t^+ \text{ and } \|x^f - x(t)\| \le \varepsilon_0, \\ t^+, & \text{otherwise}, \end{cases} \tag{4}$$

where $t^+$ and $\varepsilon_0$ are given positive values.

The quality criterion is given

$$J_0 = \int_0^{t_f} f_0(x, u)\,dt \to \min. \tag{5}$$

To solve the problem it is necessary to find a control function

$$u = v(t) \in U \subseteq \mathbb{R}^m. \tag{6}$$

Substituting the control function (6) into system (1) gives a system without a control vector in the right-hand side:

$$\dot{x} = f(x, v(t)). \tag{7}$$
The particular solution $x(t, x^0)$ of system (7) from the initial condition (2) reaches the terminal condition (3) with the optimal value of the quality criterion (5). The difficulty with this statement is that the optimal solution found is a control without feedback, so it cannot be used on the real object directly. Any small perturbation, or any deviation of the real object from the mathematical model (1), leads to a different trajectory in the state space, which will not reach the terminal state (3) and will give a different value of the quality criterion (5).

The concept of realizability, or feasibility, of a mathematical model should be discussed here. There are always perturbations acting on the real object, and there are always differences between the data obtained from the model and the data measured on the real object.

Definition 1. A mathematical model of a real object has the feasibility property if the error between the model and the real object does not increase in time.

For example, a mathematical model of a stable object in the neighborhood of a stable equilibrium point is always feasible, even if the position of the stable equilibrium point is determined with an error. If the mathematical model of the object is also stable with respect to the equilibrium point, then over time any particular solution of the model that starts in the neighborhood of the equilibrium point will approach the equilibrium point. The real object will also arrive at the equilibrium point over time. The error in determining the equilibrium point does not increase over time, so the model is feasible.

Theorem 1. If the mathematical model of a real object is a contracting mapping in some domain of the state space, then the model has the feasibility property in this domain.
Proof. Assume that

$$\dot{x} = f(x) \tag{8}$$

is a mathematical model of a real object. Let

$$X_0 \subseteq \mathbb{R}^n \tag{9}$$

be a domain in the state space where model (8) is a contracting mapping. Then any two particular solutions from two initial conditions

$$x^{0,1}, x^{0,2} \in X_0 \tag{10}$$

have the property

$$\|x(t, x^{0,1}) - x(t, x^{0,2})\| \ge \|x(t + \Delta t, x^{0,1}) - x(t + \Delta t, x^{0,2})\| \tag{11}$$

for any $\Delta t > 0$. Let $x(0) = x^{0,1}$ be the initial position of the mathematical model and $\tilde{x}(0) = x^{0,2}$ be the position of the real object. The error is $\delta(0) = \|x^{0,1} - x^{0,2}\|$. After the time $\Delta t$ the real object moves to the position $\tilde{x}(\Delta t)$. According to the contracting mapping property (11) we obtain

$$\delta(\Delta t) = \|\tilde{x}(\Delta t) - x(\Delta t)\| \le \delta(0).$$

The error does not increase.

2.2 Refined Formulation of the Optimal Control Problem
In the new problem, instead of the control function (6) of time, we search for a control function of the form

$$u = g(x, t) \in U. \tag{12}$$

Substituting the control function (12) into the model (1) gives the differential equation

$$\dot{x} = f(x, g(x, t)). \tag{13}$$

The control function (12) must give the differential equation (13) the following properties. The particular solution $x(t, x^0)$ from the initial condition (2) reaches the terminal state (3) with the optimal value of the quality criterion (5), and for any other particular solution $x(t, y)$ from an initial condition $y$ the following holds: if for some $t' > 0$

$$\|x(t', x^0) - x(t', y)\| \le \Delta, \tag{14}$$

then $\exists t'' > 0$ such that $\forall \varepsilon > 0$

$$\|x(t'', x^0) - x(t'', y)\| \le \varepsilon. \tag{15}$$
The refined optimal control problem thus consists of formulas (1)–(5), (12), (14), (15). Conditions (14) and (15) require that the optimal trajectory in the state space have an attracting property. This additional requirement on the optimal trajectory is associated with synergetic control [3,4]. Meeting this requirement makes the system of differential equations a contraction mapping, so the optimal control found is feasible and can be applied to a real object directly. We now consider one approach to solving the refined optimal control problem.
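The difference between a program control $u = v(t)$ and a feedback control $u = g(x, t)$ with an attracting trajectory can be illustrated on a toy example. The sketch below is not from the paper: it assumes a scalar model $\dot{x} = u$, a reference trajectory $x(t) = t$, and a hypothetical feedback gain `GAIN`. It integrates two solutions that start 0.5 apart and reports their distance at the end of the run.

```python
def simulate(control, x0, t_end=5.0, dt=0.001):
    """Integrate the toy model dx/dt = u with explicit Euler; return x(t_end)."""
    x, t = x0, 0.0
    for _ in range(int(round(t_end / dt))):
        x += dt * control(x, t)
        t += dt
    return x


def open_loop(x, t):
    """u = v(t): program control, ignores the current state."""
    return 1.0


GAIN = 2.0  # assumed feedback gain, chosen only for illustration


def closed_loop(x, t):
    """u = g(x, t): the same program plus feedback on the deviation from x_ref(t) = t."""
    return 1.0 + GAIN * (t - x)


def final_gap(control, x0=0.0, y=0.5):
    """Distance at t_end between the solutions started from x0 and from y."""
    return abs(simulate(control, x0) - simulate(control, y))


if __name__ == "__main__":
    print("open loop:  ", final_gap(open_loop))
    print("closed loop:", final_gap(closed_loop))
```

In this toy case the open-loop gap stays at its initial value, while the closed-loop gap decays, which is the behavior that conditions (14) and (15) ask of the optimal trajectory.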
3 A Synthesized Optimal Control
Synthesized control consists of two stages [5,6]. In the first stage the control synthesis problem is solved, so that the mathematical model of the control object becomes stable about some equilibrium point of the state space. In the second stage the positions of the equilibrium points are found so as to solve the control problem. Consider this approach in more detail.

For the control synthesis problem the mathematical model of the control object (1) is given. The domain of initial conditions is given

$$X_0 \subseteq \mathbb{R}^n. \tag{16}$$

It is necessary to find a control function in the form

$$u = h(x^* - x) \in U \subseteq \mathbb{R}^m, \tag{17}$$

where $x^*$ is some point in the state space. Substituting (17) into (1) gives the system

$$\dot{x} = f(x, h(x^* - x)). \tag{18}$$

For every value of $x^*$, system (18) has an equilibrium point $\tilde{x}(x^*)$:

$$f(\tilde{x}(x^*), h(x^* - \tilde{x}(x^*))) = 0. \tag{19}$$

This equilibrium point is stable in the first approximation:

$$A = \frac{\partial f(\tilde{x}(x^*), h(x^* - \tilde{x}(x^*)))}{\partial \tilde{x}(x^*)}, \tag{20}$$

$$\det(A - \lambda E) = (-1)^n (\lambda^n + a_{n-1}\lambda^{n-1} + \dots + a_1 \lambda + a_0) = (-1)^n \prod_{j=1}^{n} (\lambda - \lambda_j) = 0, \tag{21}$$

$$\lambda_j = \alpha_j + i\beta_j, \quad j = 1, \dots, n, \tag{22}$$

$$\alpha_j < 0, \quad j = 1, \dots, n, \tag{23}$$

where $i = \sqrt{-1}$ and $E = \operatorname{diag}(\underbrace{1, \dots, 1}_{n})$.
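Conditions (20)–(23) can be checked numerically once a candidate control function is available. The following is a minimal sketch on an assumed toy system, not the robot model of the paper: a double integrator $\dot{x}_1 = x_2$, $\dot{x}_2 = u$ with a hypothetical linear feedback $h$ and gains `K1`, `K2`. The Jacobian (20) is approximated by central differences and the eigenvalues of the 2×2 matrix are computed from its trace and determinant.

```python
import cmath

K1, K2 = 2.0, 3.0  # assumed feedback gains, for illustration only


def closed_loop_rhs(x, x_star):
    """f(x, h(x* - x)) for the toy double integrator with linear feedback h."""
    u = K1 * (x_star[0] - x[0]) + K2 * (x_star[1] - x[1])
    return [x[1], u]


def jacobian(rhs, x, x_star, eps=1e-6):
    """Central-difference approximation of A = df/dx at the point x."""
    n = len(x)
    columns = []
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        f_plus, f_minus = rhs(x_plus, x_star), rhs(x_minus, x_star)
        columns.append([(f_plus[i] - f_minus[i]) / (2 * eps) for i in range(n)])
    # columns[j][i] holds d f_i / d x_j, so transpose to get A[i][j]
    return [[columns[j][i] for j in range(n)] for i in range(n)]


def eigenvalues_2x2(a):
    """Roots of det(A - lambda E) = 0 for a 2x2 matrix."""
    trace = a[0][0] + a[1][1]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    disc = cmath.sqrt(trace * trace - 4 * det)
    return [(trace + disc) / 2, (trace - disc) / 2]


def is_stable(eigs):
    """Condition (23): every eigenvalue has a negative real part."""
    return all(lam.real < 0 for lam in eigs)


if __name__ == "__main__":
    x_star = [1.0, 0.0]
    x_eq = [1.0, 0.0]  # equilibrium of the toy system for this x*
    eigs = eigenvalues_2x2(jacobian(closed_loop_rhs, x_eq, x_star))
    print(eigs, is_stable(eigs))
```

For a nonlinear $h$ found by symbolic regression, the same check would be repeated for each value of $x^*$ of interest, after first solving (19) for the equilibrium point.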
In the second stage the optimal control problem is solved for the model of the control object (18), with $x^*$ as the control. It is necessary to find a control function

$$x^* = w(t) \in X_0 \subseteq \mathbb{R}^n \tag{24}$$

in order to reach the terminal state (3) with the optimal value of the quality criterion

$$J_1 = \int_0^{t_f} f_0(x, h(x^* - x))\,dt \to \min_{x^* \in X_0}. \tag{25}$$
Theorem 2. If for the optimal control problem (1)–(5) a control function (17) is found that satisfies conditions (20)–(23), then a function of time (24) can always be found that can be implemented on the real object directly.

Proof. Consider the system

$$\dot{x} = f(x, h(w(t) - x)). \tag{26}$$

The particular solution $x(t, x^0)$ of this system reaches the terminal state (3) with the optimal value of the quality criterion (25). Let $x(t, y)$ be another particular solution of system (26). Then at some moment of time $t'$

$$s(t') = \|x(t', x^0) - x(t', y)\|. \tag{27}$$

According to conditions (20)–(23), at the moment $t'$ system (26) has a stable equilibrium point. Therefore $\forall \delta > 0$

$$s(t' + \delta) \le s(t'). \tag{28}$$

This means that the error of model (26) does not increase in time and, according to Definition 1, system (26) has the feasibility property. Therefore the control function (24) has the feasibility property too, which was to be proved.

3.1 Maximum Principle for Synthesized Control
For the optimal control problem (26), (2), (3), (25) the Hamiltonian can be constructed

$$H(\psi, x, x^*) = -f_0(x, h(x^* - x)) + \psi^T f(x, h(x^* - x)), \tag{29}$$

where $\psi$ is the vector of costate variables

$$\frac{d\psi_i}{dt} = -\frac{\partial H(\psi, x, x^*)}{\partial x_i}, \quad i = 1, \dots, n. \tag{30}$$

According to the maximum principle, the optimal solution satisfies the condition

$$x^* = \arg\max H(\psi, x, x^*). \tag{31}$$
4 The Optimal Control Problem with Bottleneck Phase Constraints
In the optimal control problem, the mathematical model of the control object (1), the initial condition (2), the terminal condition (3), (4) and the quality criterion (5) are given. While moving from the initial state to the terminal one, the control object has to pass through a set number of bottlenecks. The order in which the bottlenecks are passed is not prescribed; only their number is known. The object has to pass all bottlenecks before it reaches the terminal condition. These bottleneck phase constraints are described by the inequalities

$$\varphi_i(x) \le 0, \quad i = 1, \dots, M, \tag{32}$$

where

$$\varphi_i(x(t)) = \begin{cases} 0, & \text{if } \Delta_i(x(t)) \le r_i, \\ \Delta_i(x(t)), & \text{otherwise}, \end{cases} \quad i = 1, \dots, M, \tag{33}$$

$$\Delta_i(x(t)) = \min_t \|x - y_i\|, \quad i = 1, \dots, M, \tag{34}$$

$r_i$ are given small positive values that determine the sizes of the bottlenecks, $y_i$ is the vector of coordinates that determines the location of bottleneck $i$ in the state space, $y_i \in \mathbb{R}^n$, $i = 1, \dots, M$, and $M$ is the number of bottlenecks.
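On a sampled trajectory, the constraint functions (33) and (34) reduce to a minimum over the stored points. The sketch below is an illustration under that assumption; the function names and the representation of a trajectory as a list of points are hypothetical, not taken from the paper.

```python
import math


def min_distance(trajectory, center):
    """Discrete version of (34): smallest distance from the sampled trajectory to y_i."""
    return min(math.dist(point, center) for point in trajectory)


def bottleneck_violation(trajectory, center, radius):
    """Eq. (33): zero if the trajectory passes within `radius`, else the miss distance."""
    d = min_distance(trajectory, center)
    return 0.0 if d <= radius else d


def total_violation(trajectory, centers, radii):
    """Sum of the violations over all bottlenecks, usable as a penalty term."""
    return sum(
        bottleneck_violation(trajectory, c, r) for c, r in zip(centers, radii)
    )


if __name__ == "__main__":
    path = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    print(total_violation(path, [(1.0, 1.0), (2.0, 0.0)], [0.1, 0.1]))
```

Because the minimum is taken over the whole trajectory, the order in which the bottlenecks are visited does not matter, which matches the problem statement above. The result depends on the sampling step: a coarse trajectory can miss a narrow bottleneck that the continuous trajectory passes through.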
5 An Example
Consider an example of solving the optimal control problem with bottleneck phase constraints. The mathematical model of the control object is

$$\begin{aligned} \dot{x}_1 &= 0.5(u_1 + u_2)\cos(x_3), \\ \dot{x}_2 &= 0.5(u_1 + u_2)\sin(x_3), \\ \dot{x}_3 &= 0.5(u_1 - u_2), \end{aligned} \tag{35}$$

where

$$-10 \le u_i \le 10, \quad i = 1, 2. \tag{36}$$

The initial condition is given

$$x(0) = x^0 = [0 \;\; 0 \;\; 0]^T. \tag{37}$$

The terminal condition is given

$$x(t_f) = x^f = [10 \;\; 10 \;\; 0]^T, \tag{38}$$

where $t_f$ is determined by Eq. (4) with $t^+ = 4.8$, $\varepsilon_0 = 0.01$.
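The model (35) with the control bounds (36) and the stopping rule (4) can be simulated with a few lines of code. The sketch below assumes explicit Euler integration with a hypothetical step `dt`; the paper does not state which integration scheme or step was used. The `control` argument is any function of time and state, so both a program control and a feedback control fit.

```python
import math

T_PLUS = 4.8             # t+ in Eq. (4)
EPS0 = 0.01              # epsilon_0 in Eq. (4)
X_F = (10.0, 10.0, 0.0)  # terminal state (38)


def clamp(u, lo=-10.0, hi=10.0):
    """Control bounds (36)."""
    return max(lo, min(hi, u))


def robot_rhs(x, u):
    """Right-hand side of model (35) with saturated controls."""
    u1, u2 = clamp(u[0]), clamp(u[1])
    v = 0.5 * (u1 + u2)
    return (v * math.cos(x[2]), v * math.sin(x[2]), 0.5 * (u1 - u2))


def simulate_robot(control, x0=(0.0, 0.0, 0.0), dt=0.001):
    """Euler integration (an assumption); returns (t_f, x(t_f)) following Eq. (4)."""
    x, t = tuple(x0), 0.0
    while t < T_PLUS:
        if math.dist(x, X_F) <= EPS0:
            return t, x  # terminal condition reached before t+
        dx = robot_rhs(x, control(t, x))
        x = tuple(xi + dt * dxi for xi, dxi in zip(x, dx))
        t += dt
    return T_PLUS, x  # time limit reached


if __name__ == "__main__":
    # Full speed straight ahead: the robot never reaches x^f, so t_f = t+.
    print(simulate_robot(lambda t, x: (10.0, 10.0)))
```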
The quality criterion is given

$$J_2 = t_f + p_1 \sum_{i=1}^{M} \varphi_i(x(t)) \to \min_{u \in U}, \tag{39}$$

where $p_1$ is a weight coefficient and $\varphi_i(x)$ is determined by Eq. (33) with

$$\Delta_i(x(t)) = \min_t \sqrt{(x_1(t) - x_{1,i})^2 + (x_2(t) - x_{2,i})^2}, \tag{40}$$
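$x_{1,i}$, $x_{2,i}$ are the coordinates of the bottleneck centres, $i = 1, \dots, M$, $\varepsilon_i$ is the diameter of the necks, and $M$ is the number of bottlenecks. In our example $M = 4$, $\varepsilon_i = 0.1$, $i = 1, \dots, 4$, $x_{1,1} = 2$, $x_{2,1} = 5$, $x_{1,2} = 5$, $x_{2,2} = 2$, $x_{1,3} = 8$, $x_{2,3} = 2$, $x_{1,4} = 2$, $x_{2,4} = 8$, $p_1 = 4$.

In the first stage, symbolic regression can be used for the control synthesis problem. We use the solution from [7]:

$$u_i = h_i(x^* - x) = \begin{cases} u_i^+, & \text{if } \tilde{u}_i > u_i^+, \\ u_i^-, & \text{if } \tilde{u}_i < u_i^-, \\ \tilde{u}_i, & \text{otherwise}, \end{cases} \quad i = 1, 2, \tag{41}$$

where

$$\begin{aligned}
\tilde{u}_1 &= \operatorname{sgn}(q_3\Delta_3)\exp(-|q_3\Delta_3|) + a^{-1} + \sqrt[3]{a} + \operatorname{sgn}(\Delta_3) + \mu(b), \\
\tilde{u}_2 &= \tilde{u}_1 + \sin(\tilde{u}_1) + \arctan(h) + \mu(b) + c - c^3, \\
a &= \tanh(d) + b + \sqrt[3]{\Delta_1} + c + \sin(q_3\Delta_3), \\
b &= g + \operatorname{sgn}(\operatorname{sgn}(\Delta_1) q_2\Delta_2)\exp(-|\operatorname{sgn}(\Delta_1) q_2\Delta_2|) + \sin(\Delta_1) + \tanh(g) + \Delta_1, \\
c &= g + \operatorname{sgn}(\operatorname{sgn}(\Delta_1) q_2\Delta_2)\exp(-|\operatorname{sgn}(\Delta_1) q_2\Delta_2|) + \sin(\Delta_1), \\
d &= h + c - c^3 + \operatorname{sgn}(q_1\Delta_1) + \arctan(q_1) + \vartheta(\Delta_3), \\
g &= \operatorname{sgn}(\Delta_1) q_2\Delta_2 + q_3\Delta_3 + \tanh(q_1\Delta_1), \\
h &= \arctan(q_1\Delta_1) + \operatorname{sgn}(w)\sqrt{|w|} + w + v + 2\operatorname{sgn}(w + \tanh(v)) + \sqrt[3]{w + \tanh(v)} \\
  &\quad + \sqrt[3]{\Delta_1} + \operatorname{sgn}(\Delta_1)\sqrt{|\Delta_1|} + \sqrt[3]{\Delta_1} + \tanh(v), \\
w &= \operatorname{sgn}(\Delta_1) + \operatorname{sgn}(q_2\Delta_2)\operatorname{sgn}(\Delta_1)\tanh(\Delta_1), \\
v &= q_3\Delta_3 + \operatorname{sgn}(\Delta_1) q_2\Delta_2 + \tanh(\Delta_1), \\
\mu(\alpha) &= \operatorname{sgn}(\alpha)\min\{1, |\alpha|\}, \qquad \tanh(\alpha) = \frac{1 - \exp(-2\alpha)}{1 + \exp(-2\alpha)},
\end{aligned} \tag{42}$$

$$\Delta_1 = x_1^* - x_1, \quad \Delta_2 = x_2^* - x_2, \quad \Delta_3 = x_3^* - x_3, \tag{43}$$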
$q_1 = 14.7288$, $q_2 = 2.0271$, $q_3 = 4.0222$.

In the second stage the optimal control problem is solved with the new control vector $x^*$. To solve the optimal control problem with bottleneck phase constraints by a direct approach, the terminal condition is added to the functional (39):

$$J_3 = t_f + p_1 \sum_{i=1}^{M} \varphi_i(x(t)) + p_2 \|x^f - x(t_f)\| \to \min_{x^* \in X_0}, \tag{44}$$
where $p_2$ is a weight coefficient, $p_2 = 4$. For the direct approach a time interval $\Delta t = 0.4$ s is introduced. The control function is sought in the form

$$x_i^*(t) = q_{i + (k-1)n}, \quad (k-1)\Delta t \le t < k\Delta t, \quad i = 1, \dots, n, \quad k = 1, \dots, K, \tag{45}$$

where $K$ is the number of intervals,

$$K = \frac{t^+}{\Delta t} = \frac{4.8}{0.4} = 12, \tag{46}$$

and $q_{i + (k-1)n}$ is a component of the constant vector

$$q = [q_1 \dots q_{Kn}]^T. \tag{47}$$

The particle swarm optimization (PSO) algorithm [5,8] was used to search for the parameter vector (47). As a result, the following optimal solution was obtained:

$$\begin{aligned} q = [\,& 5.8117 \;\; 1.5506 \;\; 0.0224 \;\; 6.8956 \;\; 2.5304 \;\; {-0.1016} \;\; 0.8679 \;\; 4.9312 \;\; 0.0937 \\ & 0.4934 \;\; 5.1583 \;\; {-0.2151} \;\; 6.0216 \;\; 8.5573 \;\; {-0.0807} \;\; 7.0616 \;\; 8.0004 \;\; 0.0925 \\ & 10.2021 \;\; 4.1811 \;\; 0.1259 \;\; 9.1453 \;\; 4.4015 \;\; 0.1261 \;\; 13.7554 \;\; 9.1819 \;\; {-0.1291} \\ & 14.4733 \;\; 10.2643 \;\; {-0.0425} \;\; 9.9206 \;\; 11.0372 \;\; {-0.0659} \;\; 9.9788 \;\; 10.3128 \;\; {-0.0198}\,]^T. \end{aligned} \tag{48}$$
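The parameterization (45)–(47) maps a time instant to one block of three components of the vector $q$. A minimal sketch of this lookup, using the values of (48), is given below. The function name `x_star` and the small tolerance that guards against floating-point error at interval boundaries are choices made here, not part of the paper.

```python
# Parameter vector (48): K = 12 intervals, n = 3 components per interval.
Q = [
    5.8117, 1.5506, 0.0224, 6.8956, 2.5304, -0.1016,
    0.8679, 4.9312, 0.0937, 0.4934, 5.1583, -0.2151,
    6.0216, 8.5573, -0.0807, 7.0616, 8.0004, 0.0925,
    10.2021, 4.1811, 0.1259, 9.1453, 4.4015, 0.1261,
    13.7554, 9.1819, -0.1291, 14.4733, 10.2643, -0.0425,
    9.9206, 11.0372, -0.0659, 9.9788, 10.3128, -0.0198,
]
N = 3     # dimension of x*
DT = 0.4  # interval length in seconds


def x_star(t, q=Q, n=N, dt=DT):
    """Piecewise-constant control (45): return the block of q active at time t."""
    k = int(t / dt + 1e-9)         # zero-based interval index, i.e. k - 1 in (45)
    k = min(k, len(q) // n - 1)    # keep t = t+ inside the last interval
    return q[k * n:(k + 1) * n]


if __name__ == "__main__":
    for t in (0.0, 0.5, 1.2, 4.79):
        print(t, x_star(t))
```

Each returned triple is the equilibrium point that the stabilization system is steering toward during that interval.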
Fig. 1 shows the optimal trajectory on the horizontal plane $\{x_1, x_2\}$. The small circles are the bottlenecks, and the black squares are the projections of the optimal control vector $x^*$ onto the horizontal plane. It can be seen from Fig. 1 that the control object reached the terminal condition (3) and passed all bottlenecks on the way. The optimal value of the functional (44) was $J_3 = 4.1445$. For the stabilization system (41), the components of the control vector $x^*$ are the coordinates of stable equilibrium points. If an equilibrium point is placed in a bottleneck, the object slows down as it approaches the equilibrium point. Therefore the equilibrium points are located near the bottlenecks but do not coincide with them. To check the feasibility property of the optimal solution, we compare the solution obtained by the synthesized optimal control method with a direct solution of the same optimal control problem.
Fig. 1. Projection of trajectory on horizontal plane for synthesized optimal control
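For the direct numerical solution of the optimal control problem (44)–(53), we use a piecewise linear approximation of the control function

$$u_i = \tilde{q}_{j + (i-1)\tilde{K}} + \frac{\tilde{q}_{j + (i-1)\tilde{K} + 1} - \tilde{q}_{j + (i-1)\tilde{K}}}{\tilde{\Delta} t}\,(t - j\tilde{\Delta} t), \tag{49}$$

where $i = 1, 2$, $j\tilde{\Delta} t \le t < (j+1)\tilde{\Delta} t$, $j = 1, \dots, \tilde{K}$, $\tilde{\Delta} t$ is the time interval for the direct solution, $\tilde{\Delta} t = 0.2$, and $\tilde{K}$ is the number of intervals,

$$\tilde{K} = \frac{t^+}{\tilde{\Delta} t} = \frac{4.8}{0.2} = 24, \tag{50}$$

$$\tilde{q} = [\tilde{q}_1 \dots \tilde{q}_{\tilde{K}m}]^T. \tag{51}$$

The same PSO algorithm is used for the direct solution of the optimal control problem. It found the following solution:

$$\begin{aligned} \tilde{q} = [\,& 3.1380 \;\; 18.6514 \;\; 12.0526 \;\; 12.3363 \;\; 5.9403 \;\; 3.0994 \;\; {-0.4734} \\ & 1.1257 \;\; 17.3474 \;\; 11.7941 \;\; 15.5363 \;\; 3.4148 \;\; 12.3872 \;\; 4.8053 \\ & {-0.9897} \;\; 0.1898 \;\; 9.2312 \;\; 0.6822 \;\; {-0.8845} \;\; 13.4516 \;\; 6.5511 \\ & 13.7926 \;\; 18.2337 \;\; 0.9019 \;\; 11.3110 \;\; 14.5370 \;\; {-6.1641} \;\; 19.9382 \;\; 1.2918 \\ & 7.8812 \;\; 5.8397 \;\; 2.5193 \;\; 4.1420 \;\; {-4.4246} \;\; {-1.9570} \;\; 11.1309 \\ & 0.9642 \;\; 3.4756 \;\; 18.6883 \;\; 3.3201 \;\; 15.4159 \;\; 9.5764 \;\; 12.4498 \;\; 8.7645 \\ & 19.0028 \;\; 9.0285 \;\; {-3.2251} \;\; 16.1549\,]^T. \end{aligned} \tag{52}$$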
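Both parameter vectors above were found with PSO. The sketch below shows a generic global-best PSO with an inertia weight, written only to illustrate the idea. The variant, swarm size, coefficients and bound handling used in the paper are not stated in this section, so every default value here is an assumption, and the sphere function stands in for the functional (44).

```python
import random


def pso(objective, dim, bounds, n_particles=30, n_iters=200,
        inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Minimize `objective` over a box; returns (best_position, best_value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # personal best positions
    best_val = [objective(p) for p in pos]       # personal best values
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # global best so far

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and keep it inside the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val


if __name__ == "__main__":
    sphere = lambda p: sum(v * v for v in p)  # stand-in objective
    print(pso(sphere, dim=3, bounds=(-5.0, 5.0)))
```

To apply it to the problem above, `objective` would simulate the model for a candidate parameter vector and return the value of the functional (44).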
The optimal trajectory of the control object on the horizontal plane is presented in Fig. 2. The value of the functional (44) for the optimal solution found is $J_3 = 4.81072$.

Fig. 2. Projection of trajectory on the horizontal plane for direct optimal control

To check the feasibility property, it is now necessary to consider the influence of small perturbations of the model on the value of the functional:

$$\begin{aligned} \dot{x}_1 &= 0.5(u_1 + u_2)\cos(x_3) + \beta\xi(t), \\ \dot{x}_2 &= 0.5(u_1 + u_2)\sin(x_3) + \beta\xi(t), \\ \dot{x}_3 &= 0.5(u_1 - u_2) + \beta\xi(t), \end{aligned} \tag{53}$$

where $\xi$ is a random function that returns a random value from $-1$ to $1$ at every call and $\beta$ is a given constant. We also introduce perturbations of the initial condition

$$x_i(0) = x_i^0 + \beta_0\xi(t), \quad i = 1, \dots, n, \tag{54}$$
where $\beta_0$ is a given constant. The results of the experiments are presented in Table 1. In Table 1, $\tilde{J}_3$ is the average value of the functional (44) over ten experiments for the direct optimal control (52), $\sigma(\tilde{J}_3)$ is the standard deviation of $\tilde{J}_3$, $\bar{J}_3$ is the average value of the functional (44) over ten experiments for the synthesized optimal control (48), and $\sigma(\bar{J}_3)$ is the standard deviation of this value. The results in Table 1 show that both types of solution are about equally sensitive to perturbations of the mathematical model, but the synthesized optimal control is practically insensitive to perturbations of the initial conditions, whereas the direct optimal control depends very strongly on them. For the perturbation level $\beta_0 = 0.1$, the average value of the functional for the direct optimal control more than doubles (from 4.81072 to 11.0234), while the average value for the synthesized optimal control increases by no more than 8%.

Table 1. Functional value for perturbed mathematical model

$\beta_0$   $\beta$   $\tilde{J}_3$   $\sigma(\tilde{J}_3)$   $\bar{J}_3$   $\sigma(\bar{J}_3)$
0           0.01      4.8489          0.04157                 4.4267        0.4533
0           0.02      4.9145          0.06047                 4.7955        0.4477
0           0.05      4.9704          0.0855                  4.9745        0.2972
0.01        0         5.1113          0.2863                  4.1422        0.0078
0.01        0.01      5.3405          0.3933                  4.2352        0.2854
0           0.1       5.2961          0.5796                  5.0164        0.3167
0.05        0         6.7005          1.2266                  4.1447        0.0007
0.1         0         11.0234         5.4735                  4.4467        0.4691
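The perturbation experiment can be sketched as follows. The helper names are hypothetical, and two details are assumptions because the text does not fix them: $\xi$ is drawn from a uniform distribution on $[-1, 1]$, and the standard deviation is the sample standard deviation. The last helper reproduces the arithmetic behind the comparison of perturbed and unperturbed functional values.

```python
import math
import random
import statistics


def perturbed_rhs(x, u, beta, rng):
    """Right-hand side of model (53); xi is drawn anew for every component."""
    v = 0.5 * (u[0] + u[1])
    return (
        v * math.cos(x[2]) + beta * rng.uniform(-1.0, 1.0),
        v * math.sin(x[2]) + beta * rng.uniform(-1.0, 1.0),
        0.5 * (u[0] - u[1]) + beta * rng.uniform(-1.0, 1.0),
    )


def perturbed_initial_state(x0, beta0, rng):
    """Perturbed initial condition (54)."""
    return tuple(x0_i + beta0 * rng.uniform(-1.0, 1.0) for x0_i in x0)


def summarize(values):
    """Mean and sample standard deviation over repeated experiments."""
    return statistics.mean(values), statistics.stdev(values)


def relative_change(perturbed, nominal):
    """Relative increase of the functional caused by the perturbation."""
    return (perturbed - nominal) / nominal


if __name__ == "__main__":
    rng = random.Random(1)
    print(perturbed_initial_state((0.0, 0.0, 0.0), 0.1, rng))
    # Values for beta_0 = 0.1, beta = 0 against the unperturbed functionals.
    print(relative_change(11.0234, 4.81072))  # direct optimal control
    print(relative_change(4.4467, 4.1445))    # synthesized optimal control
```

With the values from Table 1, the relative increase is about 1.29 for the direct control (a factor of about 2.3) and about 0.073 for the synthesized control.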
6 Conclusions
This work continues the study of synthesized optimal control for solving the optimal control problem. A new, refined formulation of the optimal control problem is presented. For a solution of the optimal control problem to be applicable directly to a real object, the optimal solution must have the attractor property. A theorem is proved that, if the optimal solution has the attractor property, then the error caused by small perturbations of the mathematical model does not increase in time. This means that such a solution can be used directly on a real object. It is shown that synthesized optimal control solves the refined optimal control problem. An example of solving the optimal control problem with complex phase constraints in the form of bottlenecks is presented. It is shown experimentally that the synthesized optimal control is less sensitive to uncertainties of the mathematical model of the control object than the direct optimal control. In particular, the synthesized optimal control is practically insensitive to perturbations of the initial condition.
References

1. Pontryagin, L.S., Boltyanskii, V.G., Gamkrelidze, R.V., Mishchenko, E.F.: The Mathematical Theory of Optimal Process. Gordon and Breach Science Publishers, New York, London, Paris, Montreux, Tokyo (1985). Pontryagin, L.S.: Selected works, vol. 4, p. 360
2. Avendaño-Juárez, J.L., Hernández-Guzmán, V.M., Silva-Ortigoza, R.: Velocity and current inner loops in a wheeled mobile robot. Adv. Robot. 24(8–9), 1385–1404 (2010)
3. Kolesnikov, A.A.: Introduction of synergetic control. In: 2014 American Control Conference (ACC), Portland, Oregon, USA (2014)
4. Haken, H.: Synergetics. Introduction and Advanced Topics. Springer, Cham (2004). https://doi.org/10.1007/978-3-662-10184-1
5. Diveev, A.: Numerical method of synthesized control for solution of the optimal control problem. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) SAI 2020. AISC, vol. 1228, pp. 137–156. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52249-0_10
6. Diveev, A., Shmalko, E.Yu., Serebrenny, V., Zentay, P.: Fundamentals of synthesized optimal control. Mathematics 9, 21 (2021). https://doi.org/10.3390/math9010021
7. Diveev, A., Shmalko, E.Yu.: Control synthesis as machine learning control by symbolic regression methods. Appl. Sci. 11, 5468 (2021). https://doi.org/10.3390/app11125468
8. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of IEEE International Conference on Neural Networks, Perth, Australia, IV, pp. 1942–1948 (1995). https://doi.org/10.1109/ICNN.1995.488968
Influences of Coating and Spandex Compositions of Conductive Textiles Used as Strain Sensors Using an Automated Test System

Stefan Wohlrab, Phillip Petz, Florian Eibensteiner, and Josef Langer

Embedded Systems Lab, University of Applied Sciences Upper Austria, Softwarepark 11, 4232 Hagenberg, Austria
{stefan.wohlrab,phillip.petz,josef.langer}@fh-hagenberg.at
https://www.fh-ooe.at/campus-hagenberg/
Abstract. The combination of fabrics with unique electrical properties and miniaturized electronics makes it possible to integrate textile sensors into wearable clothing. The applications are diverse and range from sweat detection and the measurement of vital parameters via textile electrodes to measuring the applied pressure and tension on the textile. In this paper, we present a procedure for the characterization and evaluation of conductive textiles as strain sensors. We compare the behaviour of commercially available conductive textiles with our own solutions based on jersey and screen printed conductive particles. The coating composition and its effect on the resistance behaviour under tensile load are evaluated by long-term tests and defined load tests on a custom test bench. The measurements show that the number of conductive particles in the coating influences the resistance of the unstretched patch. No correlation was found between the spandex content and the characteristics of the textiles.

Keywords: Textile sensors · Smart textile · Sensor characteristics · Wearable sensor

1 Introduction
Textile sensors are fabrics which are knitted or woven from yarns. A yarn, for its part, consists of individual fibers. There are various ways of producing a sensor from the fabric. Either the fibers or the yarn can have properties dependent on the environment. Combining conventional threads with conductive yarn in the weaving or stitching process can also result in sensor behaviour of the finished fabric. In addition, fabrics can be embroidered, printed or coated with a conductive material [7]. As [5] shows, even a commercial 3D printer can be used to print on the textiles to bond conductive filament to the fabric.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 306–321, 2022. https://doi.org/10.1007/978-3-031-10464-0_20
The textile sensor patches used in this work, are coated with a mixture of water, graphite and binder. This process was chosen because the manufacturing of the sensors is cheap and simple. In [15], the tested sensors are sewn into a glove to record the position of the fingers. For this purpose, the patches are placed over the joints of the hand. If the finger is bent, the length of the sensors changes and thus their resistance. The change is recorded by measuring the voltage between the textile and the fixed resistance of a voltage divider. To determine the position of the finger, the minimum length of the sensor with the finger extended and the maximum length with the finger flexed are determined during the calibration phase. In the evaluation phase, the current sensor value is normalized using the boundary values and thus the position of the finger is determined. In order to record the position of the joints as accurately as possible, it is essential that the sensors have a linear relationship between extension length and resistance. In addition, the values should remain reproducible and stable over time. The mechanical and the consequent electrical behavior of textile sensors, however, is very complex due to the superposition of thread and fabric effects [8], mechanical properties of coated structures [3], as well as non-linearity in resistance to length variation and large relaxation times [14,16], which is why measurements of the developed samples on a uniform test rig is the preferred method for the evaluation and comparison of different material compositions and sensor geometries. The characteristics of textile sensors are evaluated either with reference products [13], testing equipment from materials engineering [1,4], specially developed [12] or modified [6] test systems. While devices used in materials science have very good accuracy, they are usually limited in their range of applications. 
In contrast, proprietary developments can be used more flexibly, but have lower performance. The material samples are created with different compositions of water, graphite and binder and are mounted in a self-developed test rig in order to obtain information about the response of different coatings in combination with different carrier substrates. This test rig is capable of subjecting the textile to predefined load cycles with varying velocity and forces.
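The calibration and normalization step described above for the glove application can be written in a few lines. The sketch below is only an illustration: the function name, the clipping to the range 0 to 1 and the example readings are assumptions, and it relies on the linear sensor behaviour that the text identifies as the requirement.

```python
def normalize_reading(value, extended, flexed):
    """Map a raw sensor reading to 0.0 (finger extended) .. 1.0 (finger flexed).

    `extended` and `flexed` are the boundary readings stored during calibration.
    """
    span = flexed - extended
    if span == 0:
        raise ValueError("calibration readings must differ")
    fraction = (value - extended) / span
    # Readings outside the calibrated range are clipped to the nearest bound.
    return min(1.0, max(0.0, fraction))


if __name__ == "__main__":
    # Hypothetical calibration: 1.0 V with the finger extended, 2.0 V when flexed.
    print(normalize_reading(1.5, extended=1.0, flexed=2.0))
```

The same function works whether the reading rises or falls with flexion, because the sign of the span cancels out.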
2 Related Work

2.1 Smart Textiles
In [2], in addition to the aging of textile sensors, it is also described why a change in resistance occurs when the coated fabrics are stretched. Two cases can be distinguished, depending on whether the resistance increases or decreases. If the resistance decreases when the textile is stretched, this is because the threads become thinner as a result of the stretching, which causes the conductive particles to move closer together, thus conducting better and lowering the resistance. In addition, the tensile stress allows the fibers to form better connections with each other. In contrast, increasing resistance when the material is stretched is due to conductive particles moving away from each other.
As the material gets longer, the conductive layer can crack and become brittle, causing the individual parts to also move away from each other.

2.2 Textile Test Systems
In [6], the properties of sensitivity, linearity, stability, and hysteresis of textile tensile sensors are investigated using a modified cargo strap testing machine. Strips with conductive coating and weaving of conductive threads are compared with each other as textile sensors. The test rig used tensions the samples with forces from 100 N to 2500 N and a speed of 50 mm/min. A voltage divider with a reference resistor is used for the resistance measurement. The sampling rate of the digital multimeter was 9 Hz.

The characterization load test setup in [4] consists of an Instron 5565A Load Frame and a Keysight 34465A Digital Multimeter. The textile sensors were tested in the load frame at a rate of 1 mm/min. For longer and faster load tests, a separate test set with electric motor, cam, spring and force gauge was developed. While the load frame can apply very accurate forces, the setup with electric motor is better suited for faster cyclic load tests at 30 cycles per minute.

2.3 Applications
Textiles that act as tension or pressure sensors have many applications. For example, in [9] and [11] pressure-sensitive socks were developed from conductive textiles and pressure sensors for shoe insoles were created from pressure-sensitive films. Movement patterns and individual gait sequences were detected and monitored by continuously measuring the pressure load and pressure distribution on the underside of the feet. This allows independent use in sports, medicine and rehabilitation after injuries. Another field of application is the use of textile sensors as traction sensors. As shown in [10], these are placed over the joints and can thus detect the flexion of fingers. The textile sensor EeonTex Conductive Stretchable Fabric LTT-SLPA from Eeonyx is used here. It has a resistance of 20 kΩ per square inch. Measurements showed that the strip changes its resistance from 450 kΩ to 350 kΩ when stretched from 3 cm to 4.5 cm, making it a good strain sensor. Furthermore, it was found that the resistance varies by up to 50% due to dermal contact, which is why the sensor must be isolated from the skin. Three patches are installed per finger, which are connected to the electronics with conductive threads. Together with a fixed resistor, the textile sensor forms a voltage divider. If the resistance of the sensor decreases due to stretching, the voltage at the voltage divider also decreases. The voltage is measured with an ADC at a sampling rate of 50 Hz. With this setup it is possible to detect three predefined hand gestures with a reliability of up to 97.8%.
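The figures quoted from [10] can be turned into a relative resistance change and a strain. The sketch below does this arithmetic. The ratio of the two, commonly called the gauge factor, is a term introduced here for illustration and is not used in the text.

```python
def relative_change(start, end):
    """Relative change of a quantity between two states."""
    return (end - start) / start


def gauge_factor(r_start, r_end, l_start, l_end):
    """Ratio of relative resistance change to strain (assumed figure of merit)."""
    return relative_change(r_start, r_end) / relative_change(l_start, l_end)


if __name__ == "__main__":
    # 450 kOhm -> 350 kOhm while stretching from 3 cm to 4.5 cm, as reported in [10].
    print(relative_change(450e3, 350e3))         # relative resistance change
    print(relative_change(3.0, 4.5))             # strain
    print(gauge_factor(450e3, 350e3, 3.0, 4.5))
```

For these numbers the resistance drops by about 22% at a strain of 50%, so the ratio is about −0.44. The negative sign reflects that this fabric becomes more conductive when stretched.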
3 Automated Testing System

3.1 Test System
We developed an automated test system to achieve reproducible measurements at predefined load forces. The individual sensor patches are mounted in the setup shown in Fig. 1 and automatically stretched to a predefined length. The resistance measurement during the stretching takes place every 50 ms, allowing the resistance curve and the stretching curve to be correlated. The left side of the test fixture is rigid and holds one end of the sensor. On the right side, the strip is clamped into the carriage. The carriage can be moved by the motor, which allows the sensor to be stretched. The motor is controlled by an Arduino Uno equipped with a CNC-Shield. The sensor strip is contacted on both sides and forms a voltage divider together with a second fixed resistor. By measuring the voltage value through an ADC measurement and the known resistance value, a calculation of the sensor resistance is possible. A force sensor is used to measure the tensile force acting on the belt in order to relate the characteristics of the sensors not only to the physical elongation and steps of the stepper motor. Since the force sensor is not calibrated, it does not provide absolute force values in newtons, but only serves to classify the values in relative terms.
Fig. 1. Test system for tensile and compression tests. Labeled components: conductive textile, mounting sled, force sensor, driving sled, stepper motor.
The sampling of the force sensor, the ADC measurement of the voltage and the conversion into the sensor resistance are performed by a microcontroller, an ESP32. It is also responsible for controlling the motor by sending the necessary commands to the Arduino. The carriage of the test setup can be moved at different speeds. The ESP32 receives the next position from a computer via a serial interface. On the computer, a Python application reads the predefined test patterns from a CSV file and sends the next position to the ESP32 at the given time. The resistance values determined by the ESP32 are transferred to the PC, where the Python application writes them to a database. Grafana can then be used for visualization during the tests. Afterwards, the measurement is downloaded and postprocessed using MATLAB. The complete data processing chain is shown in Fig. 2.
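The PC-side scheduling described above can be sketched as follows. This is a minimal sketch, not the application used by the authors: the CSV column names `time_s` and `position_mm` are assumptions, and the `send` callable stands in for whatever writes a position command to the serial interface. The clock and sleep functions are injectable so the logic can be tried without hardware.

```python
import csv
import io
import time


def load_pattern(csv_text):
    """Parse rows of (time_s, position_mm) from CSV text with a header line.

    The column names are hypothetical; the real file layout is not specified.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(float(row["time_s"]), float(row["position_mm"])) for row in reader]


def run_pattern(pattern, send, clock=time.monotonic, sleep=time.sleep):
    """Call send(position_mm) for each step once its scheduled time is reached."""
    start = clock()
    for t_s, position_mm in pattern:
        delay = t_s - (clock() - start)
        if delay > 0:
            sleep(delay)  # wait until the step is due
        send(position_mm)
```

For a dry run, `run_pattern(load_pattern(text), send=print)` prints each target position at its scheduled time. In a real setup, `send` would wrap the write call of a serial library.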
Fig. 2. Data processing chain. Blocks shown: PC (Python app, CSV file, database, Grafana, MATLAB); ESP32 (ADC measurement, resistance calculation, load cell readout, motor drive); interface electronics (voltage divider, load cell shield, Arduino + CNC shield); testing machine (textile sensor, load cell, motor).
3.2 Test Scenarios
The textile sensor strips are later sewn into garments, in which they are exposed to various movements and deformations. To simulate different usage scenarios, the test system executes different stretching patterns. To determine the characteristics of the sensor strips, they are stretched from a certain starting length to a predefined length, held there, and released again after a certain time. The speed of the stretching process and the duration of the individual phases can be varied. In addition, the strip can be brought either from an unstretched to a stretched state or from one stretching length to another. Many different test scenarios can be derived from this; the ones used are shown in Fig. 3. The measurements in [15] show that the mounting length of the patches does not play a significant role, since the resistance increases linearly with the sensor length.
Fig. 3. Stretching Patterns. (a) Ramp, (b) Stretching with Medium Speed, (c) Stretching with High Speed, (d) Stretching with Low Speed, (e) Frequent Stretch, (f) Stretching from Starting Point to a Certain Position, (g) A Stretch over a Longer Period of Time.
Therefore, a length of 100 mm is selected for the patch in the rest position. It is stretched by a maximum of 40%. The test pattern in Fig. 3a describes a staircase ramp: the patch is stretched progressively in steps of 2 mm until the maximum is reached. Afterwards, relaxation also takes place in 2 mm steps. Figures 3b, 3c and 3d describe stretching at different rates. Figure 3e tests immediate relaxation after elongation. In the tests in Fig. 3f, the patch is always stretched to a certain length starting from the rest position. Figure 3g describes stretching and then holding the extension for several minutes. To observe resistance drift when the patches are frequently stretched and relaxed, the tests are performed over several hours, and each test pattern is executed several times. The resistance curve shows how the value changes with frequent use and whether a trend in a particular direction can be detected.
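The staircase ramp of Fig. 3a can be generated programmatically. The following is a minimal sketch using the values from the text (100 mm rest length, 40% maximum strain, 2 mm steps); the function name and the choice to express targets as elongation relative to the rest position are assumptions for illustration.

```python
def staircase_ramp(rest_length_mm=100.0, max_strain=0.40, step_mm=2.0):
    """Return target elongations in mm: up in steps to the maximum, then back down."""
    max_stretch_mm = rest_length_mm * max_strain
    n_steps = int(round(max_stretch_mm / step_mm))
    up = [i * step_mm for i in range(n_steps + 1)]  # 0, 2, ..., maximum
    down = up[-2::-1]                               # maximum - 2, ..., 0
    return up + down
```

With the default values the pattern contains 41 target positions, rising from 0 mm to 40 mm and returning to 0 mm. Hold times per step would be added when the pattern is written to the test file.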
4 Test Samples
4.1 Patches
Several sensor patches with different properties are available for testing. Figure 4 shows the sensors, and Table 1 lists their most important properties.
Fig. 4. Patches 1 to 8 from Left to Right. Patches 4 to 8 are glued to a substrate.
All sensor patches consist of a jersey fabric coated with a mixture of water, fixative and conductive particles. The composition of the coating varies between the sensors. Patches 1 to 3 are made of the same core material; they differ only in the amount of conductive particles in the coating. This addresses the question of how the different recipes affect the characteristics of the sensor. The assumption to be checked is whether the number of conductive particles has an influence on the resistance range of the sensor. Patches 1, 2 and 3 were also characterized in [15], where they are numbered 1, 3 and 6. They are measured again here so that they can be compared with the other sensors.

Table 1. Characteristic values of textile sensor patches. The numbering corresponds to the textiles shown in Fig. 4, numbered 1 to 8 from left to right.

| Patch # | Mass in g/m² | Spandex in % | Mesh orientation | Coating quantity of conductive particles |
|---|---|---|---|---|
| 1 | 110 | 10 | 90° to pulling direction | Low |
| 2 | 110 | 10 | 90° to pulling direction | Medium |
| 3 | 110 | 10 | 90° to pulling direction | High |
| 4 | 110 | 10 | 90° to pulling direction | Low |
| 5 | 180 | 22 | 90° to pulling direction | Low-medium |
| 6 | 180 | 22 | Pulling direction | Low-medium |
| 7 | 190 | 40 | 90° to pulling direction | Low-medium |
| 8 | 190 | 40 | Pulling direction | Low-medium |

Furthermore, patches 5 and 7 were manufactured with the same coating, but their substrate materials differ in spandex content. This tests the assumption that the spandex content of the material influences the decay behavior of the resistance value after a movement. The idea is that the higher the spandex content, the faster the textile returns to its initial position and the shorter the decay phase. A higher spandex content may therefore lead to a better transient response. Patches 6 and 8 are used to test whether it makes a difference if the meshes of the sensor strip run in the longitudinal direction or transverse to the strain.
4.2 EeonTex
In order to compare the custom-made sensor patches with a commercially available textile sensor, the EeonTex Conductive Stretchable Fabric was also measured with the test system. The strips shown in Fig. 5 consist of 72% nylon and 28% spandex. For comparison with patches 1 to 8, the EeonTex sensors are labeled E1 and E2. E1 has the meshes in the direction of elongation, E2 transverse to the direction of tension.
4.3 Ideal Behavior
For a strain sensor, it is important that its resistance increases as linearly as possible with the strain and decreases again in the same ratio. This also means that the resistance values are reproducible for a given strain length. In addition, the resistance value should be independent of time, aging and previous strains. Figure 6 shows the ideal case of the resistance value at elongation.
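The ideal case can be written as a linear relation between strain and resistance. The following is a sketch under assumptions: $R_0$ denotes the resistance at rest, $L_0$ the rest length, and $k$ a constant sensitivity (often called the gauge factor); none of these symbols or the example values appear in the measurements reported here.

```latex
\begin{align*}
\varepsilon &= \frac{\Delta L}{L_0}, &
R(\varepsilon) &= R_0 \, (1 + k \, \varepsilon), &
\varepsilon &= \frac{1}{k} \left( \frac{R}{R_0} - 1 \right)
\end{align*}
% Illustrative values (assumed): R_0 = 30 kOhm, k = 2, epsilon = 0.4
% R = 30 kOhm * (1 + 2 * 0.4) = 54 kOhm
```

Only if this relation holds independently of time and strain history can a single resistance reading be inverted to an elongation, which is the property the tests in the following section examine.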
Fig. 5. EeonTex conductive fabric, cut in warp (E1) and weft (E2).
Fig. 6. Ideal behavior of the textile sensor at elongation (resistance plotted over stretch).
5 Test Results
5.1 Patches
In [15] it was observed that patches 1, 2 and 3 behave very similarly. The same holds for patches 4 to 8. All patches show an overshoot during a movement, both during straining and during relief. In addition, after a movement there is always a decay phase in which the resistance value slowly approaches its final value. Therefore, after a deformation of the patch, a certain time is needed until a stable resistance value is reached. The biggest difference between the sensor strips is the resistance at rest and the resistance range in which they operate. Patch 3 has the lowest resistance in the unstretched state at 7 kΩ. Patch 8, on the other hand, already has a resistance greater than 2 MΩ in the resting state. On the one hand, this is due to the direction of the mesh (the strips with the meshes in the direction of tension have a significantly higher resistance); on the other hand, it is due to the coating of the fabric.

Fig. 7. Overshoot during loading and unloading operations of the material for patches 1 to 4 (a) and 5 to 8 (b). All patches have an increase in their normalised resistance during load. The dotted line shows a relative stretch of 40%. (Axes: time in s; normalised resistance; stretch in mm.)

The increased resistance in patches 4–8 is due to the fact that a recipe with a high water content was used for the coating. The proportion of conductive particles is low, which is why the resistance is higher than with other recipes. This can be seen well in patches 1, 2 and 3, since they differ only in the coating. The more conductive particles the coating contains, the lower the resistance of the sensor. Thus, the resistance of textile patch 1 is higher than that of patch 3. However, the ratio of water to graphite in the coating only influences the resistance range. It has little influence on the basic characteristics.

Furthermore, the resistance value can depend on the material used. Depending on which fibers are used, the conductive coating adheres differently. If more conductive particles can be absorbed, the resistance of the patch also decreases. It is suspected that the formulation adheres better to rough fibers, such as cotton, than to smooth or already coated fibers, e.g. spandex. After coating, stretching the strip breaks up the conductive surface as the material becomes longer. Depending on the breakage and location of the conductive particles, the resistance range will be different. Inconsistencies in the manufacturing and coating process also cannot be ruled out.

An analysis of the overshoot of the strips shows that it depends on the stretching speed. In most cases, the resistance value rises sharply even at a low tensile load, which means that further stretching does not increase the value much, since it is already high. There is also an overshoot when the strip is relaxed.
If the tensile force applied to the strip is reduced, the resistance increases for a short time. Only when the fabric no longer moves and the textile is no longer under tension does the resistance decrease. This behavior is undesirable, since the overshoot could be wrongly interpreted in the software as further stretching. It is therefore necessary to wait for the decay phase before determining the elongation of the strip. Since the overshoot is present during both a strain and a relaxation, it can be assumed that it occurs whenever the sensor is moving. The resistance value is therefore higher during a movement than when the sensor is at rest. An advantage compared to many other textile strain sensors is that the strips used here do not charge over time. After the resistance value has settled, it remains constant over time. An examination of sensor 4, which is pre-stretched on a carrier material, shows that applying the sensor to a carrier material does not significantly change its characteristics. During relaxation, the overshoot is somewhat more noticeable, but the decay phase is slightly shorter. It is noticeable that patches 5 and 7 have a significantly higher resistance and resistance range. While patch 2 is around 30 kΩ at rest and reaches around 50 kΩ when stretched, the resistance of patch 5 is over 100 kΩ in its relaxed state and greater than 600 kΩ under tension.
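Waiting for the decay phase before evaluating the resistance can be sketched as a simple settling check. This is an illustrative sketch, not the evaluation used in this work: the window length and the relative tolerance are assumptions, and the function name `settled_index` is hypothetical. At the 50 ms sampling interval of the test system, a window of 20 samples corresponds to one second.

```python
def settled_index(samples_ohm, window=20, rel_tol=0.01):
    """Return the index of the last sample of the first window whose spread
    (max - min) is within rel_tol of the window mean, or None if no window
    of the given length ever settles."""
    for end in range(window, len(samples_ohm) + 1):
        chunk = samples_ohm[end - window:end]
        mean = sum(chunk) / window
        if max(chunk) - min(chunk) <= rel_tol * mean:
            return end - 1
    return None
```

Readings before the returned index would be discarded as belonging to the overshoot or decay phase; only later samples would be used to estimate the elongation. A spread-based criterion is only one possible choice, and a slow drift below the tolerance would not be detected by it.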
The measurements show that there is a difference between the strips with meshes in the transverse direction and those with meshes in the longitudinal direction. The variant with the meshes in the transverse direction is much more suitable as a tensile sensor. These sensors have a lower resistance range, which makes the measurement less susceptible to interference. In addition, the resistance curve has lower noise and the overshoot is significantly smaller. The comparison between patch 5, with meshes in transverse orientation, and patch 6, with meshes in longitudinal direction, shows that strips with meshes at 90° to the tensile direction behave better than those with meshes in the longitudinal direction. In the stepped ramps for patch 6, the resistance increases even though the elongation decreases. If the longitudinal strip is loaded too much, the resistance decreases again. The saturation point of patch 6 is reached at an elongation of about 25%. A correlation between the spandex content of the sensor textile and the decay phase could not be proven. For example, comparing patch 5 with 20% spandex and patch 7 with 40%, no significant improvement in decay time can be observed. With a small increase in elongation, the resistance may decrease instead of increase. For this reason, it is not possible to convert the absolute resistance value to the stretched length. Of all the patches tested, sensor 5 exhibits the best behavior, which most closely matches the desired ideal case. The sensor has the least overshoot and a fast decay phase. The stepped ramp shows a change in resistance value for each change in length, although this change is larger at low strain. Nevertheless, due to overshoot, transient phase and inconsistent behavior, it is not possible to use the absolute resistance value to infer the change in length.
5.2 EeonTex
The EeonTex sensor shows a different behavior: it has a negative resistive coefficient, so the resistance drops when the textile is stretched. It is noticeable that the overshoot is significantly smaller than with the other patches and mainly occurs when the resistance rises to its initial value. The resistance lies between 500 kΩ and 1 MΩ and is thus still in a well measurable range. The operating range therefore spans 500 kΩ, and the resistance is constant and reproducible enough to make statements about strain. Like the other patches, the strip has a decay phase. After a movement, the resistance changes even though the strip is not moved. After the decay time of several minutes, the textile has only a slight drift. There are no significant differences in behavior between the two EeonTex strips, as seen in the measurements in Figs. 8 and 9. This can be explained by the fact that the fabric is constructed in plain weave and is thus identical in the longitudinal and transverse directions. A microscope image of the material can be seen in Fig. 10.
Fig. 8. Full test procedure with target stretch and change in resistance for both EeonTex patches. (Panels: stretch in mm, resistance of E1 and resistance of E2, each plotted over time in s.)
5.3 Comparison
No major differences were found between sensors 1, 2 and 3. These sensors had different amounts of conductive particles in their coating. Apart from the changed base resistance, the amount of conductive particles in the coating had no major influence on the described hysteresis or on the overshoot during fast movements.
Fig. 9. Overshoot during loading and unloading operations of the EeonTex patches and a decreased normalised resistance during load. The dotted line shows a relative stretch of 40%.
For sensors 4 to 8, the amount of water in the coating was increased. It was assumed that the lower viscosity would allow better absorption of the coating into the fabric structure and a more even distribution of the conductive particles on the carrier textile. Because of the increased water content, the paste contained comparatively fewer conductive particles, which is why the overall resistance increased. With the added elastic carrier material, sensor 4 showed a shortened decay time. However, the overshoot on motion was greater than for similar sensors without additional carrier material. For sensors 5 to 8, it was additionally shown that a tensile direction transverse to the mesh orientation had a positive effect on the behavior of the sensor. A correlation between the behavior of the sensors and the spandex content could not be proven.

Since all our textile patches have a different operating principle than the EeonTex sensor material, all textile patches were examined under a microscope. Patches 1 to 8 look very similar under the microscope, but they differ clearly from the EeonTex material in their geometric shape, which results from the manufacturing method, as shown in Fig. 10. Especially under tensile load, it can be observed that conductive connections between the meshes are separated in patches 1 to 8. This spatial separation of the meshes leads to an increase in the measured resistance with the load, as can be seen in Fig. 7. In comparison, the meshes of the EeonTex material are drawn more tightly together under load, forming a larger contact area between warp and weft threads. This can be measured as a decreasing resistance during loads, as shown in Fig. 9.

Both types, and all coating compositions and substrates, exhibit increased resistance during movement. This effect can be explained by rearrangement effects at the yarn level. It is assumed here that the entire textile is built up as an aggregate of sheet resistances and transition resistances. While a longitudinal and prolonged change in length alters the cross-section and the contact area of the yarns, and thus the resistance of the overall textile, a rapid movement in particular leads to a large number of open contact points between the yarns. The measurements at different speeds showed that the overshoot increases with higher speed. The relationship between the change in force and the change in length is also determined by the substrate material. Contrary to our expectations, the addition of more spandex did not lead to a damping of the overshoot characteristics.
Fig. 10. Surface of the textile patches (Patch 5 and EeonTex E1) under a microscope. The left side shows the textile at rest, the right side shows it stretched by 40% of its neutral length. The white arrow indicates the stretch direction.
6 Conclusion
In this paper, the behavior of various conductive knitted and woven textile sensor patches is investigated. The measured sensors differ in the coating, the spandex content, the pre-stretch, the tensile direction and the carrier material. In order to determine the change in resistance at a certain elongation, a custom test system is used which can automatically stretch the sensor. The test system consists of low-cost components such as stepper motors and parts from a 3D printer, and is capable of stretching the sensor textiles to a predefined length or a predefined force. The maximum force in our tests was limited to 20 N by the load cell used. It was possible to carry out a meaningful analysis with this measurement range due to the flexible and elastic nature of the textiles tested. By specifying several test procedures at a given distance, the influence of the movement speed on the overshoot behavior of the resistance is demonstrated. Furthermore, different samples were subjected to reproducible tests. By storing the resistance, distance and force information in a time series database, the test rig could be operated independently. Various textile sensor samples were analyzed for their suitability as sensors for detecting finger placement and movement in a smart glove, despite the complex underlying material behavior and effects such as non-linearity in resistance to length variation, the superposition of thread and fabric effects and additional mechanical properties due to coated textile structures. All textiles exhibit overshoot during both stretching and relaxation. There is also a time-dependent behavior, as the resistance has a decay phase after the movement of the patch. Due to the overshoot and the nonlinear behavior, it is difficult to assign a single resistance value to a specific strain. Therefore, it is not easy to characterize the behavior of the sensor. The test system does, however, allow the comparison of different textiles and manufacturing methods for use as a sensor.
The comparison of the textile patches with the EeonTex strip shows that the manufacturing method of the underlying textile is the most important factor. In contrast to our patches, the EeonTex is composed of elastic threads in a plain weave, exhibits a reduced overshoot and delivers more reproducible values. However, even with this strip, the resistance is not linear during movement and increases briefly during strain, although it should decrease. The measurements show that there are large differences between the textile sensor strips. Each sensor has its own characteristic behavior, which depends on the design of the sensor, but also on the manufacturing process. In order to reliably predict the behavior of the sensors and to find the significant factors that affect the characteristics, further investigations in combination with manufacturing methods are necessary.
References
1. Baldoli, I., Maselli, M., Cecchi, F., Laschi, C.: Development and characterization of a multilayer matrix textile sensor for interface pressure measurements. Smart Mater. Struct. 26(10), 104011 (2017)
2. Biermaier, C., Bechtold, T., Pham, T.: Towards the functional ageing of electrically conductive and sensing textiles: a review. Sensors (Basel, Switzerland) 21(17), 5944 (2021)
3. Bulut, Y., Sülar, V.: Effects of process parameters on mechanical properties of coated fabrics. Int. J. Clothing Sci. Technol. 23, 205–221 (2011)
4. Choudhry, N.A., Rasheed, A., Ahmad, S., Arnold, L., Wang, L.: Design, development and characterization of textile stitch-based piezoresistive sensors for wearable monitoring. IEEE Sens. J. 20(18), 10485–10494 (2020)
5. Gandler, M., Eibensteiner, F., Langer, J.: 3D printable sensors for smart textiles. In: 2019 International Conference on Information and Digital Technologies (IDT), pp. 153–157. IEEE (2019)
6. Guo, L., Berglin, L., Mattila, H.: Textile strain sensors characterization-sensitivity, linearity, stability and hysteresis. Nordic Text. J. 2, 51–63 (2010)
7. Islam, G.M.N., Ali, A., Collie, S.: Textile sensors for wearable applications: a comprehensive review. Cellulose 27(11), 6103–6131 (2020). https://doi.org/10.1007/s10570-020-03215-5
8. Jiang, W.-G., Hallett, S.R., Wisnom, M.R.: Development of domain superposition technique for the modelling of woven fabric composites. In: Mechanical Response of Composites, pp. 281–291. Springer, Cham (2008). https://doi.org/10.1007/978-1-4020-8584-0_14
9. Langer, J., Eibensteiner, F., Peterka, J., Knaack, P.: Pressure sensitive shoe insoles and socks for rehabilitation applications. In: 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT), pp. 1–6. IEEE (2018)
10. Obwaller, N., Langer, J., Eibensteiner, F.: Smart clothing for detecting pressure-sensitive gestures. In: 2019 IEEE 13th International Conference on Application of Information and Communication Technologies (AICT), pp. 1–6. IEEE (2019)
11. Petz, P., Eibensteiner, F., Langer, J.: Performance evaluation of conductive textiles for movement pattern recognition in smart socks. In: 2019 International Conference on Information and Digital Technologies (IDT), pp. 370–375. IEEE (2019)
12. Petz, P., Langer, J., Eibensteiner, F.: Textile in the loop as automated verification tool for smart textiles application. In: 18th International Conference on Computer Aided Systems Theory: EUROCAST 2022, 20–25 February 2022, December 2021
13. Tognetti, A., Carbonaro, N., Zupone, G., De Rossi, D.: Characterization of a novel data glove based on textile integrated sensors. In: 2006 International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 2510–2513. IEEE (2006)
14. Peng, W., Feng, X., Ding, T., Qin, Y.: Time dependence of electrical resistivity under uniaxial pressures for carbon black/polymer composites. J. Mater. Sci. 39(15), 4937–4939 (2004)
15. Wohlrab, S.: Erkennen von Fingerstellungen und Handgesten mittels eines Sensor-Handschuhs auf Basis von piezoresistiven textilen Sensoren und IMU-Sensorik
16. Zhang, X.-W., Pan, Y., Zheng, Q., Yi, X.-S.: Time dependence of piezoresistance for the conductor-filled polymer composites. J. Polym. Sci. Part B Polym. Phys. 38(21), 2739–2749 (2000)
Problem Structuring Combined with Sentiment Analysis to Product-Service System Performance Management
Ingrid Saiala C. S. Feitosa and Luiz Cesar Ribeiro Carpinetti
Department of Production Engineering, School of Engineering of São Carlos, University of São Paulo, São Paulo, Brazil
[email protected], [email protected]
Abstract. The success of any business model depends on how well it is capable of satisfying customers’ requirements. Therefore, organizations must be able to understand their customers’ desires and evaluate how they are performing. This is even more true in innovative contexts such as the implementation of circular economy (CE) concepts (e.g., value optimization, systems thinking, collaboration) by businesses, which defines the so-called circular business models (CBM). Considering the growing potential of text analytics techniques to extract and provide useful information from customers’ reviews or comments in social media, this article proposes a framework to integrate aspect-based sentiment analysis (ABSA) and a problem structuring method (PSM) to support the investigation of a product-service system (PSS) performance in the CE context. The proposed framework aims to help an organization incorporate stakeholders’ perspectives into a systemic analysis of a CBM’s current performance, which may support better-informed performance management decisions. The proposed approach, integrating ABSA and PSM, is grounded in two systematic literature reviews, which showed its potential contribution and novelty.
Keywords: Aspect-based sentiment analysis · Problem structuring methods · Soft System Methodology
1 Introduction
The growing concerns about the scarcity of natural resources, consumerism, and rapid product obsolescence have stimulated a search for more sustainable production and consumption models, which has led to the emergence of the Circular Economy (CE). The CE is recognized as one of the prominent topics in management discussion for its dual benefits to environmental and economic performance, since it proposes a dual loop system focused on the effective and efficient utilization of resources in an ecosystem (i.e., closing the loop) to benefit performance optimization [1]. Moreover, the transition to a CE system requires new types of business models (BM), the CBM, which propose innovative ways of thinking and doing business in opposition to the traditional linear (‘take-make-waste’) perspective [2].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 322–339, 2022. https://doi.org/10.1007/978-3-031-10464-0_21
The value proposition in a CBM shall focus on products and/or services that contribute to decreasing environmental impacts as well as increasing social and economic impacts [2]. In turn, value creation in a CBM demands changes in the value chain and production design for durability, reparability and upgradability. To achieve that, it is also necessary to increase integration between multiple stakeholders, such as suppliers and customers [3]. Thus, the effective performance of a CBM requires engaging customers in business processes to improve its value proposition and value capture [2–4]. Given that, the objective of this study is to incorporate customers’ perceptions into a systemic analysis that supports better-informed performance management in a CBM. To do so, this study proposes a framework that guides the implementation of a text analytics approach to obtain customers’ perceptions of a business model, and their incorporation, together with providers’ perspectives, into a systemic analysis through a Problem Structuring Method (PSM). This systemic analysis aims to provide a comprehensive view of the current business performance and to support decisions on what should be done to achieve desired performance levels. The information extracted from the text analytics implementation will assist in the identification of relevant performance attributes and customers’ perceptions of them. These results are then brought together with the performance objectives and values of providers in a PSM. The proposal in this article focuses on the Product Service System (PSS) business model, which is among the CBM groups defined by the British Standard 8001:2017 and may be exemplified by leasing agreements (i.e., not selling ownership of a product or service) and performance-based agreements (i.e., pay for success – performance, or defined results) [5].
The PSS was chosen because of its interaction with customers when delivering performance, since this BM integrates products and services to create customer utility and generate value [6]. In addition, PSSs are recognized as a promising approach to improve the sustainability dimensions of traditional business models because of their potential to contribute to resource use efficiency and to dissociate value perception from a physical product only [7]. Moreover, since PSSs are mostly innovative BMs, it is of great relevance to obtain customers’ acceptance and provide the values required by stakeholders to guarantee their market competitiveness. Besides, customer experience is strongly correlated with acceptance of PSS solutions, which means that the performance perceived by customers determines their willingness to continue to use the service [8]. The literature also highlights that the evaluation of PSS performance may be considered one of the most important tasks for a company, as it influences its market competitiveness, its cost-effectiveness, and subsequently, its business efficiency [9]. Given that, the framework in this article provides a valuable resource by building a systemic view of stakeholders’ perceptions and objectives for the PSS under analysis to support better-informed decisions and improve performance management. Finally, the conceptual development of the framework was based on two systematic literature reviews, which are discussed next. Accordingly, the following sections present a conceptualization of the topics and techniques included in this proposition (Sects. 1.1 to 1.3) and the systematic literature reviews that guided the development of the proposed framework (Sect. 2). After that, the proposed framework is presented and discussed, as well as the steps necessary for its implementation (Sect. 3 and Sect. 4).
1.1 Text Analytics and Sentiment Analysis
The framework proposed in this article aims to provide useful insights into CBM performance by incorporating customers’ perceptions. This may be done by implementing techniques that support the extraction of useful information from textual data produced by customers, such as product reviews or comments on social media. These are the text analytics or text mining techniques, a category of data mining developments that encompasses methodologies and processes that enable obtaining high-quality and useful information, or even insights, from textual data [10]. Text analytics has been extensively used in recent years, as shown by studies that capture information from social media data and use it in a variety of contexts, such as performance assessment of online retailing services, performance of smartphone brands and their operating systems, and even sentiment analysis concerning presidential candidates in U.S. elections [11–13]. The analysis of large volumes of textual data, usually generated in social media such as Twitter and Facebook as well as in review sections of e-commerce pages, could not be done by humans alone due to constraints such as subjective biases and mental limitations; however, these restrictions can be overcome to produce consistent results by using sentiment analysis systems [14]. Given that, the text analytics task of sentiment analysis and opinion mining has been experiencing rapid growth [15, 16]. Sentiment analysis or opinion mining may be defined as ‘the computational study of people’s opinions, evaluations, and emotions toward entities, events, and their attributes’ [17]. The information obtained from its implementation may provide means of assisting customers’ and business providers’ decisions [18].
The sentiment analysis task focuses on extracting opinions expressed in textual data; it is considered a text classification task and is commonly formulated as a supervised machine learning problem [14]. This problem may be defined as a partition of data into k pre-defined groups that are identified by their labels (e.g., “spam” and “not spam” to classify e-mails) [10]. Besides that, semi-supervised learning and lexicon-based approaches are also used to implement sentiment classification. The following section discusses the sentiment analysis task most adequate to this article’s proposition.
1.2 Aspect-Based Sentiment Analysis (ABSA)
The sentiment analysis task may be performed at three main classification levels. The first level, document-level analysis, aims to classify a whole document that discusses a given topic as expressing positive or negative sentiment. In turn, sentence-level analysis aims to classify the sentiment expressed in each sentence of a text. Finally, there is aspect-level sentiment analysis, or aspect-based sentiment analysis (ABSA), as it is mostly referred to in the related literature [19–22]. This last level aims to classify the sentiment concerning specific aspects of entities discussed in a text [23]. The first two levels, document-level and sentence-level analysis, assume that only one topic is discussed in a document or sentence; however, that is not true in many text formats, such as product reviews. ABSA provides an alternative that overcomes this limitation. Its approach is illustrated in Fig. 1.
Problem Structuring Combined with Sentiment Analysis
Fig. 1. Example of ABSA task in a sentence.
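The two ABSA subtasks can be sketched on the example sentence of Fig. 1 with a toy, lexicon-based routine. This is only an illustrative sketch and not the method adopted by the framework: the lexicons `ASPECT_TERMS` and `OPINION_LEXICON`, the negation window of two tokens, and the clause-splitting rule are all assumptions made for this example.

```python
import re

# Hypothetical hand-made lexicons, for illustration only; a real system
# would learn or curate these from domain data.
ASPECT_TERMS = {"camera": "camera", "battery": "battery"}
OPINION_LEXICON = {"excellent": 1, "great": 1, "long": 1, "poor": -1, "bad": -1}
NEGATIONS = {"not", "never", "no"}
# Assumption: an opinion refers to the aspect named in the same clause.
CLAUSE_SPLIT = re.compile(r",|;|\bbut\b|\bwhile\b")


def absa(sentence):
    """Return {aspect: 'positive' | 'negative' | 'neutral'} for one sentence."""
    results = {}
    for clause in CLAUSE_SPLIT.split(sentence.lower()):
        tokens = re.findall(r"[a-z']+", clause)
        # Subtask 1: aspect extraction (a plain lexicon lookup in this sketch).
        aspects = [ASPECT_TERMS[t] for t in tokens if t in ASPECT_TERMS]
        # Subtask 2: aspect sentiment classification within the same clause.
        score = 0
        for i, token in enumerate(tokens):
            if token in OPINION_LEXICON:
                polarity = OPINION_LEXICON[token]
                # Flip polarity if a negation occurs in the two preceding tokens.
                if NEGATIONS & set(tokens[max(0, i - 2):i]):
                    polarity = -polarity
                score += polarity
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        for aspect in aspects:
            results[aspect] = label
    return results


SENTENCE = ("The camera of this cell phone is excellent, "
            "but the battery life is not long enough")
print(absa(SENTENCE))  # {'camera': 'positive', 'battery': 'negative'}
```

A document-level or sentence-level classifier would have to assign a single label to this sentence, whereas the aspect-level output keeps the positive and the negative opinion apart.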
The sentence “The camera of this cell phone is excellent, but the battery life is not long enough” comments on two aspects (different topics) of the same entity (a cell phone). Also, the opinions expressed about them have different sentiments, i.e., the camera has a positive evaluation while the battery has a negative one. Therefore, if the objective is to assess the performance of a product or service, a more detailed analysis at the entity and aspect level is required to identify and classify sentiments associated with specific aspects [19, 23]. Given that, ABSA was identified as the most adequate text analytics task to achieve the objectives of the proposed framework. ABSA is divided into two subtasks, aspect extraction and aspect sentiment classification, highlighted in green and red, respectively, in Fig. 1. It introduces a set of problems that require deeper natural language processing capabilities, but it can also produce a richer set of results [14, 22]. ABSA shall be used to extract the main performance attributes perceived about the business model analyzed and the opinions on them. This information is the input to build a systemic analysis of the current PSS performance, which can be supported by a PSM. These methods are discussed next.
1.3 Problem Structuring Method (PSM)
A PSM aims to provide analytical assistance in real-world situations that may be characterized as complex due to the presence of different perspectives, multiple actors, partially conflicting interests, intangible aspects, and uncertainty [24]. The use of PSMs may facilitate problem formulation under these circumstances through transparency of representation, interaction, and iteration, and by facilitating negotiation [25].
These methods have been used to address business management issues related, for instance, to supply chain management, knowledge and innovation, and information security, and have the potential to structure problems in different areas such as healthcare management, social issues, and environmental management [26]. Different PSMs may be implemented to elicit relevant knowledge and provide a process to iteratively structure a problem situation [24]. Relevant literature presents among the most applied ones the Soft System Methodology (SSM), the Strategic Choice Approach (SCA), the Strategic Options Development and Analysis (SODA), and the Value Focused Thinking (VFT) [26, 27].
The SSM was selected as the PSM of the framework proposed here due to its emphasis on structuring a systemic view and because it is a learning system that leads to action in complex and problematical situations, aiming at their improvement. These characteristics make it adequate to support a systemic analysis [28]. The support offered by the SSM is important to implement systems thinking, one of the CE principles that shall be applied by businesses and guide organizational decision making. As SSM and CE share this principle, CE-related analyses may benefit from using this PSM to explore complexities in contexts such as policy-making towards CE implementation [29].
1.4 Soft System Methodology (SSM)
The SSM is a systems-thinking-based method of intervening in real-world problematic situations [30]. It provides an organized way of thinking about these situations and proposes actions to bring about improvements [31]. The implementation of the SSM process is based on the following essential elements: (1) a problematic real-world situation in which potential improvement opportunities may exist; (2) models of system activity relevant to this situation, but not describing it ‘as it is’; (3) use of these models as devices to explore the problematic situation; and (4) a structured debate about desirable and feasible change [31]. The SSM process applies these elements in a learning cycle, which goes from finding out about a problematical situation to defining and taking action to improve it. This cycle is developed through the stages illustrated in Fig. 2 [29, 31].
Fig. 2. Basic Structure of SSM Learning Cycle. Adapted from [33] and [35].
As illustrated in Fig. 2, the SSM learning cycle encompasses: finding out about a problematical situation; making relevant conceptual models exploring it which are based on different worldviews and CATWOE definitions (i.e., who are the Customer – C and
Actors – A, which is the Transformation process – T and in what Worldview – W it is involved, who is the Owner – O, and what are the Environmental constraints – E); questioning the situation using these defined models to identify desirable and feasible actions for improvement; and implementing these actions [31]. In addition, SSM is a learning system, which means that it does not seek optimized solutions, but rather an accommodation among interests and views that will enable actions to undertake feasible improvements [29].
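The six CATWOE elements can be recorded in a small data structure so that an incomplete definition is easy to spot before conceptual models are built. The sketch below is an assumption-laden illustration: the class `Catwoe`, its methods, the sentence template of `root_definition`, and the bike-sharing example values are all hypothetical and are not taken from the SSM literature cited here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Catwoe:
    """One CATWOE definition; every field is free text supplied by analysts."""
    customers: str        # C: who benefits from or is affected by the system
    actors: str           # A: who carries out the transformation
    transformation: str   # T: what the system does (input -> output)
    worldview: str        # W: the perspective that makes T meaningful
    owner: str            # O: who could stop or change the system
    environment: str      # E: constraints taken as given

    def missing(self):
        """Names of elements left blank, so the definition can be completed."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def root_definition(self):
        """Compose a one-sentence root definition (hypothetical template)."""
        return (f"A system owned by {self.owner} and operated by {self.actors} "
                f"that {self.transformation} for {self.customers}, "
                f"because {self.worldview}, subject to {self.environment}.")


# Hypothetical example values for a bike-sharing PSS.
example = Catwoe(
    customers="registered riders",
    actors="the maintenance and support teams",
    transformation="turns a need for short urban trips into completed rides",
    worldview="access to a well-kept bike is worth more than owning one",
    owner="the PSS provider",
    environment="city regulations and the available docking stations",
)
print(example.missing())          # []
print(example.root_definition())
```

Different worldviews of the same situation can be captured as separate `Catwoe` instances, which is in line with SSM building several relevant models rather than a single one.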
2 Research Methodology
The proposition of this article was developed based on the results of two systematic literature reviews (SLRs). An SLR is a process that, by following methodological rigor, supports obtaining insights through theoretical synthesis into fields and subfields and aids the development of a reliable knowledge base from a range of studies [32]. The methodology steps of this article are summarized in Fig. 3.
Fig. 3. Article methodology steps.
As illustrated in Fig. 3, the first phase of this proposition consisted of an investigation of the scientific literature. This was done through two literature reviews whose objectives were: (1) to identify current approaches to performance management of PSS in CE contexts and analyze what is assessed, what methods are applied, and which attributes are taken into consideration; and (2) to review the literature focused on the implementation of text analytics approaches to support performance management and decision making within organizations. The results obtained were important to analyze the feasibility and relevance of this article’s proposition and to assist the selection of the most suitable text analytics approach to support performance analysis. The implementation and results of these literature reviews are discussed next.
2.1 SLR on Performance of PSS in CE
The first SLR aimed to answer the following research questions: How do PSS business models in CE contexts have their performance assessed? Which are the main aspects emphasized when assessing their performance? To help answer these questions, search strings were defined with the following keywords: product-service, product service or product as a service, and performance assessment, performance evaluation, and performance management. They were used in the Web of Science and Scopus databases, which were chosen for this SLR because they encompass renowned scientific publications on the researched topics. The results obtained were initially filtered according to language, year of publication, research area, and type of source. Thus, only articles written in English, published in journals or conferences from 2000 to 2021, and in research areas related to this work were selected for the next SLR phases (i.e., publications from areas such as Chemistry, Medicine, Physics, and Astronomy were excluded). This preliminary stage resulted in 688 articles; after the removal of duplicates, a set of 593 results remained to be filtered in the subsequent phases. Three filters were applied to this resulting set of papers in the next SLR stages: (1) reading the article’s title, abstract, and keywords; (2) reading the Introduction and Conclusion sections; and (3) reading the full content. The goal of this process was to select only articles that have performance assessment and/or performance management-related analysis in their main proposition, or articles that propose attributes to be taken into consideration when assessing the performance of a circular PSS. The described steps are summarized in Fig. 4.
Fig. 4. Stages of SLR on PSS performance assessment.
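For reproducibility, a search string can be assembled programmatically from keyword groups. The sketch below assumes that the keywords within a group are combined with OR and that the groups are combined with AND; the exact strings submitted to each database are not given in the text, and database-specific field tags are deliberately left out. The function name `build_search_string` is hypothetical.

```python
def build_search_string(*keyword_groups):
    """Join synonyms with OR inside each group and join the groups with AND."""
    clauses = []
    for group in keyword_groups:
        quoted = " OR ".join(f'"{term}"' for term in group)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


# Keyword groups as listed for the first SLR.
PSS_TERMS = ["product-service", "product service", "product as a service"]
PERFORMANCE_TERMS = [
    "performance assessment",
    "performance evaluation",
    "performance management",
]

query = build_search_string(PSS_TERMS, PERFORMANCE_TERMS)
print(query)
```

The same function can be reused for the second SLR by passing its two keyword groups, and each database's own field syntax can then be wrapped around the result.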
In the final stage, 52 articles were read in full and 36 of them were selected for content analysis. This stage consisted of a comprehensive study of the articles’ propositions to extract and describe the aspects most relevant to this SLR’s
objectives. This analysis identified that PSS has been the focus of several approaches that analyze either its design or its implementation phase, and that its role in CE adoption has also been remarked upon [8, 9, 33–35]. However, the business performance analysis of circular PSS and the identification of improvement opportunities as proposed in this paper have not been addressed yet. In addition, the incorporation of customer requirements or customer values in the identified analyses highlights the relevance of these perspectives to PSS assessment [8, 36–38]. Taken together, both findings demonstrate the relevance of the framework proposed in this article to incorporate customers’ perception in a systemic analysis of circular PSS performance and to benefit process improvement.
2.2 SLR on Text Analytics Uses Focused on Improving Decision Making and Performance Management
This second SLR aimed at answering ‘How are text analytics techniques being used to obtain useful information to improve decision making and performance management within organizations?’. Three renowned databases were consulted: Web of Science, Scopus, and IEEE Xplore Digital Library. Each of these databases has its own search mechanism; thus, a set of keywords was defined and then used to build adequate search strings. The keywords applied were text analytics, text mining, opinion mining, or sentiment analysis, and decision making or decision-making, performance management, business performance management, business performance, and organizational performance. Only articles written in English and published in journals or conferences from 2000 to 2021 were selected for the subsequent phases of the SLR. The stages of this SLR are illustrated in Fig. 5 and discussed next.
Fig. 5. Stages followed in SLR on text analytics to improve decisions and performance management.
The initial resulting dataset consisted of 195 articles. After the removal of duplicates, 156 articles remained to be filtered in the next steps. The filters applied were the same as those defined for the SLR previously discussed: (1) reading the article’s title, abstract, and keywords; (2) reading the Introduction and Conclusion sections; and (3) reading the full content. These consecutive stages excluded articles that do not develop uses of information obtained from text analytics techniques to improve decisions and process performance in organizations. This choice was made because, to fulfil the objective of this SLR, it is important to comprehend how this asset is being used to support organizational process improvement. Furthermore, literature reviews and studies that focus on proposing new techniques to analyze textual data were also excluded. Given the defined criteria, 31 articles were selected to be read in full. The findings of this SLR show the use of text analytics techniques in different contexts as a means of supporting performance analysis and/or improving decisions and processes in organizations. Examples of applications of text analytics techniques are the analysis of customer reviews to monitor restaurant performance; the analysis of tweets to identify disruptions in the processes of a telecommunication company and to evaluate organizational performance in railway services; and decision automation on the acceptance of student transfer requests in a university [21, 39–42]. The studies in the selected articles have mainly used data from customer reviews and social media. However, organizational reports are also a source of data in applications such as corporate operating performance assessment [43], credit risk evaluation in the decision-making process for loans [44], and analysis to build forecast models of firm performance [45]. Regarding the text analytics tasks, sentiment analysis and topic modelling stand out in the examined studies.
They are present in 12 of the articles and are also seen combined in the text analytics task ABSA. This is evidence that these text analytics approaches are widely used in process performance analysis, which supports the choice of this class of methods to achieve the objectives of this study. In addition, only one of the articles analyzed used text analytics techniques as a means of integrating stakeholders’ opinions into PSS analysis. A framework is proposed in [37] to assist PSS design and Lean PSS design; it integrates a sentiment analysis approach to analyze comments obtained from an inter-organizational forum on the overall issues of a mold maintenance service. The main objective of this framework is to assist an effective evaluation of PSS design and to provide guidelines for a Lean PSS design [37]. Even though the discussed study demonstrates an interesting incorporation of customer opinions in PSS analysis, there is still much potential to be explored in the use of text analytics techniques in PSS management, as the literature on these topics has presented few studies so far. Also, the analyzed study focused on the PSS design phase, but this kind of textual data analysis should also be incorporated to support assessment in the post-implementation phase. Since the value provided to customers in a PSS is more connected to performance during the use phase than to the ownership of a product [36], being able to effectively incorporate customers’ perceptions is imperative to identify improvement opportunities and assist performance management decisions in this context. Finally, the focus of the study discussed and of most of the findings of this SLR was far from circular PSS. Given that it is an emerging business configuration that incorporates
innovative values from both customers and providers, PSSs may highly benefit from effective integration of customer perspectives to assist performance improvement. In conclusion, the discussed findings highlight the potential of this article’s proposition to provide an effective procedure that supports performance management and improvement of PSS in the CE context. Also, the approach proposed to build a systemic view of a circular PSS to analyze its performance and identify improvement opportunities was not identified in the previous literature consulted.
2.3 Research Gap Identified and Addressed by this Article Proposition
The results from the first SLR illustrate the use of customers’ perception to evaluate the fulfillment of their requirements throughout the PSS life cycle stages, which can be done either by supporting decision-making to select the best PSS design option or by continuous improvement of the quality of products and services in an implemented PSS [46]. However, most of the findings do not incorporate customers’ perception to monitor performance in the use phase or post-implementation phase, nor do they develop this sort of analysis for PSSs in a CE context. In turn, the results from the second SLR indicate that many organizational contexts have already been benefiting from using text analytics to support performance management and decision making, including analysis in the PSS design phase [37]. These findings demonstrate the potential of incorporating text analytics approaches into performance analysis to obtain valuable gains for PSS management, particularly in CE contexts, since PSSs are mostly innovative business models and are seen as one of the most viable ways of practically implementing CE and sustainability concepts [8, 47]. Thus, there is great value in monitoring customers’ satisfaction in these businesses and assessing whether the values required by these stakeholders are being provided, so that the businesses are able to succeed.
Accordingly, this framework proposition focuses on the business model performance in the use phase and incorporates both the provider’s and the customers’ perspectives as a means of supporting a systemic analysis. The information obtained through a text analytics implementation is integrated into a PSM approach to build a systemic view of the analyzed context. This analysis may support performance management by structuring a comprehensive view to identify improvement opportunities. Hence, companies that implement PSS in a circular context may guide their business towards more effective practices and make better-informed decisions to guarantee customer satisfaction and competitiveness.
3 A Framework to Integrate ABSA and SSM to Support Identification of Improvement Opportunities
The proposed framework was grounded in the discussed results of the two SLRs. It was identified that evaluations of a PSS may be developed in two stages of its lifecycle, each with a different focus: evaluation in pre-implementation (design or development phase), which aims to support decision-making on the most suitable PSS solution among several alternatives; and evaluation in post-implementation (use or implementation, and
disposal), whose focus is on monitoring PSS progress and the degree to which its implementation is achieving the intended goals [46]. Given that, the scope of this study is performance assessment in the use (or implementation) phase, since the main objective of this framework is to provide an approach that supports the identification of improvement opportunities once the PSS is fully functioning. Furthermore, according to the work of [48] on smart PSSs, performance and quality evaluations of PSS have focused on customers’ perception mainly in the early design phase, and there is a lack of evaluation of context-aware solutions in the usage phase. Hence, this framework aims to enable performance evaluation in the use phase through the incorporation of stakeholders’ perceptions (customers and providers) in an analysis that supports the identification of improvement opportunities. As a result, companies that implement this framework may be able to guide their business towards more effective practices and make better-informed decisions to guarantee competitiveness and customer satisfaction. To achieve this objective, an integrated approach of ABSA and a PSM was developed. As previously discussed in Sects. 1 and 2, ABSA was identified as the most suitable text analytics task for the proposed objective. Its implementation enables the identification of the performance attributes that customers comment on and of their perception of them, i.e., whether they are satisfied or not. The knowledge obtained from the ABSA implementation, alongside information on the business provider’s objectives and values, is then used as input to the selected PSM, the SSM. An illustration that summarizes the implementation of this integrated approach is presented in Fig. 6.
Fig. 6. Implementation of the proposed framework
As presented in Fig. 6, comments or reviews generated on webpages, forums, and social media by customers about their experiences with a PSS may be a source of textual data for an ABSA implementation. Instead of treating opinion mining simply as a classification of sentiments, ABSA separates this task into the aspect extraction and aspect sentiment classification subtasks [14, 49]; thus, its objective is to classify sentiments concerning specific aspects, or performance attributes, of entities discussed in a text
[23]. This information is then associated, in the SSM implementation, with information on the provider’s objectives and values, which may be acquired, for instance, through interviews, questionnaires, and organizational documents. The provider’s perspective on business performance regarding their projected expectations can then be compared with the customers’ perception obtained from the ABSA results. This is done by incorporating these two perspectives into the rich picture proposed by the SSM. This rich picture is a graphical representation that shall capture the main entities, structures, and viewpoints in a situation, the processes going on, and currently recognized or potential issues [31]. After that, conceptual models are built by defining and linking the activities necessary to achieve the transformation processes that were identified. These models are developed to provide a means of questioning the situation and identifying where there are improvement opportunities. Once these opportunities are recognized, actions to implement the necessary changes may be defined, observing whether these are desirable and feasible. The described stages were integrated in the proposed framework illustrated in Fig. 7. Its implementation steps are discussed next.
Fig. 7. The Framework that integrates ABSA and SSM to identify improvement opportunities and support performance management.
As previously discussed, the main objective is to structure a systemic analysis to identify improvement opportunities and to support better-informed performance management. To do so, this framework was divided into the following two phases:
Phase 1 – the steps included in this phase (illustrated at the top of Fig. 7) are focused on obtaining:
a. The customers’ perception of current business performance – an ABSA technique shall be implemented on customers’ comments or reviews to extract the main performance attributes perceived in relation to the business model analyzed and their opinions on them (i.e., whether they are satisfied or not); and
b. The providers’ perspectives and values – these may be gathered through interviews, questionnaires, and analysis of organizational documents.
The first approaches developed to carry out ABSA subtasks were based on the frequency of nouns and phrases in a given text [50], supported by the assumption that aspect words are more likely to be repeated. Another approach investigates the relations between opinions and targets through rule-based linguistic patterns. The method related to this approach that was most often identified in the literature review was topic modelling, mainly the Latent Dirichlet Allocation (LDA) model, which has been widely applied to identify both entities and aspects [19]. In addition, there is a predominance of supervised learning methods to perform ABSA subtasks, such as conventional classifiers (e.g., Support Vector Machines, neural networks) and deep learning methods, more specifically deep neural networks (DNNs), as these are among the most recent advancements in the processing of textual data. Given that, many methods may be implemented to perform ABSA tasks.
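The frequency-based idea mentioned above, that aspect words tend to be repeated across texts, can be sketched in a few lines. This is a simplified illustration and not the procedure of [50]: instead of part-of-speech tagging, it uses a small hypothetical stopword list as a rough stand-in for noun detection, and the three reviews are invented for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list standing in for proper part-of-speech filtering.
STOPWORDS = {"the", "a", "an", "and", "but", "to", "was", "is", "it",
             "itself", "of", "this"}


def candidate_aspects(reviews, min_reviews=2):
    """Rank words by the number of reviews that mention them at least once."""
    doc_freq = Counter()
    for review in reviews:
        tokens = set(re.findall(r"[a-z]+", review.lower()))
        doc_freq.update(tokens - STOPWORDS)
    frequent = [(word, n) for word, n in doc_freq.items() if n >= min_reviews]
    # Sort by descending frequency, then alphabetically for a stable order.
    return sorted(frequent, key=lambda item: (-item[1], item[0]))


# Invented reviews of a hypothetical bike-sharing PSS.
reviews = [
    "The bike was clean and the app was easy to use.",
    "App crashed twice, but the bike itself was fine.",
    "Great bike, terrible app, slow support.",
]
print(candidate_aspects(reviews))  # [('app', 3), ('bike', 3)]
```

Words that occur in a single review, including most opinion words, fall below the threshold, which illustrates both the appeal of the approach and its blindness to rarely mentioned aspects such as "support".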
Each of these methods has its own advantages and drawbacks, such as the inability to incorporate domain- and context-specific orientations in approaches based only on word frequency, or the requirement of a labelled dataset for training deep neural networks. That said, the selection of the most suitable method will depend on criteria such as the type and volume of available textual data and the amount of time required to label it, as well as the previous performance of these methods in similar tasks. In addition, it may also be interesting to implement more than one method, for instance, different DNN models, and to compare their performance to recommend the most suitable one. These decisions shall be addressed in an implementation phase of this framework and, as they depend on very case-specific characteristics, this article does not indicate any of them as the most adequate. This may be done later in a case study that implements the proposed framework.
Phase 2 – the objective of this phase is to incorporate the output of the previous phase (perceived performance attributes and sentiments, and performance objectives) into the SSM implementation, which includes:
a. Elaborate a rich picture that shall represent all aspects of performance relevant to both actors involved in the analysis. Relationships among these aspects, and any identified divergence between performance expectations and the services or products provided, shall also be added.
b. Create conceptual models of the analyzed system based on the SSM root definition and CATWOE. The conceptual model is an ideal representation of a given process or system under analysis and should also describe which activities need to be implemented to make the ideal system operational.
c. Develop a comparison between the conceptual model and the real world to discuss where there are improvement opportunities and to define possible changes. The proposed changes shall be desirable and feasible, according to the SSM, so they can be implemented.
The SSM also proposes a concluding step of ‘taking action’, which consists of implementing the proposed changes that were accepted by the organization. This further step depends on decisions of the organization’s management and is represented in the framework by the ‘Performance Management Decision’ box. The framework implementation aims to provide knowledge that supports these decisions by identifying processes and aspects that may be improved. A final point to remark is that the decision to conclude a systems study that used SSM is arbitrary, since there are no permanent solutions: events and ideas are dynamic, and the learning and systems thinking processes are in principle never-ending [28].
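One way to prepare the output of Phase 1 for the rich picture is to aggregate aspect-level sentiment labels and contrast them with the provider's expectations. The sketch below is a hypothetical illustration: expressing provider objectives as a target share of positive mentions per aspect, the function names, and all example values are assumptions made here and are not prescribed by the framework.

```python
from collections import defaultdict


def positive_share(mentions):
    """mentions: iterable of (aspect, label) pairs, e.g. produced by ABSA."""
    counts = defaultdict(lambda: [0, 0])  # aspect -> [positive, total]
    for aspect, label in mentions:
        counts[aspect][1] += 1
        if label == "positive":
            counts[aspect][0] += 1
    return {aspect: pos / total for aspect, (pos, total) in counts.items()}


def divergences(mentions, provider_targets):
    """List aspects whose observed positive share is below the provider target.

    Aspects that the provider values but customers never mention are reported
    with an observed share of None, since they also deserve discussion.
    """
    shares = positive_share(mentions)
    gaps = []
    for aspect, target in provider_targets.items():
        observed = shares.get(aspect)
        if observed is None or observed < target:
            gaps.append((aspect, target, observed))
    return gaps


# Invented ABSA output and provider targets for a hypothetical PSS.
mentions = [
    ("availability", "positive"), ("availability", "negative"),
    ("availability", "negative"), ("availability", "negative"),
    ("maintenance", "positive"), ("maintenance", "positive"),
]
provider_targets = {"availability": 0.8, "maintenance": 0.7, "take-back": 0.6}
print(divergences(mentions, provider_targets))
# [('availability', 0.8, 0.25), ('take-back', 0.6, None)]
```

The resulting list is only a starting point for the debate that SSM calls for; whether a gap is worth acting on remains a judgement of the participants, not of the script.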
4 Discussion
This study proposed a framework that offers a path to build a systemic analysis of BM performance by integrating ABSA, as a means of obtaining customers’ perceptions, with the SSM, which is implemented as a learning process and can be used to guide the identification of improvement opportunities. This integration was proposed to obtain useful information from customers’ perceptions and to assist management decisions to improve performance. This approach may be highly beneficial for PSSs in CE contexts, since these business models usually create innovative relationships between providers and customers and depend on customer experience and acceptability. Therefore, its implementation may help to guarantee customer satisfaction and business competitiveness. Additionally, the combined use of ABSA and SSM, as proposed by this study, was not identified in the previous literature. Thus, the proposed framework may provide a novel approach that enriches performance management analysis. Text analytics has been widely applied in many studies; however, ABSA was not found to be used to assess PSS performance in the usage phase. The incorporation of customers’ perception through ABSA helps to monitor current performance and may be seen as a more efficient way to gather customers’ opinions, since it uses textual data freely available on the internet (e.g., in forums and social media). These data may provide more sincere and trustworthy opinions, since customers wrote them freely; on the other hand, these customers may not be so willing to answer tiring questionnaires or surveys, which are commonly used in studies that analyze customers’ perceptions [8]. These traditional approaches may also require a considerable number of respondents to perform statistical analysis and may be costly to organizations.
Furthermore, even though there are different conceptual models to support performance analysis or build evaluation indexes [36, 46, 51], this is the first to propose guidance to structure a systemic view that contrasts the current performance achieved by a business
model, according to customers’ perceptions, and providers’ objectives to identify improvement opportunities. Finally, the structured analysis enabled by this approach may support decision-making within organizations by providing useful information that may be incorporated to build effective decision models, such as multicriteria decision making methods, which have already been used in many studies [34, 37, 52]. As future work, the proposed framework shall be implemented in a case study to enable its validation and help identify where it needs to be improved. Once a case study is defined and the necessary textual data is collected, the methods to implement ABSA may be selected (e.g., neural networks, topic modelling algorithms) and the models may be built. Their performance may be evaluated according to predefined criteria (e.g., Precision, Recall, F-score, Accuracy) and in comparison with results obtained from similar applications in the related literature. The development of the SSM (phase 2 of the framework) shall be followed, and have its results validated, by members of the PSS organization being studied and by specialists in this business model. In addition, although the integration of ABSA and SSM proposed in this framework was developed to support performance management in a PSS, its structure does not have context-specific stages. Hence, it is expected that, once the framework implementation is validated, it may also be explored to develop similar analyses in different circular business models.
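The evaluation criteria named above can be computed directly from predicted and reference labels. The following is a minimal sketch for the binary case, assuming "positive" is the class of interest and the F-score is the balanced F1; the function name and the eight example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive="positive"):
    """Precision, recall, F1, and accuracy for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    # Guard against empty denominators by reporting 0.0 in those cases.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Invented reference and predicted sentiment labels for eight aspect mentions.
y_true = ["positive"] * 4 + ["negative"] * 4
y_pred = ["positive", "positive", "positive", "negative",
          "positive", "negative", "negative", "negative"]
print(classification_metrics(y_true, y_pred))
# {'precision': 0.75, 'recall': 0.75, 'f1': 0.75, 'accuracy': 0.75}
```

When the sentiment classes are imbalanced, accuracy alone can look favourable while recall on the minority class is poor, which is one reason to report the four criteria together.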
References 1. Alhawari, O., Awan, U., Bhutta, M.K.S., Ülkü, M.A.: Insights from circular economy literature: a review of extant definitions and unravelling paths to future research. Sustainability 13(2), 859 (2021) 2. Oghazi, P., Mostaghel, R.: Circular business model challenges and lessons learned-an industrial perspective. Sustain. 10(3), 1–19 (2018) 3. Lopes de Sousa Jabbour, A.B., et al.: Circular economy business models and operations management. J. Clean. Prod. 235, 1525–1539 (2019) 4. Urbinati, A., Rosa, P., Sassanelli, C., Chiaroni, D., Terzi, S.: Circular business models in the European manufacturing industry: a multiple case study analysis. J. Clean. Prod. 274, 122964 (2020) 5. British Standards: Framework for Implementing the Principles of the Circular Economy in Organizations – Guide. BSI Stand. Ltd., p. 90 (2017) 6. Boehm, M., Thomas, O.: Looking beyond the rim of one’s teacup: a multidisciplinary literature review of product-service systems in information systems, business management, and engineering & design. J. Clean. Prod. 51, 245–260 (2013) 7. Kjaer, L.L., Pigosso, D.C.A., McAloone, T.C., Birkved, M.: Guidelines for evaluating the environmental performance of product/service-systems through life cycle assessment. J. Clean. Prod. 190, 666–678 (2018) 8. Pecorari, P.M., Lima, C.R.C.: Correlation of customer experience with the acceptance of product-service systems and circular economy. J. Clean. Prod. 281, 125275 (2021) 9. Mourtzis, D., Fotia, S., Vlachou, E., Koutoupes, A.: A Lean PSS design and evaluation framework supported by KPI monitoring and context sensitivity tools. Int. J. Adv. Manuf. Technol. 94(5–8), 1623–1637 (2017) 10. Aggarwal, C.C.: Machine Learning for Text. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-73531-3 11. Ibrahim, N.F., Wang, X.: A text analytics approach for online retailing service improvement: evidence from Twitter. Decis. Support Syst. 121, 37–50 (2019)
Problem Structuring Combined with Sentiment Analysis
12. Arora, D., Li, K.F., Neville, S.W.: Consumers’ sentiment analysis of popular phone brands and operating system preference using twitter data: a feasibility study. In: Proceedings of the International Conferenc Advanced Information Networking and Applications AINA, vol. 2015, pp. 680–686 (2015) 13. Wang, C.-H., Ali, M.H., Chen, K.-S., Negash, Y.T., Tseng, M.-L., Tan, R.R.: Data driven supplier selection as a circular economy enabler: a Taguchi capability index for manufactured products with asymmetric tolerances. Adv. Eng. Inf. 47, 101249 (2021) 14. Liu, B., Zhang, L.: A Survey of opinion mining and sentiment analysis. In: Aggarwal, C.C., Zhai, C.X. (eds.) Mining Text Data, pp. 415–463. Springer US, Boston, MA (2012). https:// doi.org/10.1007/978-1-4614-3223-4_13 15. Soong, H.C., Jalil, N.B.A., Kumar Ayyasamy, R., Akbar, R.: The essential of sentiment analysis and opinion mining in social media : Introduction and survey of the recent approaches and techniques. ISCAIE 2019 – 2019 IEEE Symposium on Computer Applicationa and Industrial Electronics, pp. 272–277 (2019) 16. Ravi, K., Ravi, V.: A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowledge-Based Syst. 89, 14–46 (2015) 17. Liu, B.: Sentiment analysis: a multi-faceted problem. IEEE Intell. Syst. 25(3), 76–80 (2010) 18. Aggarwal, C.C., Zhai, C.X. (eds.): Mining Text Data. Springer US, Boston, MA (2012). https://doi.org/10.1007/978-1-4614-3223-4 19. Do, H.H., Prasad, P.W.C., Maag, A., Alsadoon, A.: Deep learning for aspect-based sentiment analysis: a comparative review. Expert Syst. Appl. 118, 272–299 (2019) 20. Singh, A., Jenamani, M., Thakkar, J.: Do online consumer reviews help to evaluate the performance of automobile manufacturers? J. Enterp. Inf. Manage. 33(5), 1153–1198 (2020) 21. Ching, M.R.D., De Dios Bulos, R.: Improving restaurants’ business performance using yelp data sets through sentiment analysis. In: ACM International Conference Proceeding Series, no. 
2013, pp. 62–67 (2019) 22. Wang, W., Liu, W., Mingers, J.: A systemic method for organisational stakeholder identification and analysis using soft systems methodology (SSM). Eur. J. Oper. Res. 246(2), 562–574 (2015) 23. Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5(4), 1093–1113 (2014) 24. Rosenhead, J.: Past, present and future of problem structuring methods. J. Oper. Res. Soc. 57(7), 759–765 (2006) 25. Rosenhead, J.: What’ s the problem ? An introduction to problem structuring methods. Interfaces (Providence) 26(6), 117–131 (1996) 26. de Alexandre, A., Júnior, G., Schramm, V.B.: Problem structuring methods: a review of advances over the last decade. Syst. Pract. Action Res. 35(1), 55–88 (2021) 27. de Almeida, A.T., Morais, D.C., Costa, A.P.C.S., Alencar, L.H., de Daher, S.F.D.: Decisão em Grupo e Negociação: métodos e aplicações. São Paulo: Atlas (2012) 28. Checkland, P.B.: Soft systems methodology. Hum. Syst. Manage. 8(4), 273–289 (1989) 29. Abuabara, L., Paucar, A., Burrowes, T.: Consumers’ values and behaviour in the Brazilian coffee-in-capsules market: promoting circular economy. Int. J. Prod. Res. 57(23), 7269–7288 (2019) 30. Checkland, P.B., Haynes, M.G.: Varieties of systems thinking: the case of soft systems methodology. Manage. Control Theory 3, 151–159 (2019) 31. Checkland, P., Poulter, J.: Soft systems methodology. In: Reynolds, M., Holwell, S. (eds.) Systems Approaches to Making Change: A Practical Guide, pp. 201–253. Springer London, London (2020). https://doi.org/10.1007/978-1-4471-7472-1 32. Tranfield, D., Denyer, D., Smart, P.: Towards a methodology for developing evidenceinformed management knowledge by means of systematic review. Br. J. Manage. 14, 207–222 (2003)
I. S. C. S. Feitosa and L. C. Ribeiro Carpinetti
33. Chirumalla, K., Bertoni, A., Parida, A., Johansson, C., Bertoni, M.: Performance measurement framework for product-service systems development: a balanced scorecard approach. Int. J. Technol. Intell. Plan. 9(2), 146–164 (2013) 34. Rondini, A., Bertoni, M., Pezzotta, G.: At the origins of product service systems: supporting the concept assessment with the engineering value assessment method. CIRP J. Manuf. Sci. Technol. 29, 157–175 (2020) 35. Wang, N., Ren, S., Liu, Y., Yang, M., Wang, J., Huisingh, D.: An active preventive maintenance approach of complex equipment based on a novel product-service system operation mode. J. Clean. Prod. 277, 123365 (2020) 36. Wilberg, J., Hollauer, C., Omer, M.: Supporting the performance assessment of productservice systems during the use phase. Procedia CIRP 30, 203–208 (2015) 37. Mourtzis, D., Papatheodorou, A.M., Fotia, S.: Development of a key performance indicator assessment methodology and software tool for product-service system evaluation and decision-making support. J. Comput. Inf. Sci. Eng. 18(4), 1–13 (2018) 38. Wang, P.P., Ming, X.G.: Value evaluation method of industrial product-service based on customer perception. Int. J. Serv. Oper. Inform. 9(1), 15–39 (2018) 39. Fernandes, E., Moro, S., Cortez, P., Batista, F., Ribeiro, R.: A data-driven approach to measure restaurant performance by combining online reviews with historical sales data. Int. J. Hosp. Manage. 94, 102830 (2021) 40. Dlamini, N., Marebane, S., Makhubela, J.: Mining campus transfer request data. In: 7th International Conference on Soft Computing and Machine Intelligence ISCMI, pp. 148–152 (2020) 41. Yang, J., Anwar, A.M.: Social media analysis on evaluating organisational performance: a railway service management context. 
In: Proceedings – 2016 IEEE 14th International Conference on Dependable, Autonomic and Secure Computing DASC (2016), IEEE 14th International Conference on Pervasive Intelligence and Computing PICom (2016), IEEE 2nd International Conference on Big Data, pp. 835–841 (2016) 42. Ayoub, A., Elgammal, A.: Utilizing twitter data for identifying and resolving runtime business process disruptions. In: Panetto, H., Debruyne, C., Proper, H.A., Ardagna, C.A., Roman, D., Meersman, R. (eds.) OTM 2018. LNCS, vol. 11229, pp. 189–206. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-02610-3_11 43. Chang, T.M., Hsu, M.F., Hu, G.H., Lin, K.P.: Salient corporate performance forecasting based on financial and textual information. In: 2016 IEEE International Conference on Systems, Man and Cybernetics SMC 2016 – Conference Proceeding, pp. 959–964 (2017) 44. Zhang, D., Xu, W., Zhu, Y., Zhang, X.: Can sentiment analysis help mimic decision-making process of loan granting? A novel credit risk evaluation approach using GMKL model. In: Proceedings of the Annual Hawaii International Conference on System Science, pp. 949–958 (2015) 45. Sai, P.K., Gupta, P., Fernandes, S.F.: Analysing performance of company through annual reports using text analytics. In: Proceeding 2019 International Conference on Digital Landscaping Artificial Intelligence ICD 2019, pp. 21–31 (2019) 46. Nakada, T., Sholihah, M., Mitake, Y., Shimomura, Y.: Toward the development of a comprehensive product-service system (PSS) evaluation method. Procedia CIRP 93, 802–807 (2020) 47. Rosa, P., Sassanelli, C., Terzi, S.: Towards circular business models: a systematic literature review on classification frameworks and archetypes. J. Clean. Prod. 236, 117696 (2019) 48. Wang, Z., Li, X., Chen, C.H., Khoo, L.P.: Evaluating smart PSS solutions with contextawareness in usage phase. Adv. Transdiscipl. Eng. 12, 333–342 (2020) 49. Wang, B., Liu, M.: Deep Learning for Aspect-based Sentiment Analysis (2015)
50. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168–177 (2004) 51. Chou, C.J., Chen, C.W., Conley, C.: An approach to assessing sustainable product-service systems. J. Clean. Prod. 86, 277–284 (2015) 52. Pan, J.N., Nguyen, H.T.N.: Achieving customer satisfaction through product-service systems. Eur. J. Oper. Res. 247(1), 179–190 (2015)
Texture Transfer Attention for Realistic Image Completion
Yejin Kim, Manri Cheon, and Junwoo Lee (corresponding author)
LG Electronics, Seoul, Korea
{yejin726.kim,manri.cheon,junwoo.lee}@lge.com
Abstract. Over the last few years, inpainting, the task of filling missing image regions, has improved significantly through the use of deep neural networks. Although recent inpainting methods create visually plausible structure, insufficient texture expression or color distortion makes the filled regions look heterogeneous. Motivated by these observations, we propose the Texture Transfer Attention network, which transfers texture patches through skip-connections and inpaints the missing region with finer detail. The network is a single refinement network with a U-Net architecture that transfers fine texture features of the encoder to coarse semantic features of the decoder through skip-connections. Texture transfer attention creates a new reassembled texture map from the fine textures and coarse semantics, so that texture information is transferred efficiently.
Keywords: Image completion · Inpainting · Texture Transfer Attentions
1 Introduction
Image inpainting is an approach to plausibly synthesize alternative contents in missing regions [2] that contain damaged or non-critical information (Fig. 1). In computer vision, image inpainting has been a challenging topic that has made considerable progress over the decades [1,4,8,18,24], and it has been applied to many tasks such as old photo restoration, image super-resolution, crop/stitching, and video inpainting. The core challenge of image inpainting is to generate high-level, semantically reasonable and visually realistic texture details for the missing regions [4,9,13,15,22,30]. Existing approaches are broadly categorized into two groups: texture synthesis approaches using low-level image features and feed-forward generative models using deep convolutional networks. The texture synthesis approach [1,5] can synthesize plausible stationary texture via level-diffusion and patch-based algorithms. The patch-based algorithm [1] iteratively searches for a similar patch in the background and pastes it into the missing region to synthesize a visually realistic result. This approach works especially well for simple compositions and complex textures such as natural scenes; however, it cannot hallucinate novel image content when the missing region has a high-level semantic context or when the background does not contain adequate patches.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 340–352, 2022. https://doi.org/10.1007/978-3-031-10464-0_22
Fig. 1. Image inpainting results generated by the proposed system built on texture transfer attention. Each triad displays the original image, the image with the damaged region masked in white, and the result of image inpainting.
To solve this problem, the feed-forward generative model approach [11,12,15,17,19,22,30] encodes the semantic context of the image into a feature space using deep neural networks and decodes the semantic features to generate semantically reasonable results. However, the results of this approach often appear blurry because details are lost through repeated convolution and pooling. To ensure high-quality inpainting, we propose a method that extracts features at multiple layers and efficiently transmits them to the decoder, which decodes them back into the result. First, we adopt the U-Net structure [25], which delivers multi-level features from the encoder to the decoder via skip-connections. Second, we propose a texture transfer attention (TTA) module to efficiently transfer texture information to the result. Conventional patch-based networks [30–32] calculate the similarity of each patch through a channel-wise Softmax and sum the patches according to their weights. However, if there are many similar patches in the background, multiple patches are summed with similar weights and the result appears blurry. To solve this problem, we search for the most similar patch and use only its index and similarity weight in the result. Our method also applies a kernel of the same size at the different resolutions of each layer, which changes the receptive field size and extracts visual information ranging from the semantic context of lower layers to the texture details of upper layers. Third, we propose a simple texture synthesis approach that avoids the complex fusion or summation that causes blur.
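The difference between a softmax-weighted sum of patches and the use of only the most similar patch can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the function names, the `scale` parameter and the toy patches are assumptions made for illustration.

```python
import math

def soft_attention(scores, patches, scale=10.0):
    """Softmax-weighted sum of patches (conventional attention).

    `scale` is a hypothetical softmax temperature, not a value from the paper.
    """
    exps = [math.exp(scale * s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patches[0])
    return [sum(w * p[k] for w, p in zip(weights, patches)) for k in range(dim)]

def hard_attention(scores, patches):
    """Copy only the single most similar patch; return it with its index and weight."""
    best = max(range(len(scores)), key=lambda j: scores[j])
    return patches[best], best, scores[best]

# Two equally similar but different background patches, plus one poor match.
scores = [0.9, 0.9, 0.1]
patches = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

blurred = soft_attention(scores, patches)               # blend of the first two patches
sharp, index, weight = hard_attention(scores, patches)  # exactly one source patch
```

With two equally similar candidates, the weighted sum lands near their average, which matches neither texture, while the hard selection returns one intact patch together with the index and weight that the later fusion step needs.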
Y. Kim et al.
The network is a single refinement network, which allows fast and accurate training while learning high-level context for realistic texture synthesis. The main pipeline, without the skip-connections, generates the contextual structure in the missing regions and synthesizes the texture components transferred through the TTA at each layer. The TTA network (TTA-Net) is optimized by minimizing a reconstruction loss, an adversarial GAN loss [6], a VGG-based perceptual loss [15] and a style loss [7]. Experiments were conducted on publicly available datasets that represent inpainting performance well, such as CelebA faces [21], CelebA-HQ faces [16], DTD textures [3], and Places2 [27].
2 Related Work
In computer vision, image inpainting has been a challenging topic and has made significant progress during the last decades [1,4,8,19,24]. Inpainting studies can be largely divided into two categories: non-learning and learning approaches. The non-learning approach is the traditional diffusion-based or patch-based approach using low-level features. The learning approach is based on deep neural networks and is being actively studied; it learns convolution layers that predict the contents and pixels of the missing regions. Traditional non-learning approaches such as [2–5] either propagate surrounding information or copy information from similar patches in the background to fill in missing regions. These methods are effective for stationary and repetitive texture, but they are limited for non-stationary data that are locally unique. Huang et al. [9] blended the known regions into target regions to minimize discontinuities. Simakov et al. [23] proposed a bidirectional patch similarity-based scheme to better model anomalous visual data for re-targeting and inpainting applications. However, the approaches of [9,23] require a dense, and therefore very expensive, computation of patch similarity. To solve this problem, Barnes et al. [1] proposed PatchMatch, a fast nearest-neighbor field algorithm using random initialization. PatchMatch showed significantly better performance before the emergence of learning-based methods and has been applied to several image editing tasks. Recently, as deep learning has been actively studied, the paradigm of image inpainting has shifted to GAN-based approaches.
The first deep neural network for inpainting, Context Encoder [24], trained deep neural networks to inpaint large holes and proposed filling the missing regions with semantic information through feature learning and an adversarial loss in a novel encoder-decoder pipeline. However, it performs poorly in generating fine-detailed textures. Iizuka et al. [11] proposed a generative model for high-resolution images using local and global discriminators and expanded the receptive field using dilated convolution. However, an additional post-processing step is required to maintain color consistency near hole boundaries. Yu et al. [30] proposed a stacked generative network that fills the pixels of missing regions with similar patches from the background to ensure color and texture consistency between the newly created areas and their surroundings. PConv [20] is designed to eliminate the mask effect through re-normalization by distinguishing the valid pixels of irregular masks. Zeng et al. [32] proposed PEN-Net, a network that effectively transmits high-level semantic information from the encoder to the decoder by using a U-Net structure with skip-connections at multiple layers.
Fig. 2. Overview of our framework with the texture transfer attention module and the feature synthesis module.
3 Approach
In this section, we introduce the proposed TTA for realistic image completion. We first give an overview of the inpainting network (Fig. 2) and then describe the details of the TTA (Fig. 3).
3.1 The Overall Pipeline
The overall pipeline of the proposed TTA-Net is illustrated in Fig. 2. The framework is a single refinement network that consists of four modules: an encoder feature extractor, a TTA with skip-connections, a texture fusion multi-decoder and a discriminator. The TTA-Net has a U-Net structure based on inpainting models of proven performance [12,20,24]; it extracts multi-layer latent features in the encoder and transmits them to the decoder through skip-connections. To improve training stability and maintain semantic performance, we propose a single refinement network that applies VGG losses. The convolutions in this framework (Fig. 2) are gated convolutions [31], which are effective at removing mask effects, whereas the discriminator and the TTA module use vanilla convolutions. The feature extractor in the encoder works at each resolution to extract a compact latent feature that is not damaged by iterative convolution and pooling. As the compact latent features encode multi-level information, the high-level semantics of the context and the low-level texture details are passed to the decoder via skip-connections. Dilated convolutions [12,29], applied several times to fill the missing hole, create a coarse semantic context feature that serves as the positional reference for the TTA. The TTA module performs synthesis multiple times to produce the result. We adopt a patch discriminator [14,31] for faster and more stable training. In addition, we use a reconstruction loss and VGG losses [7,15] that compare the ground truth with the network output.
Fig. 3. Illustration of the texture transfer attention layer. First, the context feature and the fine texture feature are unfolded into patches of the same size to calculate the similarity weights (as convolution filters). The similarity weights of all patches are compared channel-wise to find the index and weight of the most similar patch. A reassembled texture map is then generated by folding the texture features according to the index map. The texture map and weight map are sent to the feature synthesis module and synthesized with the context feature.
3.2 Texture Transfer Attention
Conventional attention methods usually use a sum weighted by similarity [14,30,31], but if there are many similar patches (ground, sand, bushes, etc.), patches with similar weights overlap, resulting in blur. To overcome the blur issue, we adopt the texture swap method [28,34,35] of reference-based SR.

Relevance Embedding. Relevance embedding aims to embed the relevance between the encoded texture features $P$ and the decoded context features $Q$ by calculating a matching score. We unfold $P$ (without missing regions) and $Q$ (missing regions) into patches of the same size, $p_j \in P$ and $q_i \in Q$ respectively. We calculate the relevance $s_{i,j}$ between these two patches by the normalized inner product (cosine similarity):

$$s_{i,j} = \left\langle \frac{q_i}{\lVert q_i \rVert}, \frac{p_j}{\lVert p_j \rVert} \right\rangle \tag{1}$$

where $s_{i,j}$ represents the similarity between the coarse semantic patch $q_i$ and the fine texture patch $p_j$. This similarity $s_{i,j}$ is used to create the reassembled texture map $T$ and to fuse $Q$ and $T$.

Feature Swapping. We propose a method of creating a reassembled texture map $T$ that rearranges the details of the fine texture $P$ into the form of the semantic feature $Q$. The patch $p_j \in P$ that is most relevant to $q_i \in Q$ is brought to the location of $q_i$, and the existing $q_i$ and the texture information of $p_j$ are synthesized. First, the similarities $s_{i,j}$ are compared channel-wise for each patch to find the patch $p_j$ most similar to $q_i$ and to generate an index map $H$ of locations:

$$h_i = \arg\max_j s_{i,j} \tag{2}$$

where $h_i$ is the index representing the location of the most relevant patch in the fine texture $P$ for the $i$-th patch of the semantic feature $Q$. By applying the index map $H$ to transfer the patches of the fine texture $P$, we generate a reassembled texture layer $T$ that can be used in decoding:

$$t_i = p_{h_i} \tag{3}$$

where $t_i$ denotes the value of $T$ at the $i$-th position, which is selected from the $h_i$-th position of $P$. In summary, the reassembled texture map $T$ is obtained by rearranging the texture feature $P$ according to the semantic shape of $Q$; it is transmitted to the decoder and reflected in the result.
3.3 Similarity Weight Texture Synthesis
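Before the fusion step, the relevance embedding and feature swapping of Sect. 3.2 (Eqs. 1–3) can be sketched on flattened patches. This is an illustrative sketch under stated assumptions, not the authors' code: patches are plain Python lists, and the names `cosine` and `texture_transfer` are hypothetical.

```python
import math

def cosine(u, v):
    """Normalized inner product of two flattened patches (Eq. 1)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def texture_transfer(Q, P):
    """Hard texture swap.

    Q: list of flattened context patches q_i (decoder side).
    P: list of flattened texture patches p_j (encoder side).
    Returns the index map H (Eq. 2) and the reassembled texture T (Eq. 3).
    """
    H, T = [], []
    for q in Q:
        scores = [cosine(q, p) for p in P]              # s_{i,j} for a fixed i
        h = max(range(len(P)), key=lambda j: scores[j])  # arg max over j
        H.append(h)
        T.append(P[h])                                   # t_i = p_{h_i}
    return H, T

# Toy data: two context patches, three candidate texture patches.
Q = [[1.0, 0.0], [0.0, 2.0]]
P = [[0.0, 1.0], [3.0, 0.1], [1.0, 1.0]]
H, T = texture_transfer(Q, P)
```

Because cosine similarity ignores magnitude, the second context patch `[0.0, 2.0]` matches `[0.0, 1.0]` perfectly even though their norms differ. In a real network the patches would come from unfolding feature maps rather than from hand-written lists.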
Fusion Ratio Map. We propose a method to synthesize the semantic feature $Q$ and the reassembled texture $T$ in the decoder. Our method reuses the similarity weights $s_{i,j}$ calculated above and compares them channel-wise to find the highest ratio $r_i$ for fusion. The ratio map $R$ represents the confidence of the reassembled texture map at each position in $T$:

$$r_i = \max_j s_{i,j} \tag{4}$$

where $r_i$ denotes the $i$-th position of the fusion ratio map $R$.

Similarity Fusion. The fusion ratio map $R$ is used to synthesize the decoder context feature $Q$ and the reassembled texture map $T$. Instead of directly applying $R$ to $T$, we first concatenate $Q$ and $T$ and apply a convolution layer. The fused features are multiplied element-wise by the normalized ratio map and added back to $Q$. This operation can be represented as:

$$F_{fus} = F + \mathrm{Conv}(\mathrm{Concat}(F, T)) \odot \frac{R}{1+R} \tag{5}$$

where $F$ is the same as the semantic feature $Q$ and $F_{fus}$ indicates the synthesized output features. Conv represents a convolution layer and Concat indicates a concatenation operation. The operator $\odot$ denotes element-wise multiplication between feature maps. If each patch were simply multiplied by a different ratio $r_i$ and added to $F$, color distortion would occur and produce a checkerboard effect. To solve this problem, we use the normalization factor $(1 + R)$. In summary, the TTA effectively transfers the relevant fine texture $P$ to the semantic context feature $Q$, making the resulting images more realistic.
3.4 Training Objective
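As a recap of the fusion step in Sect. 3.3 (Eqs. 4–5), the sketch below applies the ratio map per position. It is a sketch under assumptions: features are lists of per-position channel vectors, the learned convolution is replaced by a hypothetical stand-in `toy_conv`, and the ratios are assumed to be greater than −1 (for example, similarities of non-negative features) so that the division is defined.

```python
def fusion_ratio(S):
    """r_i = max_j s_{i,j} (Eq. 4); S holds one row of similarities per position i."""
    return [max(row) for row in S]

def similarity_fusion(F, T, R, conv):
    """F_fus = F + conv(concat(F, T)) * R / (1 + R), applied per position (Eq. 5)."""
    fused = []
    for f, t, r in zip(F, T, R):
        mixed = conv(f + t)        # list concatenation along the channel axis
        gate = r / (1.0 + r)       # normalized ratio; assumes r > -1
        fused.append([fi + gate * mi for fi, mi in zip(f, mixed)])
    return fused

def toy_conv(x):
    """Hypothetical stand-in for the learned layer: average the F half and the T half."""
    c = len(x) // 2
    return [0.5 * (x[k] + x[c + k]) for k in range(c)]

S = [[0.2, 1.0], [0.0, 0.0]]      # similarities for two positions
F = [[1.0, 2.0], [3.0, 4.0]]      # context features
T = [[3.0, 4.0], [5.0, 6.0]]      # reassembled texture
R = fusion_ratio(S)
F_fus = similarity_fusion(F, T, R, toy_conv)
```

The gate `r / (1 + r)` stays below 1 for non-negative `r`, so a position with no confident match (`r = 0`) keeps its context feature unchanged and a confident match contributes only a bounded share of the fused texture.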
The factors considered when choosing the losses are: 1) to improve the spatial structure of the inpainted missing region, 2) to create a plausible visual quality in the resulting image, and 3) to take advantage of the rich texture from the encoder feature extractor. Our objective function combines a reconstruction loss $L_{rec}$, a perceptual loss $L_{per}$, an adversarial loss $L_{adv}$ and a style loss $L_{style}$. We chose the hyperparameters $\lambda$ experimentally and ran the experiments with $\lambda_{rec} = \lambda_{adv} = 1$, $\lambda_{per} = 0.1$ and $\lambda_{style} = 100$:

$$L_{overall} = \lambda_{rec} L_{rec} + \lambda_{adv} L_{adv} + \lambda_{per} L_{per} + \lambda_{style} L_{style} \tag{6}$$

Reconstruction Loss. The reconstruction loss helps to create an approximate shape by comparing the predicted image with the ground truth. We generate a more accurate $I_{pred}$ by adopting the $l_1$-norm instead of the MSE:

$$L_{rec} = \lVert I_{gt} - I_{pred} \rVert_1 \tag{7}$$
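The weighted objective of Eq. (6) and the reconstruction term of Eq. (7) can be sketched as follows. This is a minimal sketch: the weights are those stated in the text, while averaging the $l_1$ error over pixels and the names `l1_loss`, `overall_loss` and `WEIGHTS` are assumptions made for illustration.

```python
# Loss weights as stated in the text (lambda_rec, lambda_adv, lambda_per, lambda_style).
WEIGHTS = {"rec": 1.0, "adv": 1.0, "per": 0.1, "style": 100.0}

def l1_loss(gt, pred):
    """l1 reconstruction error between flattened images (Eq. 7).

    Averaging over pixels is an assumption; the text only states the l1-norm.
    """
    return sum(abs(g - p) for g, p in zip(gt, pred)) / len(gt)

def overall_loss(terms, weights=WEIGHTS):
    """Weighted sum of the individual loss terms (Eq. 6)."""
    return sum(weights[name] * value for name, value in terms.items())

rec = l1_loss([0.0, 1.0, 2.0], [0.0, 0.0, 4.0])
total = overall_loss({"rec": rec, "adv": 0.5, "per": 2.0, "style": 0.01})
```

The large style weight compensates for the small magnitude that Gram-matrix based style terms typically have; the adversarial, perceptual and style values above are placeholders, not measured quantities.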
Fig. 4. Comparison of qualitative results with models of proven performance on the Places validation sets. Best viewed (especially texture) zoomed in. (a) Input: ground truth with mask. (b) Partial Convolution (PConv) [20]. (c) EdgeConnect (EC) [22]. (d) Gated Conv (DeepFillv2) [31]. (e) Ours.
Adversarial Loss. The adversarial loss can significantly improve the structural and visual quality of the synthesized image. We adopt the SN-PatchGAN loss [31] for more stable training and to improve semantic information and local texture. $D_{sn}$ denotes a spectral-normalized discriminator and $G$ indicates the inpainting network that receives an incomplete image $z = I_{gt} \odot (1 - M)$ as input, where $M$ denotes the mask:

$$L_G = -\mathbb{E}_{z \sim P_z(z)}\left[D_{sn}(G(z))\right] \tag{8}$$

$$L_D = \mathbb{E}_{x \sim P_{data}(x)}\left[\mathrm{ReLU}(1 - D_{sn}(x))\right] + \mathbb{E}_{z \sim P_z(z)}\left[\mathrm{ReLU}(1 + D_{sn}(G(z)))\right] \tag{9}$$
4 Experiment Results
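Before turning to the experiments, the hinge objective of Eqs. (8)–(9) and the masked input can be sketched on plain score lists. This is a sketch under assumptions: discriminator outputs are given as lists of per-patch scores, the mask is taken to be 1 inside the missing region, and all function names are hypothetical.

```python
def relu(v):
    return max(0.0, v)

def mean(xs):
    return sum(xs) / len(xs)

def mask_input(image, mask):
    """z = I_gt * (1 - M): zero out pixels where the mask is 1 (assumed hole convention)."""
    return [p * (1 - m) for p, m in zip(image, mask)]

def generator_loss(fake_scores):
    """L_G = -E[D(G(z))] (Eq. 8)."""
    return -mean(fake_scores)

def discriminator_loss(real_scores, fake_scores):
    """L_D = E[ReLU(1 - D(x))] + E[ReLU(1 + D(G(z)))] (Eq. 9)."""
    real_term = mean([relu(1.0 - s) for s in real_scores])
    fake_term = mean([relu(1.0 + s) for s in fake_scores])
    return real_term + fake_term

real = [2.0, 0.5]     # discriminator scores on real patches
fake = [-2.0, 0.0]    # discriminator scores on inpainted patches
d_loss = discriminator_loss(real, fake)
g_loss = generator_loss(fake)
```

The hinge form stops penalizing the discriminator once a real patch scores above 1 or a generated patch scores below −1, which is one reason it is often paired with spectral normalization for stable training.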
The proposed system was evaluated on Places2 [27], CelebA-HQ faces [17] and DTD textures [3], both quantitatively and qualitatively. Our model was trained on an NVIDIA RTX GPU at a resolution of 256 × 256 with a batch size of 16. In the training phase, the CelebA-HQ and DTD images were downsampled from their original size to 256 × 256, and Places2 images were randomly cropped for texture learning. The model was trained with PyTorch v3.6, CUDNN v7.0, CUDA v10.0. It took about 0.15 s to test a 512 × 512 image, regardless of hole size, and no additional pre- or post-processing is required.
Fig. 5. Comparison of qualitative results with models on CelebA-HQ. (a) Input: ground truth with mask. (b) EdgeConnect (EC). (c) Gated Conv (GC). (d) Ours. (e) Ground truth (GT).
4.1 Quantitative Results
Like other image generation tasks, image inpainting lacks good quantitative evaluation metrics. Even if the inpainted region is not the same as the ground truth, it is acceptable to the user if it has a plausible structure and texture. We therefore measured the Fréchet Inception Distance (FID) [10] and the Learned Perceptual Image Patch Similarity (LPIPS) [33], which indicate how similar a result appears to the ground truth to a human observer. The results listed in Table 1 show the performance of our model and the baselines on 512 × 512 validation images of Places2 with free-form random masks. We report our evaluation in terms of multi-scale structural similarity (MS-SSIM) [26], FID and LPIPS. MS-SSIM extracts and evaluates the similarity of structural information from paired images at multiple scales, providing results that approximate human visual perception. FID measures the distance between feature vectors of the ground truth and of the generated images. LPIPS is an image patch similarity that approximates human perceptual similarity judgments.
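For reference, FID is commonly defined as the Fréchet distance between two Gaussians fitted to Inception features of the real and the generated images. The symbols below are notation introduced here for illustration and do not appear in the paper: $\mu_r, \Sigma_r$ are the mean and covariance of the features of ground-truth images, and $\mu_g, \Sigma_g$ those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

A lower value means that the two feature distributions are closer, which is why FID is marked as "lower is better" in Table 1.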
Fig. 6. Example results generated by the proposed network on the DTD dataset. Left: ground truth; middle: masked images; right: completed images.

Table 1. Quantitative comparison on validation images of Places2 with MS-SSIM, FID and LPIPS. We use free-form masks where the missing region is 10%–40% of the mask size. † Lower is better. ¶ Higher is better.

Method        MS-SSIM¶   FID†     LPIPS†
DeepFillv1    80.26      110.9    0.1195
DeepFillv2    85.58      48.62    0.0687
EdgeConnect   86.72      44.96    0.0609
Ours          85.3       43.52    0.0595
As shown in Table 1, the MS-SSIM of our proposed method is similar to that of the other models, while its FID and LPIPS are better. This means that our model has high-level semantic inpainting performance similar to the other models but better texture expression.
4.2 Qualitative Results
We compared the proposed model with previous state-of-the-art approaches [20,22,30,31]. To compare texture generation performance, we performed a qualitative comparison on challenging images from the Places2 dataset with complex and irregular textures at their original size. In Fig. 4, the entire image and an enlarged image patch are displayed together to compare the semantic context and texture of the image. Figures 5 and 6 show the results of our model on the CelebA and DTD texture datasets. We use custom user masks and free-form random masks to compare various cases.
As shown in Fig. 4, all models succeed in making the missing regions plausible, but their results differ. The result of EC [22] repeatedly uses similar texture patches around the missing region without taking the semantic context into account, which gives it a heterogeneous feel. The result of GC [31] has a semantic structure, but the texture looks blurry compared to the surroundings because it accumulates several similar patches. In contrast, the result of our model has a composition that considers the semantic context of the surrounding area, and it is photorealistic because the generated texture is similar to that of the background. In addition, as shown in Fig. 5 and Fig. 6, our model works well on human faces and on complex and repetitive textures.
5 Conclusion
In this paper, we proposed a novel image inpainting system based on an end-to-end U-Net generative network with a TTA, which efficiently transfers fine texture from the encoder to the decoder. We showed that the TTA module significantly improves the texture representation while preserving the semantic context of the result. In addition, we proposed a simple and effective synthesis module to reflect fine texture in the results. Quantitative results and qualitative comparisons demonstrated the superiority of our proposed TTA-Net. As future work, we plan to extend the proposed network to higher-resolution images and to modify it to work well with videos.
References 1. Barnes, C., Shechtman, E., Finkelstein, A., Goldman, D,B.: Patchmatch: a randomized correspondence algorithm for structural image editing. ACM Trans. Graphics (TOG) 28(3), 24 (2009), Proceedings of SIGGRAPH 2009 2. Bertalmio, M., Sapiro, G., Caselles, V., Ballester, C.: Image inpainting. In: Proceedings of the 27th Annual Conference On Computer Graphics And Interactive Techniques, pp. 417-424 (2000) 3. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606-3614 (2014) 4. Criminisi, A., Perez, P., Toyama, K.: Region filling and object removal by exemplarbased image inpainting. IEEE Trans. Image Process. 13(9), 1200–1212 (2004) 5. Efros, A., Freeman, W.E.: Image quilting for texture synthesis and transfer. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2001, pp. 341-346. Association for Computing Machinery (2001) 6. Efros, A.A., Leung, T.K.: Texture synthesis by non-parametric sampling. In: Proceedings of the International Conference on Computer Vision, ICCV 1999, vol. 2, p. 1033. IEEE Computer Society (1999)
7. Gatys, L.A., Ecker, A.S., Bethge, M.: Image style transfer using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2414-2423 (2016) 8. Goodfellow, I.J., et al.: Generative adversarial nets. In: Advances in neural information processing systems, pp. 2672-2680 (2014) 9. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Trans. Graphics (TOG) 26(3) (2007) 10. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: NeurIPS, pp. 6626-6637 (2017) 11. Huang, J.-B., Kang, S.B., Ahuja, N., Kopf, J.: Image completion using planar structure guidance. ACM Trans. Graphics (TOG) 33(4), 129 (2014) 12. Iizuka, S., Simo-Serra, E., Ishikawa, H.: Globally and locally consistent image completion. ACM Trans. Graphics (TOG) 36(4), 1-14 (2017) 13. Irani, M., Shechtman, E., Simakov, D., Caspi, Y.: Summarizing visual data using bidirectional similarity. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 1–8. IEEE Computer Society (2008) 14. Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition, pp. 1125-1134 (2018) 15. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015) 16. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision (2016) 17. Karras, T., Aila, I., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (2018) 18. 
K¨ ohler, R., Schuler, C., Sch¨ olkopf, B., Harmeling, S.: Mask-specific inpainting with deep neural networks. In: Jiang, X., Hornegger, J., Koch, R. (eds.) GCPR 2014. LNCS, vol. 8753, pp. 523–534. Springer, Cham (2014). https://doi.org/10.1007/ 978-3-319-11752-2 43 19. Levin, A., Zomet, A., Peleg, S., Weiss, Y.: Seamless image stitching in the gradient domain. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3024, pp. 377–389. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-24673-2 31 20. Liu, G., Reda, F.A., Shih, K.J., Wang, T.-C., Tao, A., Catanzaro, B.: Image inpainting for irregular holes using partial convolutions. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 89–105. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6 6 21. Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision, ICCV (2015) 22. Nazeri, K., Ng, E., Joseph, T., Qureshi, F., Ebrahimi, M.: Edgeconnect: structure guided image inpainting using edge prediction. In: The IEEE International Conference on Computer Vision, ICCV Workshops, October 2019 23. Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3d view synthesis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, July 2017 24. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.: Context encoders: feature learning by inpainting. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2536–2544, June 2016
352
Y. Kim et al.
25. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4 28 26. Wang, Z., Simoncelli, E.P., Bovik., A.C.: Multiscale structural similarity for image quality assessment. In: ACSSC, vol. 2, pp. 1398-1402 (2003) 27. Yang, C., Lu, X., Lin, Z., Shechtman, E., Wang, O., Li, H.: High-resolution image inpainting using multi-scale neural patch synthesis. In: CVPR, pp. 4076–4084 (2017) 28. Yang, F., Yang, H., Fu, J., Lu, H., Guo, B.: Learning texture transformer network for image super-resolution. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 5791–5798 (2019) 29. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: ICLR (2016) 30. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Generative image inpainting with contextual attention. In: CVPR, pp. 5505-5514 (2018) 31. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Image style transfer using convolutional neural networks. In: Free-Form Image Inpainting With Gated Convolution (2018) 32. Zeng, Y., Fu, J., Chao, H., Guo, B.: Learning pyramid-context encoder network for high-quality image inpainting. In: The IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1486–1494 (2019) 33. Zhang, R., Isola, P., Efros, A.A., Shecht-man, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018) 34. Zhang, Z., Wang, Z., Lin, Z., Qi, H.: Image super-resolution by neural texture transfer. In: CVPR, pp. 7982-7991 (2019) 35. Zheng, H., Ji, M., Wang, H., Liu, Y., Fang, L.: CrossNet: an end-to-end referencebased super resolution network using cross-scale warping. 
In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 87–104. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01231-1 6
Examining Correlation Between Trust and Transparency with Explainable Artificial Intelligence
Arnav Kartikeya
University of California, Santa Cruz, 1156 High St, Santa Cruz, Cupertino, CA 95064, USA [email protected]
Abstract. Trust between humans and artificial intelligence (AI) is an issue with implications in many fields of human-computer interaction. A current problem with AI is the lack of transparency into its decision making, and the literature shows that increasing transparency increases trust. Explainable artificial intelligence (XAI) can increase the transparency of AI, which could in turn increase human trust. This paper uses the task of predicting Yelp review star ratings, with assistance from either an explainable or a non-explainable AI, to test whether trust increases with transparency. Results show that, for this task, explainable AI provided a significant increase in trust when trust is measured as influence.
Keywords: Explainable artificial intelligence · XAI · Trust · Transparency · Artificial intelligence
1 Introduction
Trust in automation, primarily artificial intelligence, is an important issue in the field of human-computer interaction. Prior literature has shown that trust increases as transparency increases [1]. Artificial intelligence (AI) has the well-known problem of not justifying its decisions, which reduces transparency between a user and the AI. Explainable artificial intelligence (XAI) provides increased transparency to the user, which has the potential to increase trust. The XAI used in this paper is LIME, an algorithm introduced in 2016 that provides model-agnostic explanations of model decisions [4]. This paper attempts to find the correlation between XAI, increased transparency, and trust in specific tasks.
A common task for artificial intelligence is to suggest or aid a human in a task. Recommendation algorithms used by Netflix and systems that aid pilots in spatially disorienting situations are prime examples. Literature examining the relation between trust and transparency in these tasks exists, as does literature on the impact of XAI on human interaction [5]. Our work makes the following contributions: it examines trust through the lens of XAI's influence on human decision making, and it uses quantitative metrics as opposed to subjective opinion-based surveys. Influence in this case is defined as how often a user changes their decision after receiving additional aid from an artificial intelligence model. Rather than relying on opinion-based surveys, this paper examines how the model's decision affects human decisions, which shows how trust and transparency correlate in realistic scenarios. I also examine how well opinion-based surveys such as the Trust in Automation Questionnaire represent the actual influence of a model in a realistic scenario. I hypothesize the following:
1. the LIME XAI model will influence human decisions significantly more than a normal artificial intelligence;
2. trust measured through the Trust in Automation opinion-based survey [2] will correspond with the actual influence of the artificial intelligence.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 353–358, 2022. https://doi.org/10.1007/978-3-031-10464-0_23
2 Experimental Design
To examine experimentally the relation between trust and transparency in the task described previously, two separate surveys were given to two separate groups. Both surveys asked the respondent to predict the star rating an individual gave to a restaurant, given only the text of the review and a machine learning model's output. The surveys differ only in the amount of information given by the model; the review texts and the questions asked were identical. The following sections elaborate on the similarities and differences. The machine learning model was the same in both cases: a pre-trained Facebook fastText model [3], trained on the same official Yelp dataset.
2.1 Similarities Between Surveys
Both surveys follow the same two-question sequence, repeated 15 times for a total of 30 questions. The first question in the sequence asks the respondent to predict, on a scale of 1 to 5 stars, the rating that the Yelp review associated with the question gave the restaurant. The second question presents a model output in the form of an image, which is either the class label of the Yelp review (in the form of label X.0, where X is the number of stars, 1 through 5) or the explanation shown in Fig. 1. Which of the two is shown depends on the survey. After all 15 two-question sequences have been answered, the respondent answers the Trust in Automation Questionnaire on a 1 to 7 Likert scale.
2.2 Differences Between Surveys
The previous section mentions that either the class label or Fig. 1 is used as the model's output, depending on the survey being answered. The first survey, known as the basic survey, offers only the class label, which provides no transparency into the model's decision making. The second survey, known as the advanced survey, uses Fig. 1 as its output. From left to right, the first section of the figure shows the confidence the model had in its classification for each star rating; in this example, the model had a confidence of 0.03 for 5 stars. The second section shows the top features, or words, that influenced the model's decision for the label it chose and for the ones it did not. A higher number represents a larger influence on the decision. The third section repeats the text of the Yelp review and highlights the words listed in the second section. A darker highlight corresponds to a higher influence. This figure gives greater transparency into the model's decision than the class label does. The explanation in Fig. 1 was created by LIME. In both surveys, a pre-trained fastText natural language processing model was further trained on the official Yelp dataset to predict the star rating of a review.
Fig. 1. Explanation provided for second survey. (Left) confidence in each class prediction, (Middle) top most influential words in decision making, (Right) top influential words, highlighted within entire text
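To make the idea of per-word influence in the middle panel of Fig. 1 more concrete, here is a minimal sketch. It is not LIME itself: LIME fits a local surrogate model over many randomly perturbed copies of the text, whereas this sketch simply removes one word at a time and records how a score changes. The scorer `toy_five_star_score` and its word lists are hypothetical stand-ins for the fastText model's confidence in one class.

```python
from typing import Callable, List, Tuple


def word_influence(text: str, score: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Score drop caused by removing each word; larger magnitude = more influence."""
    words = text.split()
    base = score(text)
    drops = []
    for i, word in enumerate(words):
        # Re-score the text with the i-th word left out.
        reduced = " ".join(words[:i] + words[i + 1:])
        drops.append((word, base - score(reduced)))
    # Most influential words first (stable sort keeps text order for ties).
    return sorted(drops, key=lambda item: abs(item[1]), reverse=True)


def toy_five_star_score(text: str) -> float:
    """Hypothetical scorer: positive words raise the score, negative words lower it."""
    positive = {"great", "delicious", "friendly"}
    negative = {"slow", "cold", "rude"}
    tokens = text.lower().split()
    return float(sum(t in positive for t in tokens) - sum(t in negative for t in tokens))


ranking = word_influence("great food but slow service", toy_five_star_score)
```

In this toy setting, words with a positive value pushed the score up and words with a negative value pushed it down, which loosely mirrors the signed word weights displayed in the explanation.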
2.3 Distribution of Survey
Both surveys were distributed as separate tasks on Amazon Mechanical Turk. Each survey had 25 participants, with no additional selection criteria. The two sets of participants were not the same individuals, and all participants were anonymous.
3 Evaluation of Survey Results
To quantify the 50 respondents' answers and understand the relation between trust and transparency, the data was split into multiple groups and compared between the two surveys using a custom metric.
3.1 Splitting the Data
The data was first split by survey: the first group contained answers from the basic survey, and the second those from the advanced survey. Within each group, further subgroups were created according to the model's correctness. In both surveys, the model's prediction was correct 6 times, incorrect by 1 or 2 stars 6 times, and incorrect by more than 3 stars 3 times. I compared these subgroups between the surveys to ensure that confidence was not the leading factor in decision making, and to determine how much transparency can influence a user's decisions. A difference between the surveys in all three subgroups would indicate that transparency increases trust in all cases. If there were no difference in the less correct subgroups, it would indicate that confidence has a greater influence on trust and that transparency cannot overcome a lack of confidence in the model.
3.2 Comparing the Data
To compare the data between the groups, a custom metric was created: the model influence metric. Its purpose is to show how the model changed the user's decision. Simply checking whether the user changed their answer after being presented with the model's decision would not suffice, because the respondent may have moved their answer further away from the model's decision after simply rethinking it. A metric was therefore needed that checks whether the respondent changed their answer to be closer to the model's. Such a change shows that the respondent used the model's output in their decision making, which indicates trust between the respondent and the model. If the respondent's answer was closer to the model's output by the second question, the respondent's score was incremented by 1. For example, if a respondent predicted 2 stars for a Yelp review, the model predicted 4, and the respondent then changed their answer to 3, their score would be incremented by 1. This was done for each of the subgroups based on correctness, and the results were compared between the basic and advanced sets of data as described previously. One case that presented an issue was when the respondent gave the same answer as the model before seeing the model's output. If they guessed 3 stars before looking at the model, and the model also predicted 3 stars, then no trust is being measured if the respondent keeps their answer. To handle this, I omitted those answers from the data and kept the metric as a percentage value. This excludes those data points while still giving a metric that can be compared over data sets of varying size. The following section discusses the results. Equation 1 shows the formula for the metric used.
\[
\text{Model influence} = \frac{100\%}{25} \sum_{r=1}^{25} \left( \frac{\#\,\text{influenced answers}_r}{\#\,\text{non-omitted answers}_r} \right) \tag{1}
\]
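The metric described above can be sketched in a few lines. This is an illustrative reading of Eq. 1, not the author's code: the tuple layout `(first guess, model prediction, second guess)` and the function names are assumptions, and the percentage is averaged over the respondents who have at least one non-omitted answer rather than over a fixed 25.

```python
from typing import List, Optional, Tuple

# One answer pair: (first guess, model prediction, second guess), in stars 1..5.
Answer = Tuple[int, int, int]


def respondent_influence(answers: List[Answer]) -> Optional[float]:
    """Fraction of non-omitted answers that moved closer to the model's output."""
    influenced = 0
    counted = 0
    for first, model, second in answers:
        if first == model:
            # Respondent already agreed with the model: omitted, no trust measured.
            continue
        counted += 1
        if abs(second - model) < abs(first - model):
            influenced += 1
    # Undefined when every answer was omitted.
    return influenced / counted if counted else None


def model_influence(group: List[List[Answer]]) -> float:
    """Mean per-respondent influence, expressed as a percentage."""
    scores = [s for s in (respondent_influence(a) for a in group) if s is not None]
    return 100.0 * sum(scores) / len(scores)


# The worked example from the text: guess 2, model says 4, revised guess 3.
example = respondent_influence([(2, 4, 3), (3, 3, 3), (2, 4, 1)])
```

Here the first answer counts as influenced, the second is omitted because the respondent already matched the model, and the third moved away from the model, so the respondent's fraction is one out of two.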
4 Results
4.1 Model Influence Metric Results
After all 50 survey responses were processed using the metric and groups described in the previous section, the results were compiled and are displayed in Fig. 1 and Table 1. For each of the correct, slightly incorrect, and totally incorrect groups, the metric showed a statistically significant difference between the basic and advanced surveys. A one-tailed independent-means t-test was used to find the p-values listed in the table. In all cases the p-value is less than 0.05. Table 1 shows the individual means of all 6 groups compared.
4.2 Trust Evaluation Results
Questions taken from the Trust in Automation questionnaire were placed at the end of both surveys and analyzed to see whether there is any contrast between the subjective trust measured by the questionnaire and the metric used in this paper. The questionnaire responses of each group were averaged. The advanced survey had an average of 4.50 and the basic survey an average of 4.71 (p-value = 0.14, one-tailed independent-means t-test). The results show no statistically significant difference between the advanced and basic surveys, which does not align with the difference shown in the previous section by the model influence metric.

Table 1. Difference between basic and advanced survey in means, p-value calculated by one-tailed t-test

Groups               Basic survey   Advanced survey   p-value
Correct              33.27%         59.54%            .003
Slightly incorrect   33.20%         55.67%            .009
Totally incorrect    50.00%         65.34%            .038

4.3 Discussion and Conclusion
The purpose of this study was to verify experimentally whether transparency increases trust in AI-aided tasks similar to the one described. Results from the experiment show a statistically significant difference between the sets of data, and therefore do not reject hypothesis 1. This indicates that trust between the respondent and the model increases with transparency, at least when trust is measured as the model's influence on the respondent. The data also suggests that additional insight into the model's decision making increases trust even when the model is totally incorrect in its results, as shown by comparing the “totally incorrect” groups of the transparent and non-transparent models. The second part of the experiment concerned the Trust in Automation questionnaire and whether its results align with those from the surveys. The confidence and dependence questions discussed in previous sections show no statistically significant difference between the transparent and non-transparent models, which is not reflected by the metric used in this paper. This shows that, when model influence is measured through change in decision making after model input, the Trust in Automation questionnaire does not align with the measure used. Subjective trust measured through the Trust in Automation Questionnaire therefore contrasts with trust measured through the metric used in this paper in experimental scenarios. The conclusions of this paper hold for the specific experiment and metric used. The paper serves as a starting point for future research verifying these hypotheses with different experiments and metrics, as well as with larger sets of data.
References
1. Hoff, K.A., Bashir, M.: Trust in automation: integrating empirical evidence on factors that influence trust. Hum. Factors J. Hum. Factors Ergon. Soc. 57(3), 407–434 (2015)
2. Jian, J.-Y., Bisantz, A.M., Drury, C.G.: Foundations for an empirically determined scale of trust in automated systems. Int. J. Cogn. Ergon. 4(1), 53–71 (2000)
3. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)
4. Ribeiro, M.T., Singh, S., Guestrin, C.: “Why should I trust you?”: explaining the predictions of any classifier. CoRR, abs/1602.04938 (2016)
5. Weitz, K., Schiller, D., Schlagowski, R., Huber, T., André, E.: “Do you trust me?”: increasing user-trust by integrating virtual agents in explainable AI interaction design. In: Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents, IVA 2019, July 2019, pp. 1–3 (2019)
Utilizing AI in Test Automation to Perform Functional Testing on Web Application
Dalia Alamleh
The University of Nottingham, Nottingham NG7 2RD, UK [email protected]
Abstract. Artificial Intelligence is the current trend in software development. Unfortunately, Artificial Intelligence algorithms and technologies are still underused in software testing. Designing test automation has become the main job of quality engineers and software testers, mainly because test automation reduces manual testing effort. Utilizing AI in test automation can bring large benefits for code optimization and for the test oracle problem. The primary objective of the research was to demonstrate the usability of the Fuzzy Inference System in providing a test oracle for web application functional testing. The secondary objective was to apply Artificial Intelligence techniques such as self-healing to the test automation using web scraping, and to compare the web scraping approach with the image processing approach for dynamically locating web elements on websites. I addressed the problem by developing test automation that verifies the search functionality of a given website. The hypothesis is that the Fuzzy Inference System can predict whether the search functionality of a given website is working. I tested the hypothesis on ten different websites. After analysing the results, I found that implementing the Fuzzy Inference System in test automation can form a reasonable solution to the test oracle problem. Furthermore, using the Fuzzy Inference System is as efficient as a manually prepared test oracle that covers all possible input cases using if-else statements. Finally, I demonstrated how web scraping can be used to perform self-healing for the test automation.
Keywords: Test Automation · Artificial Intelligence · Fuzzy Inference System · Test Oracle
1 Introduction
AI is a phenomenon that nearly everyone has heard of recently, and it is now entering the software testing field as well. Utilizing AI to solve testing problems is a new trend in the software quality assurance industry [14]. Software testing plays a critical role in the software development cycle. Software quality assurance (QA) puts a lot of effort into accelerating the testing process, for example by implementing test automation [4]. The use of Artificial Intelligence in test automation has grown in recent years [14]; the main targets are to minimize the QA time and effort spent writing or modifying test automation, and to overcome problems facing test automation such as the test oracle problem [9].
In this project, I aim to design test automation for a web application that applies functional testing to webpage elements by tracking the HTTP requests generated while interacting with them [5]. For verification, I use Fuzzy Logic [10] to predict the result of the test automation execution. The project's main goal is to speed up the QA process and to provide a solution to the test oracle problem using AI. The project and the conducted research aim to answer the following questions. Can the designed test automation validate test results intelligently, overcoming the test oracle problem by using Fuzzy Logic? How can the Fuzzy Inference System be used in test automation for web functional testing, and how efficient is such an implementation compared with the older approach of preparing a static condition for every possible case using if-else statements? What is the best AI mechanism for executing a test scenario on a webpage so that the test automation is intelligent and dynamic, e.g., one test automation script for testing search functionality that works on different websites? And how feasible is self-healing for the test automation using web scraping?
Finally, my project aims to evaluate the approach of testing the user interface (UI) of a webpage by analysing the HTTP requests generated while interacting with the webpage elements. That is, the test automation tests the server-side functionality of web applications through the client side by applying gray-box testing. This approach should help in locating the failing layer, e.g., whether the bug is on the client side or on the server side. To answer the above questions, the primary objective of the project is to develop test automation that utilizes AI. The test automation performs a search functional test on web applications and evaluates the results using the Fuzzy Inference System.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 359–377, 2022. https://doi.org/10.1007/978-3-031-10464-0_24
2 Related Work
Several studies have used Fuzzy Logic to evaluate website quality. In [13], the authors implemented a Fuzzy Inference System to evaluate website design quality based on Fuzzy-DEMATEL. In [11], the authors used Neuro-Fuzzy Logic based on nine web measures, such as performance and the number of elements on each webpage, to evaluate web page quality. The work in [12] proposed a fuzzy classifier to evaluate the vulnerability of web applications to well-known attacks such as XSS and SQL injection. In [8], the authors evaluated website quality using Fuzzy Logic and included the correctness of the webpage element “link” as one of the inputs to the Fuzzy Inference System. They used the Xenu tool to check link correctness, computed the rate of correct links, and then used Fuzzy Logic to evaluate the quality of the website based on this rate and other factors such as website performance. The authors found that Fuzzy Logic is an excellent way to evaluate the quality of websites. In general, all of this literature found Fuzzy Logic to be feasible for predicting the quality of a website.
The authors of [6] presented a novel approach to providing a test oracle for software using deep learning and fuzzy inference systems. The proposed approach applies only to software with numeric output. The authors used the Fuzzy Inference System as a first layer that maps inputs into a fuzzy space; this layer is mainly used to train the deep learning network layer. The second layer is the deep learning network, which processes the data provided by the Fuzzy Inference System and tries to find a pattern. The output of the deep learning network is the test oracle. The authors tested their approach on different software and confirmed its validity. In my project, I implement a Fuzzy Inference System for a different purpose: evaluating the search feature of a given website. I process the search HTTP request and response content using the Fuzzy Inference System to determine whether the generated content is valid. The novelty of my approach is that I use the output surface of the Fuzzy Inference System to form a test oracle for the system inputs based on certain rules. In conclusion, according to the literature, AI has been applied in different areas of software testing for web applications, and solving the test oracle problem is one of the main ones. My project's target is to develop test automation that tests the server side of web applications through the client side and provides a test oracle at the server-side level using Fuzzy Logic. This approach aims to save software testers time and effort. Furthermore, it provides a solution to the test oracle problem by utilizing AI in test automation through the Fuzzy Inference System. For the secondary objective, I aim to use AI techniques so that the test automation is intelligent and can test different websites without supervision.
3 Methodology
This work attempts to check whether the search functionality of a website is working. That is, when the end user fills in the search field and clicks the search button on the web page, the website should function properly and retrieve the search results. To evaluate whether the search functionality is working, I use Fuzzy Logic, specifically a Fuzzy Inference System that provides a test oracle for the test automation. A further aim is to show that a Fuzzy Logic test oracle can be a feasible replacement for other test oracle methods. Fuzzy Logic was chosen to take advantage of the ability of fuzzy set theory to classify the inputs in order to predict the result of the test automation. The proposed test automation system architecture, shown in Fig. 1, consists of two main layers.
The first layer is Data Generation and Collection. In this layer, I execute the test scenario script in the web browser. The script locates the search field and the search button, fills the search field with a search keyword, and clicks the search button. There are two methods for data generation: one uses image processing and the other uses a web scraper agent. The data collected is the HTTP request and response for the search endpoint, generated by clicking the search button. Test case validation is done at the server-side level by processing this HTTP request and response, which are generated by interacting with the webpage element on the client side (web browser). Therefore, in the first layer I develop a function that extracts the search endpoint request from a group of HTTP requests.
Fig. 1. Test automation system architecture
The second layer is the Data Evaluation Layer, for which I have two approaches. The first is the Fuzzy Inference System approach, which provides a test oracle for the test automation. It predicts whether the search functionality of the website passed or failed based on the fuzzy inference rules. The other evaluation approach is to prepare all the different possible cases; if the constraints for one case are met, the test case passes, otherwise it fails. I call this approach “All possible cases using if-else”. The overall structure of the test automation should be flexible enough to work on different websites. My test automation targets 10 different websites and should verify the search functionality of each of them. I aim to design the test automation so that it is not supervised or personalized; in other words, it should be able to test each of the websites listed in Table 1.
3.1 Data Generation and Collection
The first layer executes the test automation for search functionality on different websites. The test case scenario for the search functionality is filling the search field with a search keyword and then clicking the search button. The expected result of the test case is to retrieve the search results for the search keyword.
Table 1. List of websites that will be tested using the test automation

URL
https://www.google.com
https://www.amazon.com
https://old.bau.edu.jo/
https://www.reddit.com/
https://www.bbc.co.uk/search
https://uk.yahoo.com/?p=us
https://www.nottingham.ac.uk/search.aspx
https://www.ox.ac.uk/
https://www.leeds.ac.uk/
https://stackoverflow.com/
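The extraction of the search endpoint from the captured traffic, which the first layer relies on, could look roughly like the sketch below. The text does not state how the endpoint is recognised, so the rule used here (the first request whose query string contains the search keyword) is an assumption, as are the `Captured` record type and the function name `find_search_endpoint`. The heuristic only covers searches sent as URL query parameters.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import parse_qs, urlparse


@dataclass
class Captured:
    """Hypothetical record of one captured request/response pair."""
    url: str
    status: int
    body: str


def find_search_endpoint(records: Iterable[Captured], keyword: str) -> Optional[Captured]:
    """Return the first captured request whose query string carries the keyword."""
    wanted = keyword.lower()
    for rec in records:
        # parse_qs decodes the query string into {parameter: [values]}.
        params = parse_qs(urlparse(rec.url).query)
        values = [v.lower() for vals in params.values() for v in vals]
        if any(wanted in v for v in values):
            return rec
    return None  # no request looked like a search call


traffic = [
    Captured("https://example.com/static/app.js", 200, ""),
    Captured("https://example.com/search?q=fuzzy+logic", 200, "<html>fuzzy</html>"),
]
endpoint = find_search_endpoint(traffic, "fuzzy")
```

In a real run, records of this kind would presumably be built from the requests recorded by the browser automation library rather than written by hand.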
First, the test script explores the given URL to extract the data needed for the test scenario. The test automation must locate the search field and search button web elements. I follow two different AI mechanisms for finding the location of the search field and the search button on different websites: (1) using image processing to detect the search field and the search button, explained in more detail below; (2) using a web scraper to dynamically identify the XPath of the search field and search button. The second step is data collection, which collects all the HTTP/HTTPS requests and responses generated by performing the search in the automation web browser. To collect the HTTP requests, the system uses the Selenium-Wire library, which extends Selenium. The system then processes and filters all the HTTP requests generated by the browser to extract the search endpoint.
3.2 Data Evaluation Using Fuzzy Inference System
In this layer of the system, the test automation evaluates the collected data using the Fuzzy Inference System. The task of the Fuzzy Inference System is to predict whether the search endpoint web-service call generated by the automated search test on a given website is valid. I add rules to the Fuzzy Inference System to classify the output based on the inputs. The Fuzzy Inference System has the following two inputs. The first input is the status code returned for the search web-service HTTP request, divided by 100. The division standardises the input ranges so that both inputs are on the same scale, instead of the first input being in the hundreds and the other ranging from 1 to 5. The second input is the rate of existence of the search keyword in the search endpoint response. The system has a function for calculating this rate. The function takes the body of the search endpoint response as input and uses the Natural Language Toolkit (NLTK) library to record the number of times each word occurs in a document, using the “frequency distributions” method. Formally, a frequency distribution can be defined as a function mapping from each sample to the number of times that sample occurred as an outcome [1]. In this test automation, the target word is the search keyword and the document is the HTML body of the response. Frequency distributions are generally constructed by running a number of experiments and incrementing the count for a sample every time it is an outcome of an experiment (Bird, 2006). The function produces a frequency distribution that encodes how often each word occurs in a text. The system looks only at the search keyword in the text:
1. If the keyword is repeated more than 10 times in the HTTP response body of the search endpoint, the search keyword rate is high. I assign the value 5 as the input to the Fuzzy Inference System, which indicates a high rate for the search keyword in the search endpoint.
2. If the keyword is repeated around 5 times in the HTTP response body of the search endpoint, the search keyword rate is medium. I assign the value 3 as the input to the Fuzzy Inference System, which indicates a medium rate for the search keyword in the search endpoint.
3. Finally, if the keyword is repeated fewer than 2 times, this indicates a low frequency rate for the search keyword in the HTTP response body of the search endpoint. I assign the value 1 as the input to the Fuzzy Inference System, which indicates a low rate for the search keyword in the search endpoint.
The Fuzzy Inference System thus has two inputs. For example, if the inputs are (2, 5), the status code of the search endpoint request is 200 and the search keyword rate in the body of the response is high. Such a case indicates that the search functionality passed. The rules of the Fuzzy Inference System are designed to be simple and clear.
If (status code is 2xx) and (Search keyword rate existence is medium) then (Result is Passed)
If (status code is 4xx) then (Result is Failed)
If (status code is 5xx) then (Result is Failed)
If (Search keyword rate existence is Low) then (Result is Failed)
If (status code is 3xx) and (Search keyword rate existence is medium) then (Result is Maybe)
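As an illustration, the keyword rating and the rules above can be sketched in plain Python. This is a minimal Mamdani-style sketch, not the implementation used in this work. The triangular membership functions and their breakpoints, the treatment of every count from 2 to 10 as medium, the rule "2xx and high gives Passed" (inferred from the (2, 5) example), and the names `keyword_rate`, `tri` and `infer` are all assumptions. The output sets follow the ranges given in the text for Failed (0 to 40), Maybe (40 to 75) and Passed (75 to 100).

```python
import re
from collections import Counter
from typing import Optional


def keyword_rate(body: str, keyword: str) -> int:
    """Rate the keyword frequency in a response body as 5 (high), 3 (medium) or 1 (low)."""
    # Frequency distribution: each lowercase word mapped to its number of occurrences.
    freq = Counter(re.findall(r"\w+", body.lower()))
    count = freq[keyword.lower()]
    if count > 10:
        return 5
    if count < 2:
        return 1
    return 3  # Assumption: every count from 2 to 10 is rated medium.


def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with feet a and c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Assumed membership functions (feet and peaks are illustrative only).
STATUS = {"2xx": (1, 2, 3), "3xx": (2, 3, 4), "4xx": (3, 4, 5), "5xx": (4, 5, 6)}
RATE = {"low": (-1, 1, 3), "medium": (1, 3, 5), "high": (3, 5, 7)}
OUTPUT = {"failed": (0, 20, 40), "maybe": (40, 57.5, 75), "passed": (75, 87.5, 100)}


def infer(status_class: float, rate: float) -> Optional[float]:
    """Return the crisp output (0 to 100), or None when no rule fires."""
    s = {name: tri(status_class, *abc) for name, abc in STATUS.items()}
    r = {name: tri(rate, *abc) for name, abc in RATE.items()}
    # Rule strengths: "and" is min, several rules with the same result are joined by max.
    strength = {
        "passed": max(min(s["2xx"], r["medium"]), min(s["2xx"], r["high"])),
        "failed": max(s["4xx"], s["5xx"], r["low"]),
        "maybe": min(s["3xx"], r["medium"]),
    }
    # Centroid defuzzification over the output axis sampled in steps of 0.5.
    num = den = 0.0
    for i in range(201):
        z = i / 2
        mu = max(min(w, tri(z, *OUTPUT[label])) for label, w in strength.items())
        num += z * mu
        den += mu
    return num / den if den > 0 else None
```

With these assumed membership functions, the inputs (2, 5) fire only the Passed rule, so the crisp output is the centre of the Passed set.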
The test automation processes the inputs using the Fuzzy Inference System. The output of the fuzzy system is one of the following (Failed, Passed, Maybe):
1. “Failed”: indicates that incorrect behaviour was detected by the test automation, that is, the HTTP request and response generated when interacting with the webpage element were not as expected. Thus, the functionality of the webpage element is not working. The range of the output “Failed” is 0 to 40.
Utilizing AI in Test Automation to Perform Functional Testing
2. “Passed”: indicates that the behaviour detected by the test automation is correct and the functionality of the webpage element is working at the server-side level. The range of the output “Passed” is 75 to 100.
3. “Maybe”: the system was unable to give a decision for this case. The range of the output “Maybe” is 40 to 75.
The 3D plot of the fuzzy inference output surface represents the test oracle for the search functionality for each combination of the two inputs, as demonstrated later (see Fig. 7).
3.3 Alternative Data Evaluation Layer
Assertions are well known as one kind of test oracle, where QA engineers usually design the test automation and provide the expected result in advance. In this approach, I prepared a group of cases using if-else statements and assertions. That is, if a case meets the expected result, I add assert (True); otherwise I add assert (False). In general, an assertion evaluates a constraint that applies to some rules and computation. When the asserted value is false, the test automation exits the run and reports that an error was found [3]. In this alternative test oracle method, I implement a function that uses the standard assertion mechanism in Python. The inputs of this layer are the status code of the response for the search endpoint and the search keyword existence rate. In this layer, I defined each expected case using an if-else statement. The total number of possible input cases is 12, as shown in Table 2: the first input has four possible cases for the status code (2xx, 3xx, 4xx, 5xx) and the second input has three possible cases (high, medium, low).
Table 2. All possible cases for the HTTP search endpoint (columns give the rate of the search keyword in the HTTP response)

| Status code | Low | Medium | High |
| 2xx | Failed | Passed | Passed |
| 3xx | Failed | Passed | Passed |
| 4xx | Failed | Failed | Failed |
| 5xx | Failed | Failed | Failed |
The output of this layer is an assertion of True or False, where False means the search functionality does not work and True means it does. The main aim of this layer is to compare its results with those of the fuzzy inference system, in order to evaluate the usability of fuzzy logic in functional test automation and the quality of the fuzzy inference system's predictions. The comparison covers the two implementations (code size and performance) and, lastly, their results.
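The twelve cases of Table 2 can be written out explicitly. The following is a minimal sketch of such a function, assuming the hypothetical name `expected_result` and string labels for the keyword rate; it is not the code used in the experiments.

```python
def expected_result(status_code: int, keyword_rate: str) -> bool:
    """Return True (Passed) or False (Failed) for one of the 12 cases in Table 2."""
    if keyword_rate not in ("low", "medium", "high"):
        raise ValueError(f"unknown keyword rate: {keyword_rate!r}")
    status_class = status_code // 100  # 200 -> 2, 404 -> 4, and so on
    if status_class == 2:
        if keyword_rate == "low":
            return False
        elif keyword_rate == "medium":
            return True
        else:  # high
            return True
    elif status_class == 3:
        if keyword_rate == "low":
            return False
        elif keyword_rate == "medium":
            return True
        else:  # high
            return True
    elif status_class == 4:
        # 4xx fails whatever the keyword rate is.
        return False
    elif status_class == 5:
        # 5xx fails whatever the keyword rate is.
        return False
    raise ValueError(f"unsupported status code: {status_code}")


if __name__ == "__main__":
    # The test stops with an AssertionError as soon as a case evaluates to False.
    assert expected_result(200, "high")
```

Every branch has to be spelled out by hand, which is the maintenance cost that the comparison with the fuzzy inference system is meant to examine.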
The plan is to run two experiments on the same websites. The first run uses the Fuzzy Inference System evaluation; the second run uses the alternative evaluation. I expect both data evaluation layers to produce the same results for the same inputs. If so, the Fuzzy Inference System can be accepted as a good replacement, as a test oracle, for the approach of listing all possible cases. The motive of this comparison is not to claim that the fuzzy logic approach is better than this method, but rather to see it as a step toward reducing code and, with it, the maintenance effort.
3.4 Implementation
Data Generation using Web Scraping
The test automation system starts with a command to open the test browser using Selenium WebDriver. A browser window opens, and its top bar should display a message that the browser is being controlled by automated test software, as shown in Fig. 2.
Fig. 2. The browser is being controlled by the test automation of the selenium web driver
In this layer, the search functionality test case is executed by Selenium WebDriver, which communicates directly with the browser. Selenium WebDriver provides good support for testing the properties of page components that keep changing [7]. A test case is a group of steps that an end user would usually perform on the webpage during manual testing. Test cases can be scripted with Selenium WebDriver, which supports a wide range of programming languages. In the web scraping approach to data generation, the browser calls the URL of the targeted website as specified in the test automation. The test automation then locates the search field and the search button using web scraping. The scraper mainly looks for the search form in the page HTML, which usually has the attribute “action = search”, as shown in Fig. 3.
Fig. 3. Demonstrates the HTML for the search form with action = “/search”
Fig. 4. Demonstrates the field type = “text” for the field web element
Inside the search form, the “input” element with the HTML attribute “type = text” always represents the search field, as shown in Fig. 4. Likewise, inside the search form, the “input” element with the HTML attribute “type = submit” always represents the search button, as shown in Fig. 5.
Fig. 5. Demonstrates the button web element where value = “submit” for the button web element
The system applies web scraping to the given URL. I built a class called parser with four functions in total. The first function gets all the HTML forms at the given URL. The second function gets the details of a specific form. The third function finds the search field, saves all of its attributes, and returns the attribute name that is later used to determine the field location. The fourth function saves the attributes and values of the search button, which are later used to locate the button. The system saves the website HTML, then iterates through the forms to find the one that has the HTML attribute “action = search” and requests the details of this form. After that, the system determines the attribute name of both web elements (search field and search button) using the third and fourth functions. At this point, the parser class has finished its job of determining the attribute names of the web elements. The test automation then substitutes the value of the attribute “name” into the XPath, and Selenium WebDriver determines the location of each web element from this XPath. After locating the web page elements, that is, the search field and the search button, the system issues Selenium WebDriver commands to perform the following test case.
1. Fill the search field with the search Keyword. 2. Click on the search button.
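The scraping and XPath substitution described above can be sketched with the Python standard library alone. This is an assumption-laden illustration rather than the four-function parser class of this work: the class name `SearchFormParser`, the helper `xpath_for` and the exact XPath pattern are hypothetical. The Selenium calls are shown only as comments so that the block runs without a browser.

```python
from html.parser import HTMLParser


class SearchFormParser(HTMLParser):
    """Collect the "name" attribute of the search field and search button."""

    def __init__(self):
        super().__init__()
        self.in_search_form = False
        self.field_name = None
        self.button_name = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # The search form is the one whose action mentions "search".
            self.in_search_form = "search" in (attrs.get("action") or "").lower()
        elif tag == "input" and self.in_search_form:
            input_type = (attrs.get("type") or "").lower()
            if input_type == "text" and self.field_name is None:
                self.field_name = attrs.get("name")
            elif input_type == "submit" and self.button_name is None:
                self.button_name = attrs.get("name")

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_search_form = False


def xpath_for(name: str) -> str:
    """Substitute the scraped "name" attribute into an XPath expression."""
    return f"//input[@name='{name}']"


# Usage with Selenium WebDriver (not executed here; requires a running browser):
#   from selenium.webdriver.common.by import By
#   parser = SearchFormParser()
#   parser.feed(driver.page_source)
#   driver.find_element(By.XPATH, xpath_for(parser.field_name)).send_keys("master")
#   driver.find_element(By.XPATH, xpath_for(parser.button_name)).click()
```

Because the "name" values are read from the page on each run, nothing in the script is tied to a particular website.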
Data Generation using Image Processing
In the second approach to data generation, the test automation locates the web page elements using image processing with the Python library PyAutoGUI. First of all, I provide the test automation with images of the search field and the search button. The system then uses PyAutoGUI to find the (x, y) coordinates of the positions on the screen that match the provided images, using image processing and segmentation. Later, the test automation interacts with the web page elements through PyAutoGUI, which takes over the mouse movement on the screen and does the following:
• moves the cursor to the (x, y) position of the image matching the search field and fills it with the search keyword
• moves the cursor to the (x, y) position of the image matching the search button and clicks on it.
The first layer of the system aims to collect data. The data is mainly the search endpoint HTTP request and response generated by the actions performed in the test automation web browser. All HTTP requests and responses generated from the moment Selenium WebDriver opens the browser until the test automation clicks on the search button are collected using the Selenium Wire library.
Data Collection
The system iterates through all the HTTP requests to extract the search endpoint. Generally, the best practice for naming the search endpoint is to include the term “/search?”. However, websites sometimes use a different name. At this level, I add a function that extracts any request whose endpoint contains “/search?” and ignores the endpoints “/suggest?” and “/complete?”, because the functionality test targets the search itself, not the autocomplete functionality that suggests search keywords. Moreover, the search endpoint should carry the search keyword in the payload of the request, for example q = “search keyword”. Fig. 6 below shows a sample of a search endpoint.
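The extraction rule described above can be sketched as a small filter over request URLs, for example the URLs of the requests captured by Selenium Wire. The function name `find_search_endpoint` and the exact matching logic are assumptions made for illustration.

```python
from typing import Iterable, Optional
from urllib.parse import parse_qs, urlsplit

# Endpoints that belong to autocomplete rather than to the search itself.
IGNORED_MARKERS = ("/suggest?", "/complete?")


def find_search_endpoint(urls: Iterable[str], keyword: str) -> Optional[str]:
    """Return the first URL that looks like the search endpoint, else None."""
    for url in urls:
        if any(marker in url for marker in IGNORED_MARKERS):
            continue
        if "/search?" not in url:
            continue
        # The search keyword must appear as the value of a query parameter.
        params = parse_qs(urlsplit(url).query)
        if any(keyword in values for values in params.values()):
            return url
    return None
```

A website that names its search endpoint differently would return None here and would need an additional marker.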
Data Evaluation
After the search endpoint has been extracted in the data collection phase, the next step is to evaluate whether the search endpoint is valid. To do so, the test automation has to provide a test oracle. In this project, I aim to use AI to provide this test oracle. Thus, in this layer, I evaluate the data using two approaches: the first is the “Fuzzy Inference System” and the second is “All possible cases using if-else”. For the first approach, I built a class for the fuzzy inference system. The fuzzy inference system accepts two inputs, which are extracted from the search endpoint collected in the first layer. The fuzzy inference system has rules that classify the outputs based on the inputs.
Fig. 6. Shows a search endpoint sample where the name of the service is “/search?” and search keyword “q = hi”
The output of the fuzzy inference system is a crisp value between 0 and 100%. I check the crisp output as follows:
• If the crisp output of the fuzzy inference system is greater than 85%, the system predicts that the search functionality is working. Thus, the test case is marked as passed.
• Otherwise, if the crisp output of the fuzzy inference system is less than 80, the system predicts that the search functionality is not working. Thus, the test case is marked as failed.
In the second approach, the system has an alternative function to evaluate the inputs. The function takes the status code and the rate of the search keyword as inputs. Then, through a series of conditions, the system checks which if-statement meets the constraints so that it can identify the test result.
4 Results
This section discusses the results of the experiments and studies applied to the data generation and data evaluation levels of the test automation. First of all, the main purpose of the study conducted for the data generation level is to compare the different approaches to locating web page elements on different websites, and to perform a self-healing test case using a web scraper, that is, a test automation that still works when HTML selectors change across different websites. A further purpose is to check the feasibility of identifying the bug location by performing functional testing on the client side and then evaluating the result at the server-side level, by analysing the HTTP requests generated while interacting with the client side (website UI). Secondly, the main purpose of the data evaluation experiments is to validate the fuzzy inference system approach to providing a test oracle for the test automation, mainly by validating the quality of its prediction of the test result for the search functionality of a given website. Lastly, the experiments compare the test oracle results and performance of the “All possible cases using if-else conditions” approach and the “Fuzzy Inference System” approach.
I applied the experiments and the study to websites from different categories, such as university websites, search engine websites and entertainment websites. All selected websites had a search bar. The test automation is designed to verify whether the search functionality is working, based on the search endpoint at the server-side level. The search endpoint is generated once the user clicks on the search button. It is a GET HTTP request that takes the search keyword as a parameter in the payload and retrieves the search result as a response. The test automation script simulates the user behaviour: it fills the search field with the search keyword and then hits the search button. The expected result of this behaviour is to retrieve results for the search keyword.
4.1 Compare the Different Approaches for Data Generation
At the beginning of the data generation layer, I conducted a study in which I proposed two methods, in order to identify the best way to locate the search field and the search button on a given website. The approaches are the “image processing approach” and the “web scraping approach”. In the first method, the image processing approach for data generation, I used the PyAutoGUI Python library, which can locate the web element that matches a given image and can also control the mouse movement on the screen. In the implementation, the test automation located the web page elements successfully using PyAutoGUI. The test automation then interacted with the webpage elements to perform the search test case script. My findings were as follows. First of all, the PyAutoGUI library needs control of the mouse and the screen. That means that, while the test automation is executing, PyAutoGUI takes over the mouse movement. Thus, a tester cannot move the mouse or open other windows while running the automation script, because PyAutoGUI will not be able to locate an element that is hidden under another window.
Most importantly, if a tester moves the mouse while the test script is running, the mouse cursor position is calculated wrongly, which affects the test automation execution. Secondly, I found that PyAutoGUI takes a long time to apply the image comparison and processing needed to locate the web elements. Lastly, using this approach on different websites requires different sets of images, because icons and themes differ from website to website. That means the test automation has to be provided with different images for different websites. In contrast, for the web scraping approach for data generation, I found that I can execute the test in the background and still use the running device for other purposes, whereas with the image processing approach the screen of the device that runs the test automation has to be dedicated to the test. One more advantage of the web scraping approach is that no pre-training or data preparation is needed to perform scraping, as most websites use the same HTML structure and tag naming. Most importantly, web scraping is much faster than image processing at locating the web elements on a website.
• For the image processing approach, it took 41703 ms to locate the web page elements and execute the test case.
• For the web scraper agent, it took 11086 ms to locate the web elements and execute the test script.
To summarize, this study has shown that the web scraper agent is better than image processing at locating the web page elements on a website, for several reasons. First, web scraping is faster than image processing. Second, a web scraper agent does not require data preparation to perform the task. Finally, a test automation that uses web scraping can run in the background, unlike the image processing approach, which needs a dedicated device for running the test automation. For all of the above reasons, I decided to use web scraping for this test automation.
4.2 Self-healing Test Automation
Self-healing automation is the ability to carry out the task of locating web page elements successfully even when the environment has changed [2]. In normal test automation without a self-healing feature, testers use the HTML attribute “ID” to locate the webpage elements, as it is the best way of locating them [7]. Testers use the HTML attribute ID as a hardcoded value in the testing script. The problem with this approach is that the “ID” of the webpage elements changes frequently, so testers have to keep updating the ID values in the testing script accordingly. In my approach, the proposed test automation uses a scraper agent that performs web scraping to locate the web page elements. I designed the scraper agent to dynamically identify the HTML attribute “name” of the web element. The HTML attribute “name” is used to build the XPath of the web element. Later, Selenium WebDriver uses the XPath to locate the element on the website and interact with it. I also created a memory text file in which the scraper agent saves the HTML attribute “name” on each run. The test automation checks the memory text file to see whether existing data can help the agent locate any web elements before performing a scrape.
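The memory file lookup described above might be sketched as follows. The JSON layout of the memory file, the function name `locate_names` and the `scrape` callback are assumptions for illustration; the source only states that a text file stores the attribute “name” on each run and is checked before scraping.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def locate_names(url: str, memory_path: str,
                 scrape: Callable[[str], Dict[str, str]]) -> Dict[str, str]:
    """Return the element names for a URL, scraping only when memory has no entry."""
    path = Path(memory_path)
    # Load what previous runs have learned, if anything.
    memory = json.loads(path.read_text()) if path.exists() else {}
    if url in memory:
        return memory[url]
    # Nothing remembered for this website: scrape it and remember the result.
    names = scrape(url)
    memory[url] = names
    path.write_text(json.dumps(memory, indent=2))
    return names
```

A fuller version would also fall back to scraping when a remembered name no longer matches any element on the page, which is the case self-healing is meant to handle.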
I was able to perform a self-healing test automation, as the scraper agent located the search field and search button on different websites without any supervision, using only one selector, the XPath. To conclude, the proposed test automation used a scraper agent that performs web scraping to locate web page elements such as the search field and the search button. I found that the scraper agent was able to locate the search field and search button on different websites (environments) and then perform the search test case scenario of filling the field and clicking on the search button. Moreover, I found that the scraper agent improved the test automation by learning from each run: it interacts with new webpage elements and then saves the data in the memory file.
4.3 Data Collection by Extracting Search Endpoint
In the data generation layer, after performing the search scenario on the webpage, the test automation was able to collect all the HTTP requests generated and, most importantly, to extract the search endpoint from this group of requests. In this approach, I collected every request generated by the test automation using the Selenium Wire library. I was able to extract the
search endpoint successfully for all targeted websites. Table 3 shows the search endpoint collected for each website.
Table 3. Search endpoints generated by the test automation per each website

| URL | Search endpoint |
| https://www.google.com | https://www.google.com/search?q=master |
| https://www.amazon.com | https://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=master |
| https://old.bau.edu.jo/ | http://old.bau.edu.jo/SearchResults.aspx?q=master |
| https://www.reddit.com/ | https://gateway.reddit.com/desktopapi/v1/q=master |
| https://www.bbc.co.uk/search | https://www.bbc.co.uk/search?q=master&page=1 |
| https://uk.yahoo.com/?p=us | https://uk.search.yahoo.com/search?p=master&fr=yfp-t&ei=UTF-8&fp=1 |
| https://www.nottingham.ac.uk/search.aspx | (missing) |
| https://www.ox.ac.uk/ | https://www.ox.ac.uk/search?query=master |
| https://www.leeds.ac.uk/ | https://www.leeds.ac.uk/search?q=master&searchOption=searchSite |
| https://stackoverflow.com/ | https://stackoverflow.com/search?q=master |
| (missing) | https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=master&_sacat=0 |
4.4 Data Evaluation
In this phase, the search HTTP endpoint is evaluated as passed or failed. To do so, a test oracle is needed. Therefore, I proposed two alternative ways of providing a test oracle for the search functionality of a given website. The first method is to use the Fuzzy Inference System model. The second method is to use conventional conditional logic, adding the different cases with if-else statements and, for each case, adding assert True or assert False.
The Fuzzy Logic Data Evaluation Experiment
This experiment uses the data evaluation layer of the system that contains the fuzzy inference system. I ran the test automation script once per website and recorded the following:
1. Search endpoint request.
2. Search endpoint status code.
3. Running time of the fuzzy logic system layer.
Table 4 below gives the details of the fuzzy logic approach experiment.
Table 4. Fuzzy system experiment result for testing search on a given website

| URL | Fuzzy logic time (ms) | Fuzzy-system results |
| https://www.google.com | 10 | Passed |
| https://www.amazon.com | 8 | Passed |
| https://old.bau.edu.jo/ | 6 | Failed |
| https://www.reddit.com/ | 8 | Passed |
| https://www.bbc.co.uk/search | 7 | Passed |
| https://uk.yahoo.com/?p=us | 6 | Passed |
| https://www.nottingham.ac.uk/search.aspx | 7 | Passed |
| https://www.ox.ac.uk/ | 8 | Passed |
| https://www.leeds.ac.uk/ | 15 | Passed |
| https://stackoverflow.com/ | 7 | Passed |
The All-Possible Conditions Using If-else Approach Experiment
The second experiment was to run the test automation script again, also once per website, but this time using the alternative layer for data evaluation. The alternative layer has a predefined test case with all possible conditions written as if-else statements. Table 5 below gives the details of the all-possible-cases approach.
Table 5. Result of all-possible conditions using if-else approach experiment

| URL | All-possible cases run time (ms) | All-possible cases results |
| https://www.google.com | 0 | Passed |
| https://www.amazon.com | 0 | Passed |
| https://old.bau.edu.jo/ | 0 | Failed |
| https://www.reddit.com/ | 0 | Passed |
| https://www.bbc.co.uk/search | 0 | Passed |
| https://uk.yahoo.com/?p=us | 0 | Passed |
| https://www.nottingham.ac.uk/search.aspx | 0 | Passed |
| https://www.ox.ac.uk/ | 0 | Passed |
| https://www.leeds.ac.uk/ | 0 | Passed |
| https://stackoverflow.com/ | 0 | Passed |
Comparing Experiments Result
Based on the results of the experiments shown in Table 4 and Table 5, the prediction produced by the fuzzy logic shows that the inference system was 100% accurate
in prediction compared to the “All possible cases” testing results in verifying the search functionality of the different websites. Looking more deeply into the fuzzy inference system layer, only six rules were added to the inference system. The overall number of assertions needed was two: assert False if the crisp output of the fuzzy inference system is less than 85%, otherwise assert True. The “All possible cases” approach, on the other hand, had twelve test cases in total, each with an assertion based on the scenario. For the performance of both data evaluation layers:
1. The running time of the fuzzy inference system did not exceed 30 ms, which is reasonable given the advantage of providing a test oracle.
2. For the “All possible cases” data evaluation layer, the running time was 0 ms.
Based on the results of both experiments, the search functionality of nine out of ten websites appeared to work as expected. However, the test automation detected a failure on one website. According to the test automation result, the search endpoint was generated correctly, but the rate of existence of the search keyword in the body of the response was low. Thus, the test automation detected the failure. I checked the website by manual testing and confirmed that its search functionality did not work. To conclude, based on the results, I confirmed that the test oracle provided by the fuzzy inference system was 100% correct for all ten websites. The running time of the fuzzy inference system is reasonable. Also, the Fuzzy Inference System needs fewer rules to cover the same test cases than the “All possible test cases” approach.
4.5 Detecting Failure’s Layer
The main advantage of the designed test automation is that it locates the failure layer. To do so, I used the following rules:
• If the issue appears in the data generation layer, where no search endpoint is generated, then the issue with the search functionality is related to the front-end layer (web client level).
• If the issue appears in the data evaluation layer, the issue is in the back-end layer (server-side level). Conversely, if the test result has passed, the server has returned the correct results for the search endpoint request.
The experiment testing the search functionality of the “BAU” website failed at the data evaluation level. That means the issue is in the back end (server-side level). To conclude, I found that the proposed approach of evaluating the functionality of a client side (website UI) at the server-side level is useful in detecting the failure layer for the search functionality.
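The two rules for locating the failure layer amount to a short decision function. The sketch below assumes the hypothetical name `failure_layer` and represents a missing search endpoint as None.

```python
from typing import Optional


def failure_layer(search_endpoint: Optional[str], oracle_passed: bool) -> Optional[str]:
    """Return the layer blamed for a failure, or None when the test passed."""
    if search_endpoint is None:
        # No search endpoint was generated: the web client did not send the request.
        return "front-end"
    if not oracle_passed:
        # The request was sent, but the data evaluation layer rejected the response.
        return "back-end"
    return None
```

For the “BAU” website the endpoint was generated but the evaluation failed, so this function would report the back end.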
4.6 Challenges and Limitation
In the data generation layer, the main challenge for the web scraper agent when using Selenium WebDriver is that some websites do not follow the standard HTML tags and names. This makes scraping harder for the agent; for example, on some websites the form’s action is not defined as “search”. To overcome this issue, I provided the agent with a list of common names and titles used for the search field and button, such as (value = “submit”) for the search button, which is commonly used across different websites. Another challenge was the performance of the test automation. The test automation takes a long time to run, because many sleep and wait statements were inserted in the code while waiting for the page to load. The reason for inserting them is to make sure all of the HTML elements are displayed on the screen. A further challenge, faced while collecting the search endpoint, was that the targeted website detected the automated bot when the “requests” library was used to call the search endpoint as a normal API, and blocked the request. To overcome this issue, I had to use Selenium WebDriver to issue the GET request. However, such an approach would not work for POST requests. So, in the future, other ways of avoiding bot detection should be implemented.
5 Conclusions
This section describes the project’s key findings and future work. First of all, I have presented a “Gray Box Testing” approach for testing websites, in which the search function of a website is verified at both the client-side and server-side levels. I demonstrated the effectiveness of this approach in identifying the defect layer: using it, I was able to detect the layer that caused the failure in the search functionality of a website. Secondly, I conducted experiments to compare two AI mechanisms that locate the web page elements on a website. The first mechanism is web scraping, which builds the XPath of the element from the HTML attribute “name”. The second mechanism is image processing using PyAutoGUI. Based on the experiments, I found that the web scraper is better than image processing at locating the web elements, as it is faster and more reliable. Thirdly, I have presented an approach to self-healing test automation, in which a web scraper agent locates web page elements dynamically using the HTML attribute “name”. Even when the attribute name of a web element changes, the test automation is still able to identify the element again by applying web scraping. Consequently, the test automation can work on different websites, as the web scraper agent locates the web elements dynamically. Furthermore, the scraper agent learns from previous runs and saves the attribute name in a memory file. The test automation checks this memory file on every run before applying the scraping. Lastly, the oracle problem is known to be the main obstacle in test automation. This work proposed a test automation that utilizes an intelligent decision-making algorithm known as fuzzy logic, based on fuzzy set theory, which classifies the inputs to predict the
output. For evaluating the correctness of the search endpoint, this approach can predict a result for any combination of the two inputs. This work showed that the fuzzy inference system is an artificial intelligence approach that can provide a test oracle for functional testing of web applications. The experiment results were further analysed to determine the correctness of the prediction. The results revealed that, when the fuzzy inference system marked a test case as “failed”, there was an issue with the website’s search functionality. On the other hand, when the fuzzy inference system marked the test case as passed, the search functionality was confirmed to be working as expected. These results imply that the Fuzzy Inference System model is effective at predicting the functional test result for a given website. Furthermore, with the Fuzzy Inference System, testers only need to define the rules for the inputs, and any possible case is then covered by the inference system. The 3D plot of the output surface of the Fuzzy Inference System helps to visualise the test oracle for a combination of two inputs. In Fig. 7 below, the high peak shows where the result is passed for a combination of the two inputs (the x-axis is the first input, the status code; the y-axis is the second input, the search keyword rate). The low area in the 3D plot shows where the result must fail.
Fig. 7. Visualisation for the test oracle determined by the fuzzy system for combination of inputs on X axis and Y axis. The test oracle result represented by the Z axis
In future work, this study can be extended by implementing other AI algorithms for predicting a test oracle, such as a Bayesian belief network, and by further evaluating the results of the Fuzzy Inference System.
References
1. Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, pp. 69–72 (Jul 2006)
2. Hudaib, A.A., Fakhouri, H.N., Al Adwan, F.E., Fakhouri, S.N.: A survey about self-healing systems (desktop and web application). Commun. Netw. 9(01), 71 (2017)
3. Korel, B., Al-Yami, A.M.: Assertion-oriented automated test data generation. In: Proceedings of IEEE 18th International Conference on Software Engineering, pp. 71–80. IEEE (Mar 1996)
4. Di Lucca, G.A., Fasolino, A.R., Faralli, F., De Carlini, U.: Testing web applications. In: International Conference on Software Maintenance, 2002. Proceedings, pp. 310–319. IEEE (Oct 2002)
5. Di Lucca, G.A., Fasolino, A.R.: Testing web-based applications: the state of the art and future trends. Inf. Softw. Technol. 48(12), 1172–1211 (2006)
6. Monsefi, A.K., Zakeri, B., Samsam, S., Khashehchi, M.: Performing software test oracle based on deep neural network with fuzzy inference system. In: Grandinetti, L., Mirtaheri, S.L., Shahbazian, R. (eds.) High-Performance Computing and Big Data Analysis: Second International Congress, TopHPC 2019, Tehran, Iran, April 23–25, 2019, Revised Selected Papers, pp. 406–417. Springer International Publishing, Cham (2019). https://doi.org/10.1007/978-3-030-33495-6_31
7. Ramya, P., Sindhura, V., Sagar, P.V.: Testing using selenium web driver. In: 2017 Second International Conference on Electrical, Computer and Communication Technologies (ICECCT), pp. 1–7. IEEE (Feb 2017)
8. Rekik, R., Kallel, I.: Fuzz-Web: A methodology based on fuzzy logic for assessing web sites. Int. J. Comput. Inf. Syst. Ind. Manag. Appl. 5, 126–136 (2013)
9. Trudova, A., Dolezel, M., Buchalcevova, A.: Artificial intelligence in software test automation: a systematic (2020)
10. Zadeh, L.A.: Fuzzy logic. Computer 21(4), 83–93 (1988)
11. Malhotra, R., Sharma: Application of adaptive neuro-fuzzy inference system for predicting software change proneness. In: 2013 In International Conference on Information Technology. IEEE
12. Alhassan, J.K., Misra, S., Umar, A., Maskeliūnas, R., Damaševičius, R., Adewumi, A.: A fuzzy classifier-based penetration testing for web applications. In: International Conference on Information Technology & Systems, 95–104. Springer, Cham (Jan 2018)
13. Kaur, S., Gupta, S.K.: A fuzzy-based framework for evaluation of website design quality index. Int. J. Digit. Libr. 22(1), 15–47 (2020). https://doi.org/10.1007/s00799-020-00292-6
14. King, T.M., Arbon, J., Santiago, D., Adamo, D., Chin, W., Shanmugam, R.: AI for testing today and tomorrow: industry perspectives. In: 2019 IEEE International Conference On Artificial Intelligence Testing (AITest), 81–88. IEEE (Apr 2019)
On the Modelling of Species Distribution: Logistic Regression Versus Density Probability Function
João Bioco1,3(B), Paula Prata1,3, Fernando Canovas2, and Paulo Fazendeiro1,3
1 C4 - Centro de Competências em Cloud Computing (C4-UBI), Universidade da Beira Interior, Covilhã, Portugal [email protected]
2 Facultad de Ciencias de la Salud, Universidad Católica San Antonio de Murcia, Murcia, Spain
3 Instituto de Telecomunicações (IT) Covilhã, Covilhã, Portugal
Abstract. Concerns about climate change have gained attention in the last few years because of its negative impacts on the environment, the economy, and society. To better understand and anticipate the effects of climate change on the distribution of species, several techniques have been adopted, comprising models of different complexity. In general, these models apply algorithms and statistical methods capable of predicting, in a particular study area, the locations considered suitable for a species to survive and reproduce, given a set of eco-geographical variables that influence species behaviour. The logistic regression algorithm and the probability density function are two common methods that can be used to model species suitability. The former is representative of a class of models that requires the availability (or imputation) of presence-absence data, whereas the latter represents the models that require only presence data. Both approaches are compared regarding their capability to accurately predict the environmental suitability for species. In addition, the behaviour of the species is analysed by simulating its potential distribution in the projected environment. A case study reporting results for two species of economic interest is presented: the strawberry tree (Arbutus unedo) in mainland Portugal, and Apis mellifera (African lineage) in the Iberian Peninsula.
Keywords: Agent-based modelling and simulation · Species distribution models · Environmental modelling · Logistic regression · Density probability function · Pseudo-absence data
1 Introduction
Species distribution models (SDMs) are widely used in ecological and biological research to predict the potential geographical distribution of species. Several methods have been used to implement SDMs, e.g. [2,7,10,12]. Two types of species occurrence data can be used to parameterize these methods: (i) presence-only data, which contain the locations where the species under study was observed, and (ii) presence/absence data, which contain the locations where the species was observed as well as the locations where it was not. For methods that require presence/absence data, background (or pseudo-absence) data are generated whenever only presence data are available [15]. Two main approaches can be used to create pseudo-absence data: selecting pseudo-absences at random in the study area, or using a preliminary approach that restricts the selection of absences to locations considered less suitable for the species [16]. No method always performs better than the others. For this reason, it is important to use different methods and evaluate their performance.
This paper compares the performance of the logistic regression algorithm and the density probability function regarding their capability to project species environmental suitability, and analyses the impact of the data used on the behaviour of these two methods. Unlike previous studies, in addition to generating the suitability landscape, the simulation of the species' distribution in the forecast environment is also performed. Starting from the suitability map produced by each approach, the evolution of a virtual species on different environmental landscapes is simulated and discussed for two case studies of species with economic interest.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 378–391, 2022. https://doi.org/10.1007/978-3-031-10464-0_25
2 Background and Related Work
Density probability function and logistic regression are two common methods used to predict the distribution of species. Normally these two methods take as input a set of eco-geographical variables (EGVs) that influence the behaviour of the species, and a sample with the locations where the species was observed (presence-only data) or, alternatively, the locations where both presence and absence of the species were registered (presence/absence data). Density probability function (DP) and logistic regression (LR) belong to different categories: DP requires only presence data, whereas LR requires both presence and absence data (presence/absence) [14]. Absence data are normally considered unreliable or unavailable [16]; however, they are required when LR is used. To make LR usable when absence data are unavailable, pseudo-absence data are used [8,11,17]. Previous works such as [1,10] addressed the impact of the quantity of pseudo-absence data on the predictions of species distribution models. The effect of the sample size of both presence-only and pseudo-absence data was analysed using several techniques, such as regression, classification and machine learning methods, as case studies. These works found that the approach used to select pseudo-absence data, and the size of the pseudo-absence sample, have a great impact on model performance.
These previous studies analyse the species-environment relationship by implementing methods to project current or future climate scenarios (environmental suitability). The present study focuses on comparing the performance of the LR and DP methods, based on some of the criteria defined by previous studies. However, this study goes one step further by simulating the distribution of species in the projected environments, to analyse how species evolve and spread to the locations suitable for their survival. To simulate the distribution of species, we chose the species distribution simulator (SDSim) [3] because it can use the output of an SDM as a defining feature of the simulation.
2.1 SDSim
SDSim is an easy-to-use Web-based species distribution simulator that allows users to simulate the distribution of real or virtual species [3]. By simulating the distribution of species in SDSim, users can observe how species spread and colonize the locations most suitable for their survival. To perform a simulation, users fill in all the simulation parameters, upload the set of EGVs that describe the species environment, and upload the species occurrence data (in the case of real species). In addition, users can access all previous simulations, including their results, and perform new simulations based on previous ones. Two main tasks are performed in SDSim: (i) describing the suitability map of the species, i.e. a map with values between 0 and 1, where values closer to 1 represent more suitable places for the species and values closer to 0 less suitable ones (SDSim implements LR and DP to project the environmental suitability for the species); and (ii) simulating how the species evolves in the study area, based on a general species life cycle composed of three main phases: birth, spread and death. At the end of the simulation, SDSim outputs the suitability map, the distribution map of the species, a video of the simulation, and the performance of the method used, measured by the receiver operating characteristic (ROC) curve and the area under the curve (AUC). SDSim is available online at https://sdsim.it.ubi.pt.
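To make the birth, spread and death cycle more concrete, the following minimal sketch runs such a cycle on a small synthetic suitability grid. It is not SDSim's implementation: the function name `simulate`, the synthetic grid, and the way each rate is scaled by the local suitability are assumptions made purely for illustration. The default values mirror the settings reported for the case studies (100 initial individuals, birth rate 0.5, death rate 0.2, spread rate 0.3, 200 repetitions, read here as iterations).

```python
import numpy as np

def simulate(suitability, n_init=100, birth=0.5, death=0.2, spread=0.3,
             steps=200, seed=0):
    """Toy birth/spread/death cycle on a suitability grid (values in [0, 1]).

    Hypothetical rules, not taken from SDSim:
      - birth:  each occupied cell produces an offspring with probability `birth`;
      - spread: the offspring settles in a random neighbouring cell with
                probability `spread` * suitability of that cell;
      - death:  an occupied cell is vacated with probability
                `death` * (1 - local suitability).
    """
    rng = np.random.default_rng(seed)
    rows, cols = suitability.shape
    occupied = np.zeros((rows, cols), dtype=bool)
    # Place the initial population in randomly chosen cells.
    idx = rng.choice(rows * cols, size=min(n_init, rows * cols), replace=False)
    occupied.flat[idx] = True

    for _ in range(steps):
        r, c = np.nonzero(occupied)
        # Candidate target cell for each potential offspring (8-neighbourhood,
        # clipped at the borders of the study area).
        nr = np.clip(r + rng.integers(-1, 2, size=r.size), 0, rows - 1)
        nc = np.clip(c + rng.integers(-1, 2, size=r.size), 0, cols - 1)
        newborn = rng.random(r.size) < birth
        settle = newborn & (rng.random(r.size) < spread * suitability[nr, nc])
        die = rng.random(r.size) < death * (1.0 - suitability[r, c])
        occupied[r[die], c[die]] = False
        occupied[nr[settle], nc[settle]] = True
    return occupied

# Synthetic suitability: a gradient from 0 (left edge) to 1 (right edge).
grid = np.tile(np.linspace(0.0, 1.0, 50), (50, 1))
final = simulate(grid)
left, right = final[:, :25].sum(), final[:, 25:].sum()
```

Under these assumed rules the population should concentrate in the more suitable right half of the grid, which is the qualitative behaviour the text describes: individuals spread towards and persist in the locations with higher suitability.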
3 Methods
The projections of the environmental suitability (suitability map) for each species were obtained in two ways: with LR and with DP. Both methods use occurrence data to describe the relationship between the species distribution and a set of EGVs. In LR each EGV is a predictor variable and the occurrence data is the response. Since LR is a classification method, it requires absence data in addition to the locations where the species was observed (presence data). Because only presence data were available, pseudo-absence data were generated. Pseudo-absence data were chosen at random from the study area, excluding the locations where the species was observed [1]. To evaluate the effect of pseudo-absence data, five different samples were chosen. Each sample includes all the available presence-only data plus pseudo-absence data, whose quantity varies from 200 to 1000 points (the first sample contains 200 pseudo-absences and the last contains 1000). Each sample is composed of the values of each EGV at each point and the corresponding response variable (0 or 1). Based on the predictor variables (EGVs) and the values of the response variable, LR produces a model that predicts the probability of occurrence of the species, given the set of EGV values at each point of the study area. For the LR implementation, scikit-learn (a Python machine learning library) was used. Parameters were defined as follows: solver = 'liblinear', random_state = 0, tol = 0.00001, max_iter = 1500, C = 0.050, penalty = 'l1'.
For DP, presence-only data are sufficient to project the suitability map for the species. Presence-only data are used to calculate the mean and standard deviation of each EGV (the EGV's optimal suitability values). The eco-geographical values are then standardized as $z_i = (x_i - \mu)/\sigma$, where $x_i$ is the value of an EGV at location $i$, $\mu$ is the mean and $\sigma$ the standard deviation of that EGV. The probability density function is applied to each standardized EGV, and the values are then normalized to the interval from 0 to 1 (optimal). Finally, the aggregation of the EGV values at each point produces the suitability map (predicted map) for the species.
The area under the ROC curve (AUC) was used to evaluate the two methods. AUC is a good technique for evaluating the classification performance of a model and has been applied in several studies [1,10,12,15].
3.1 First Case Study: Strawberry Tree
The strawberry tree (Arbutus unedo L.) is a Mediterranean species found in large quantities in Portugal and across the Mediterranean area. Studying the distribution of this species is of significant interest in Portugal for economic reasons: its fruit is used to produce spirit drinks, considered the main source of revenue for forest owners [13]. Nine EGVs that influence the behaviour of the strawberry tree were used: BIO1, BIO2, BIO5, BIO9, BIO15, tmax, tmin, slope and altitude. These EGVs were previously applied in [13], where they were considered the variables that most influence the behaviour of the strawberry tree. The climate variables (BIO1, BIO2, BIO5, BIO9, BIO15, tmax, tmin) were obtained from the climatic atlas of the world [9] (30 arc-second resolution, approximately 1 × 1 km). The altitude was obtained from the Global Multi-resolution Terrain Elevation Data 2010 (30 arc-second resolution, approximately 1 × 1 km) [6]. The slope was calculated in degrees using the slope algorithm [4]. In this case study, 318 locations where the species was observed were used (presence-only data), together with different pseudo-absence samples randomly chosen in the study area. Occurrence data can be obtained from the SDSim website.
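As a rough illustration of this setup, the sketch below builds one sample (318 presences plus 200 random pseudo-absences drawn outside the presence cells) and fits a logistic regression with the scikit-learn parameters listed in Sect. 3. The EGV values are synthetic, and the names `egv`, `presence_idx` and `build_sample` are hypothetical; the real rasters and occurrence records are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the study area: 5000 cells x 9 EGVs (hypothetical data).
n_cells, n_egv = 5000, 9
egv = rng.normal(size=(n_cells, n_egv))

# Hypothetical presence cells: the 318 cells with the highest value of EGV 0.
presence_idx = np.argsort(egv[:, 0])[-318:]

def build_sample(egv, presence_idx, n_pseudo, rng):
    """Presence rows (label 1) plus random pseudo-absences (label 0),
    drawn from the study area excluding the presence cells."""
    candidates = np.setdiff1d(np.arange(egv.shape[0]), presence_idx)
    pseudo_idx = rng.choice(candidates, size=n_pseudo, replace=False)
    X = np.vstack([egv[presence_idx], egv[pseudo_idx]])
    y = np.concatenate([np.ones(presence_idx.size), np.zeros(n_pseudo)])
    return X, y

X, y = build_sample(egv, presence_idx, n_pseudo=200, rng=rng)

# Parameters as listed in Sect. 3 of the paper.
model = LogisticRegression(solver="liblinear", random_state=0, tol=1e-5,
                           max_iter=1500, C=0.05, penalty="l1")
model.fit(X, y)

# Suitability map: predicted probability of presence for every cell.
suitability = model.predict_proba(egv)[:, 1]
```

Repeating `build_sample` with `n_pseudo` set to 400, 600, 800 and 1000 would give the other four samples described in the text.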
Experimental Results
There are similarities between the suitability map obtained from DP and the suitability maps obtained from LR. Figure 1 shows the suitability map obtained from DP and the suitability maps obtained from LR for each sample. Visually, the difference between the DP map and the LR maps increases as the number of pseudo-absences increases. To analyse how well the two methods predict the species suitability map, the area under the curve (AUC) is calculated along with the ROC curve. Figure 2 presents the AUC and ROC curves, comparing the classification performance of LR and DP. For LR, the AUC is on average 0.70, whereas for DP it is on average 0.62.
Figure 3 shows the distribution map of the species and the corresponding ROC curve for the two methods. These figures were obtained from SDSim. The sampling strategy of SDSim consists of choosing equal quantities of presence-only and pseudo-absence data, with the pseudo-absence data chosen at random from the study area. In addition to predicting the suitability map (the map of environmental conditions), SDSim also simulates the distribution of the species in the predicted environment according to its life cycle, which includes the birth, death and spread phases. The initial population was set to 100 individuals randomly placed in the study area; the life cycle parameters were defined as follows: birth rate 0.5, death rate 0.2 and spread rate 0.3; and the simulation was run 200 times. The AUC for the distribution map of the species using DP (AUC = 0.62) is equal to the average AUC of the DP suitability map, whereas the AUC for the distribution of the species using LR (AUC = 0.68) is lower than the average AUC of the LR suitability map.
3.2 Second Case Study - Apis Mellifera Honeybee
Apis mellifera is an Iberian honeybee that is also present in other locations. It can be found in both natural and artificial hives. The interest in studying the distribution of this species is related to the production and storage of honey and to the construction of colonial nests from wax, which are widely used in the honey derivatives industry and in the cosmetics industry, respectively. In this case study, four EGVs (resolution 10 × 10 km) obtained from the climatic atlas of the world [9] were used: maximum temperature of the warmest month (mxtwm), minimum temperature of the coldest month (mntcm), rainfall seasonality (rf seas) and average annual temperature (tann) [5]. Presence-only data were collected in the field and can be found on the SDSim website.
Experimental Results
Figure 4 presents the suitability maps obtained from LR and DP. Visually, the difference between the suitability map obtained from LR and the suitability map from DP increases as the sample size increases.
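As a rough illustration of how a DP suitability map such as the one in Fig. 4(a) can be produced from presence-only data, the sketch below follows the steps given in Sect. 3 for four synthetic EGVs and 135 presence cells. Several details are assumptions, since the text does not specify them: a normal density per EGV, scaling by the peak of the density so that 1 means optimal, and a plain mean across EGVs as the aggregation step. The name `dp_suitability` is hypothetical.

```python
import numpy as np

def dp_suitability(egv, presence_idx):
    """Density-probability suitability from presence-only data.

    egv: array of shape (n_cells, n_egv); presence_idx: indices of presence cells.
    Assumptions (not stated in the paper): a standard normal density per EGV,
    scaled by its peak so that 1 is optimal, and a plain mean across EGVs
    as the aggregation step.
    """
    mu = egv[presence_idx].mean(axis=0)      # optimal value of each EGV
    sigma = egv[presence_idx].std(axis=0)    # spread around the optimum
    z = (egv - mu) / sigma                   # standardized EGV values
    pdf = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    scaled = pdf * np.sqrt(2.0 * np.pi)      # peak density maps to 1
    return scaled.mean(axis=1)               # aggregate the EGVs per cell

rng = np.random.default_rng(1)
egv = rng.normal(size=(1000, 4))             # four synthetic EGVs
# Hypothetical presences: the 135 cells closest to the origin of EGV space.
presence_idx = np.argsort(np.abs(egv).sum(axis=1))[:135]
suitability = dp_suitability(egv, presence_idx)
```

Other aggregation choices (for example a product or a minimum across EGVs) would fit the same description and generally yield more conservative maps than the mean.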
[Figure panels: (a) Suitability Map - Density Probability Function (DP); (b) 200 Pseudo-Absences (LR); (c) 400 Pseudo-Absences (LR); (d) 600 Pseudo-Absences (LR); (e) 800 Pseudo-Absences (LR); (f) 1000 Pseudo-Absences (LR)]
Fig. 1. Suitability maps of the strawberry tree obtained by the density probability function (panel a) and by logistic regression (panels b, c, d, e, and f) with different sample sizes. All panels use the 318 occurrence records; the quantity of pseudo-absence data varies from 200 to 1000.
[Figure panels: (a) Sample of 200 Points; (b) Sample of 400 Points; (c) Sample of 600 Points; (d) Sample of 800 Points; (e) Sample of 1000 Points]
Fig. 2. ROC curves: comparison between the logistic regression algorithm and the density probability function for the strawberry tree.
[Figure panels: (a) Distribution Map and ROC Curve (Density Probability); (b) Distribution Map and ROC Curve (Logistic Regression)]
Fig. 3. Distribution maps obtained from SDSim with both the logistic regression method and the density probability function for the strawberry tree.
Figure 5 shows the AUC for both methods. On average, the AUC is 0.67 for LR, whereas for DP it is 0.61. Figure 6 shows the distribution map of the species and the corresponding ROC curve for the two methods, obtained from the web-based species distribution simulator (SDSim). The initial population was set to 100 individuals randomly placed in the study area; the life cycle parameters were defined as follows: birth rate 0.5, death rate 0.2 and spread rate 0.3; and the simulation was run 200 times. The AUC for the distribution map of the species using DP (AUC = 0.60) is lower than the average AUC of the DP suitability map, whereas the AUC for the distribution of the species using LR (AUC = 0.72) is higher than the average AUC of the LR suitability map.
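For readers less familiar with the metric, the AUC values quoted above can be read as the probability that a randomly chosen presence location receives a higher suitability score than a randomly chosen (pseudo-)absence location. The sketch below computes it directly from that definition; the scores are made-up numbers, and the helper name `auc` is hypothetical.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Area under the ROC curve as the probability that a presence cell
    scores higher than a (pseudo-)absence cell; ties count one half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

# Made-up suitability scores for four presences and four pseudo-absences.
presence_scores = [0.9, 0.8, 0.6, 0.4]
absence_scores = [0.7, 0.5, 0.3, 0.2]
value = auc(presence_scores, absence_scores)   # 13 of 16 pairs ranked correctly
```

With these numbers 13 of the 16 presence/absence pairs are ordered correctly, giving 0.8125; a value of 0.5 corresponds to a random classifier. In practice `sklearn.metrics.roc_auc_score` returns the same quantity from labels and scores.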
[Figure panels: (a) Suitability Map - Density Probability Function; (b) 200 Pseudo-Absences (LR); (c) 400 Pseudo-Absences (LR); (d) 600 Pseudo-Absences (LR); (e) 800 Pseudo-Absences (LR); (f) 1000 Pseudo-Absences (LR)]
Fig. 4. Suitability maps of the Apis mellifera honeybee obtained by the density probability function (panel a) and by logistic regression (panels b, c, d, e, and f) with different sample sizes. All panels use the 135 occurrence records; the quantity of pseudo-absence data varies from 200 to 1000.
[Figure panels: (a) Sample of 200 Points; (b) Sample of 400 Points; (c) Sample of 600 Points; (d) Sample of 800 Points; (e) Sample of 1000 Points]
Fig. 5. ROC curves: comparison between the logistic regression algorithm and the density probability function for the Apis mellifera honeybee.
[Figure panels: (a) Distribution Map and ROC Curve (Density Probability); (b) Distribution Map and ROC Curve (Logistic Regression)]
Fig. 6. Distribution maps obtained from SDSim with both the logistic regression method and the density probability function for the Apis mellifera honeybee.
4 Discussion
Observing the suitability maps of the species, it is possible to notice a high concentration of optimal locations for each species on the maps obtained by DP. Although both methods produce very similar patterns, the maps obtained by DP contain more suitable regions, allowing the species to reproduce and colonize more quickly. According to the performance measures, DP produced poor results in both case studies. However, from a biological point of view, the DP approach seems to be the one in closer agreement with the real data collected in the field. On the other hand, the suitability maps obtained by LR have many more places where the species has difficulty surviving. The suitability values are lower, causing the species to take longer to spread in the environment. This effect results from the approach used to select pseudo-absence data (random selection). When pseudo-absence data are selected at random, several locations suitable for the species are potentially classified as absences. LR, and for that matter any regression approach, fits the model with these data and then evaluates its performance using the same data. Therefore, the approach used to generate pseudo-absence data has a strong influence on LR performance. Despite its wide use, one can say that, from the biological standpoint, the random selection of pseudo-absence data is not the best approach. Another factor that also affects LR performance is the sample size (the quantity of pseudo-absence data). According to the results, a number of pseudo-absences close or equal to the number of presence-only records turned out to be a good approach. Overall, both methods (LR and DP) performed better than a random classifier (AUC = 0.5). However, in these case studies, LR performed better than DP. When the results of the two methods were fed to SDSim, LR also performed better in projecting the suitability map and presented better results in the species distribution map.
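One way to probe the two effects discussed here, the pseudo-absence sample size and the evaluation of LR on the data it was fitted with, is to repeat the fit for each sample size and compute the AUC on held-out points. The sketch below does this on synthetic data; the held-out split is an addition for illustration and is not part of the protocol described in the paper, and all data and variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
egv = rng.normal(size=(5000, 4))                          # synthetic study area
presence_idx = np.argsort(egv[:, 0] + egv[:, 1])[-300:]   # hypothetical presences
candidates = np.setdiff1d(np.arange(egv.shape[0]), presence_idx)

results = {}
for n_pseudo in (200, 400, 600, 800, 1000):
    pseudo_idx = rng.choice(candidates, size=n_pseudo, replace=False)
    X = np.vstack([egv[presence_idx], egv[pseudo_idx]])
    y = np.concatenate([np.ones(presence_idx.size), np.zeros(n_pseudo)])
    # Hold out 30% of the points so the AUC is not computed on the fitting data.
    X_fit, X_test, y_fit, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    model = LogisticRegression(solver="liblinear", penalty="l1", C=0.05,
                               tol=1e-5, max_iter=1500, random_state=0)
    model.fit(X_fit, y_fit)
    results[n_pseudo] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```

The dictionary `results` then maps each pseudo-absence sample size to a held-out AUC, which can be compared across sizes in the same way as the panels of Figs. 2 and 5. The synthetic presences here are far easier to separate than real occurrence data, so the absolute values are not meaningful.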
5 Conclusion
In this study, the performance of LR and DP in projecting environmental suitability was compared. Two case studies regarding the distribution of species were performed. In the first, the distribution of Arbutus unedo in mainland Portugal was modelled; in the second, the distribution of Apis mellifera in the Iberian Peninsula. For each case study, the environmental suitability for the species was obtained in two ways: using LR and using DP. The two methods were compared regarding the suitability map and the simulation of the distribution of a virtual species in the resulting environmental landscapes. Only presence data were available for both case studies; consequently, pseudo-absence data were generated and used in the assessment of the methods. The cardinality of the pseudo-absence data was shown to have a significant impact on the performance of LR. Because DP uses only presence data to project the environmental suitability, whereas LR additionally relies on pseudo-absence data when absence data are missing, DP was expected to perform better than LR. However, strictly numerically speaking, LR performed better than DP in both case studies in describing the relationship between occurrence data and environmental conditions, although, in general, the results obtained by the two methods present similar patterns. The results of using SDSim to simulate the potential distribution of species in the projected scenarios provide valuable insights, with complementary parametric tuning, into what the real distribution and proliferation of a species would be, regardless of the base method used. For further work, it will be interesting to implement different strategies for selecting pseudo-absence data and then analyse the effects of both the pseudo-absence data and the sample size on the model's performance. A more comprehensive comparison involving more methods, in addition to logistic regression and the probability density function, should also be considered.
Acknowledgments. This work was supported by operation Centro-01-0145-FEDER-000019 - C4 - Centro de Competências em Cloud Computing, cofinanced by the European Regional Development Fund (ERDF) through the Programa Operacional Regional do Centro (Centro 2020), in the scope of the Sistema de Apoio à Investigação Científica e Tecnológica - Programas Integrados de IC&DT. This work was also funded by FCT/MCTES through national funds and, when applicable, co-funded by EU funds under the project UIDB/50008/2020. We thank all the authors of the paper [13] for providing the occurrence data used in the first case study.
References
1. Barbet-Massin, M., Jiguet, F., Albert, C.H., Thuiller, W.: Selecting pseudo-absences for species distribution models: how, where and how many? Meth. Ecol. Evol. 3(2), 327–338 (2012)
2. Beaumont, L.J., et al.: Which species distribution models are more (or less) likely to project broad-scale, climate-induced shifts in species ranges? Ecol. Model. 342, 135–146 (2016)
3. Bioco, J., Prata, P., Canovas, F., Fazendeiro, P.: SDSim: a generalized user friendly web ABM system to simulate spatiotemporal distribution of species under environmental scenarios. Environ. Model. Softw. 147, 105234 (2022)
4. Burrough, P.A., McDonnell, R.A.: Principles of Geographical Information Systems, p. 190. Oxford University Press, New York (1998)
5. Cánovas, F., De la Rúa, P., Serrano, J., Galián, J.: Analysis of a contact area between two distinct evolutionary honeybee units: an ecological perspective. J. Insect Conserv. 18(5), 927–937 (2014). https://doi.org/10.1007/s10841-014-9701-1
6. Danielson, J.J., Gesch, D.B.: Global multi-resolution terrain elevation data 2010 (GMTED2010). US Department of the Interior, US Geological Survey (2011)
7. Elith, J., et al.: Novel methods improve prediction of species' distributions from occurrence data. Ecography 29(2), 129–151 (2006). https://doi.org/10.1111/j.2006.0906-7590.04596.x
8. Engler, R., Guisan, A., Rechsteiner, L.: An improved approach for predicting the distribution of rare and endangered species from occurrence and pseudo-absence data. J. Appl. Ecol. 41(2), 263–274 (2004)
9. Hijmans, R.J., Cameron, S.E., Parra, J.L., Jones, P.G., Jarvis, A.: Very high resolution interpolated climate surfaces for global land areas. Int. J. Climatol. 25(15), 1965–1978 (2005)
10. Liu, C., Newell, G., White, M.: The effect of sample size on the accuracy of species distribution models: considering both presences and pseudo-absences or background sites. Ecography 42(3), 535–548 (2019)
11. Pearce, J., Ferrier, S.: An evaluation of alternative algorithms for fitting species distribution models using logistic regression. Ecol. Model. 128(2–3), 127–147 (2000)
12. Phillips, S.J., Anderson, R.P., Schapire, R.E.: Maximum entropy modeling of species geographic distributions. Ecol. Model. 190(3–4), 231–259 (2006)
13. Ribeiro, M.M., et al.: Bioclimatic modeling in the Last Glacial Maximum, Mid-Holocene and facing future climatic changes in the strawberry tree (Arbutus unedo L.). PLoS ONE 14(1), e0210062 (2019)
14. Robertson, M.P., Caithness, N., Villet, M.H.: A PCA-based modelling technique for predicting environmental suitability for organisms from presence records. Divers. Distrib. 7(1–2), 15–27 (2001)
15. VanDerWal, J., Shoo, L.P., Graham, C., Williams, S.E.: Selecting pseudo-absence data for presence-only distribution modeling: how far should you stray from what you know? Ecol. Model. 220(4), 589–594 (2009)
16. Wisz, M.S., Guisan, A.: Do pseudo-absence selection strategies influence species distribution models and their predictions? An information-theoretic approach based on simulated data. BMC Ecol. 9(1), 1–13 (2009)
17. Zaniewski, A.E., Lehmann, A., McC Overton, J.: Predicting species spatial distributions using presence-only data: a case study of native New Zealand ferns. Ecol. Model. 157(2–3), 261–280 (2002)
Artificial Intelligence Tools for Actuator Fault Diagnosis of an Unmanned Underwater Vehicle
Paolo Castaldi2, Saverio Farsoni1, Massimiliano Menghini2, and Silvio Simani1(B)
1 Department of Engineering, University of Ferrara, Ferrara, Italy [email protected]
2 Department of Electrical, Electronic, and Information Engineering, University of Bologna, Bologna, Italy
http://www.silviosimani.it
Abstract. The paper addresses the development of an artificial intelligence algorithm implemented for maximum power point tracking control of an unmanned underwater vehicle. It is shown that this algorithm tracks the optimum operating point and provides a fast response even in the presence of faults. The strategy implements the tracking algorithm using real-time measurements, while providing maximum power to the grid without online data training. The solution is simulated in Matlab and Simulink to verify the effectiveness of the proposed approach when fault-free and faulty conditions are considered. The simulation results highlight the efficient, intrinsic and passive fault tolerant performance of the algorithm for general unmanned underwater vehicles with low inertia.
Keywords: Fault diagnosis · Neural networks · Actuator fault · High-fidelity simulation · Autonomous underwater vehicle
1 Introduction
Considering unmanned underwater vehicles, energy is produced by forces that are usually transformed into electricity through the movement of turbine blades. This energy is captured and transmitted to an electric generator. The whole system represents the unmanned underwater vehicle, which can vary in size depending on the location conditions [6,16]. This renewable source of energy makes it possible to generate electric power as long as the unmanned underwater vehicle is present. Two types of electric generators are exploited in these installations: synchronous and asynchronous machines. The former require an external DC power supply for the rotor, or permanent magnets. The latter rely on the induction principle, where the rotating magnetic field from the rotor induces the voltage in the stator winding. Nowadays, unmanned underwater vehicles can rely on several solutions depending on the generator type [3,10].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 392–403, 2022. https://doi.org/10.1007/978-3-031-10464-0_26
One of the most common implementations is the Permanent Magnet Synchronous Generator (PMSG), because of its operational simplicity and low cost. Induction machines are selected depending on their rotor: the wound rotor, which characterises the Doubly Fed Induction Generator (DFIG), or the squirrel cage of the Squirrel Cage Induction Generator (SCIG), with a single or double cage [4]. One of the most important parameters of an unmanned underwater vehicle is the power coefficient ($C_p$), which indicates the efficiency of the system in terms of energy transformation. It accounts for all possible losses that may affect the performance of the unmanned underwater vehicle, including mechanical and electrical characteristics. This parameter is usually provided by the manufacturer on the basis of laboratory tests and mathematical model simulations. The generator is able to produce a certain amount of power if the torque and speed can reach its electrical design characteristics. Consequently, the length of the blades is important for achieving the required torque and rotational speed. From the mechanical point of view, $C_p$ depends mainly on the relation between the rotational speed $\omega_r$, the length of the blades, i.e. the rotor radius $R$, and the water speed $V_w$. This relation is known as the tip speed ratio $\lambda$, described by Eq. (1):
$$\lambda = \frac{\omega_r R}{V_w} \qquad (1)$$
In order to understand how the function $C_p$ can be described, it is important to recall the basic concepts of wind energy and power. These concepts are used to explain the so-called Betz limit, which represents the turbine efficiency in terms of mass conservation. This law states that it is possible to capture at most about 59.3% (16/27) of the kinetic energy available in the wind or water flow [2]. In Eq. (1), $\lambda$ is the tip speed ratio, $V_w$ is the water speed, $R$ is the length of the blade, and $\omega_r$ is the thrusters' speed.
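A small numerical sketch may help fix the quantities in Eq. (1) and the Betz limit. The power expression $P = \tfrac{1}{2}\rho A V_w^3 C_p$ with swept area $A = \pi R^2$ is the standard textbook relation and is not written out in the paper; the function names and the numerical values below are hypothetical.

```python
import math

BETZ_LIMIT = 16.0 / 27.0  # theoretical maximum power coefficient, about 0.593

def tip_speed_ratio(omega_r, radius, v_w):
    """Eq. (1): lambda = omega_r * R / V_w (units: rad/s, m, m/s)."""
    if v_w <= 0:
        raise ValueError("flow speed must be positive")
    return omega_r * radius / v_w

def extracted_power(rho, radius, v_w, cp):
    """P = 0.5 * rho * A * V_w**3 * Cp, with Cp capped at the Betz limit."""
    area = math.pi * radius**2
    return 0.5 * rho * area * v_w**3 * min(cp, BETZ_LIMIT)

# Hypothetical operating point: 12 rad/s rotor speed, 0.5 m radius, 2 m/s flow.
lam = tip_speed_ratio(omega_r=12.0, radius=0.5, v_w=2.0)
power = extracted_power(rho=1000.0, radius=0.5, v_w=2.0, cp=0.4)
```

For the operating point above the tip speed ratio is 3.0. Any requested power coefficient above 16/27 is clipped, which is one simple way of encoding the Betz limit in a model.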
Moreover, the analysis of Unmanned Underwater Vehicles (UUVs) requires the following considerations:
– wind is a renewable energy source, and it is necessary to measure its speed and direction for proper unmanned underwater vehicle operation;
– the nacelle rotation angle, also known as yaw, needs to be regulated on the basis of the water flow direction;
– for variable speed unmanned underwater vehicles, the blades may change their inclination angle. This creates resistance, thus allowing the turbine to accelerate or stop. The angle of the blade is known as the pitch angle;
– once the energy is transmitted to the generator, its design allows a certain amount of electric power to be generated. However, the voltage, frequency and phase of the signals must be matched to the grid levels. For standalone systems, these parameters must match the load requirements.
Power electronic converters are incorporated in order to achieve the electric connection. Maximum Power Point Tracking (MPPT) control allows the system to provide most of the available active power by controlling the power electronic converter. For example, the most common converter is the back-to-back (AC/DC/AC) configuration. Since power electronic devices can reach a high-speed frequency response, it is easier to command the voltage signal that drives MPPT control. First, a fully controlled rectifier sets the optimum voltage for maximising the power extractable from the generator. Then, this maximum power point is constantly tracked to guarantee ideal operation. Finally, a fully controlled inverter produces the output voltage signal, which needs to be synchronised with the load or grid requirements. This allows the system to work continuously, while the yaw and pitch controllers work simultaneously to find the best direction and pitch angle [5].
MPPT control is performed by different methods [13]: maximum power control, optimum torque control, and optimum tip-speed control. Maximum generated power control requires speed sensors to generate a power reference signal, which is used to drive the digital controller block. It is important to note that those sensors cannot provide an accurate measurement of the unmanned underwater vehicle speed, as the wind field changes when the blades are rotating. Moreover, the measured generated power ($P_m$), the grid voltage and current ($v_g$, $i_g$), as well as the reference power estimated on the basis of the actual unmanned underwater vehicle speed ($P_m^*$), are used as auxiliary control signals, as shown in Fig. 1.
Fig. 1. The simulator scheme. [Block diagram: trajectory guidance, stabilization control, actuators, AUV model, measurement sensors and FDI scheme; signals u and y, errors, faults and residual signals.]
A variation of this method results in the Power Signal Feedback (PSF) control, which uses records of output power from the grid. The values of the maximum power curves are stored and used to compute a reference. This reference and the measured output power feed the digital controller. Optimum generated torque control uses torque measurements from the generator, which are transformed into a torque reference. Similarly to the maximum generated power control system depicted in Fig. 2, the grid voltage and current are measured and used by the digital controller in order to command the power converter.
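As a minimal sketch of how a stored maximum power curve could yield the PSF reference, the Python fragment below interpolates a lookup table and forms the error that would feed the digital controller. The table values, the use of rotor speed as the lookup key, and the function names are illustrative assumptions, not taken from the paper.

```python
from bisect import bisect_left

# Hypothetical stored maximum-power curve: (rotor speed [rad/s], max power [W]).
# The numbers are placeholders for illustration only.
MAX_POWER_CURVE = [(0.5, 10.0), (1.0, 80.0), (1.5, 270.0), (2.0, 640.0)]


def power_reference(omega, curve=MAX_POWER_CURVE):
    """Linearly interpolate the stored curve to obtain a power reference."""
    speeds = [speed for speed, _ in curve]
    if omega <= speeds[0]:
        return curve[0][1]
    if omega >= speeds[-1]:
        return curve[-1][1]
    k = bisect_left(speeds, omega)
    (s0, p0), (s1, p1) = curve[k - 1], curve[k]
    return p0 + (p1 - p0) * (omega - s0) / (s1 - s0)


def psf_error(omega, p_measured):
    """Error between the stored reference and the measured output power."""
    return power_reference(omega) - p_measured


reference = power_reference(1.25)
error = psf_error(1.25, 150.0)
print(reference, error)
```

A lookup table keeps the sketch independent of any particular generator model; a real implementation would populate it from recorded or manufacturer data.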
Unmanned Underwater Vehicle Fault Diagnosis
Fig. 2. Residual generator bank for fault isolation. [Block diagram: measurement sensors and faulty actuators of the ODIN AUV simulator feed, through selectors, a bank of residual generators 1, 2, …, i, …, p.]
Optimum tip–speed control is probably the least accurate method, as it relies only on the unmanned underwater vehicle speed measurement and its rotor speed. Since the system requires a speed sensor and a rotor speedometer, it has specific design requirements. Multiple algorithms can be applied for maximum power control to produce a stable and constant Cp at the output of the generator, as shown in Fig. 3. The model input considers all the variables related to the rotor speed and pitch angle, whilst the output is the pitch angle command [7]. The key point of the paper is the development of a control solution that is robust with respect to faults affecting the system. In general, Fault Tolerant Control (FTC) solutions are divided into two types of schemes, i.e. Passive Fault Tolerant Control (PFTC) and Active Fault Tolerant Control (AFTC) systems. On the one hand, PFTC does not need a Fault Detection and Diagnosis (FDD) or Fault Detection and Isolation (FDI) task, or even controller re–design, but it has quite limited FTC features. On the other hand, AFTC is able to manage the fault on the system component in an active way, and it rearranges the control laws such that stability is maintained and acceptable performance levels of the entire process
Fig. 3. Residuals for single fault isolation. [Plot: six panels showing the residuals r1(k) to r6(k); horizontal axes from 0 to 600; labels 'Faulty' and 'T'.]
are kept. Therefore, a successful AFTC strategy exploits real–time FDD/FDI modules in order to provide updated information regarding the health status of the dynamic process. Over the last decades, the increasing demand for safety, stability, reliability, and availability in power plants has motivated important research activities in the FTC area, as described e.g. in [9]. In particular, unmanned underwater vehicles represent complex nonlinear dynamic processes, with aerodynamics that are nonlinear, partially unknown, and unsteady. Moreover, their rotors, generators and components are affected by complex turbulent wind inflow field effects that generate fatigue loading and disturbance torques. For this reason, the need for condition monitoring, diagnosis and robust control of unmanned underwater vehicles has motivated these fundamental and challenging activities, as addressed e.g. in [15]. Unmanned underwater vehicles installed in offshore conditions may implement complex control methodologies and techniques in order to obtain the prescribed performance. A fundamental control task, which explains the interest of this paper, regards in particular the intrinsic fault tolerance capabilities of the designed control solution. In fact, the passive fault tolerance features of the control module must take into account the management of the possible faults affecting the process under investigation. Considering this particular issue, the FTC problem applied to an unmanned underwater vehicle benchmark was analysed e.g. in [14], which considered a simple but, at the same time, realistic, general and high–fidelity simulator of a typical industrial unmanned underwater vehicle.
Finally, the paper will show that the control solution implemented via a perturb and observe approach together with an MPPT scheme is able to exhibit passive fault tolerant features when applied to the considered unmanned underwater vehicle simulator.
2 Process Model and Control Scheme
The unmanned underwater vehicle model consists of all mechanical and electrical components. The main modules are the blades that capture wind forces, the nacelle housing the generator and the gear–box, the tower supporting the unmanned underwater vehicle, and the connection to the grid or load. This model is simulated in the Matlab and Simulink environments. First, it is necessary to define the type of generator. Once it is selected, the wind energy conversion system (WECS) can produce energy while reducing mechanical fatigue and possible faults. Note that the occurrence of faults is also considered in this work, as analysed for example in [17]. The MPPT represents the strategy that is able to extract the highest amount of power on the basis of the working conditions. The Cp curve as a function of λ is used to represent this concept on the basis of the manufacturer information. The mathematical relation can be described as a parametric model that includes the Cp curve. In particular, Fig. 4 depicts the general Cp coefficient with respect to the λ parameter, using the simulation data provided in [11] with the following parameters: c1 = 0.5176, c2 = 116, c3 = 0.4, c4 = 5, c5 = 21 and c6 = 0.0068. As shown in Fig. 4, these parameters lead to Cpmax = 0.48 and λopt = 8.1. According to [19], there are mainly three algorithms for determining the MPPT:
– the Power Signal Feedback (PSF), based on the optimum torque control;
– the Hill Climb Search (HCS), based on the maximum power control;
– the tip–speed ratio (TSR), based on the optimum tip speed control.
In particular, the PSF technique provides a power reference based on the load or grid side electric characteristics, and then the inverter is configured to maximize the power extraction.
The HCS method applies an intelligent memory method relying on the ‘search-remember-reuse’ algorithm, which finds the maximum power extraction without the need to know the parameters of the unmanned underwater vehicle or the electrical load/grid connection. Finally, the TSR strategy produces the maximum power by measuring or estimating the unmanned underwater vehicle speed and the generator rotational speed. Additionally, knowledge of the optimum TSR operational point is required. However, there are different aspects to be considered in order to calculate Cp: this work has selected the Cp expression of Eq. (2) taken from [18]:
$$C_p(\beta, \lambda) = 0.5176\left(\frac{116}{\lambda_i} - 0.4\,\beta - 5\right)e^{-21/\lambda_i} + 0.0006795\,\lambda \tag{2}$$
Fig. 4. Second example of residual signals. [Plot: six panels showing the residuals r1(k) to r6(k); horizontal axes from 0 to 600; labels 'Faulty' and 'T'.]
where:
$$\lambda_i = \left(\frac{1}{\lambda + 0.08\,\beta} - \frac{0.035}{\beta^3 + 1}\right)^{-1} \tag{3}$$
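The Cp relation of Eqs. (2) and (3) can be evaluated numerically. The following Python sketch is only illustrative: it uses the coefficients c1 to c6 listed in the text (so the last term uses c6 = 0.0068, an assumption where the printed Eq. (2) shows a different constant), fixes β = 0, and locates the optimum tip–speed ratio by a simple grid search whose range and step are arbitrary choices.

```python
import math

# Coefficients c1..c6 as listed in the text (simulation data from [11]).
C1, C2, C3, C4, C5, C6 = 0.5176, 116.0, 0.4, 5.0, 21.0, 0.0068


def lambda_i(lam, beta):
    """Auxiliary ratio of Eq. (3)."""
    return 1.0 / (1.0 / (lam + 0.08 * beta) - 0.035 / (beta ** 3 + 1.0))


def power_coefficient(lam, beta=0.0):
    """Power coefficient Cp(beta, lambda) in the form of Eq. (2)."""
    li = lambda_i(lam, beta)
    return C1 * (C2 / li - C3 * beta - C4) * math.exp(-C5 / li) + C6 * lam


def find_optimum(beta=0.0, lam_min=1.0, lam_max=15.0, step=0.01):
    """Grid search (illustrative range) for the lambda that maximises Cp."""
    n = int(round((lam_max - lam_min) / step))
    grid = [lam_min + k * step for k in range(n + 1)]
    best = max(grid, key=lambda lam: power_coefficient(lam, beta))
    return best, power_coefficient(best, beta)


lam_opt, cp_max = find_optimum()
print(f"lambda_opt = {lam_opt:.2f}, Cp_max = {cp_max:.3f}")
```

With these coefficients the search should land close to the values Cpmax = 0.48 and λopt = 8.1 quoted in the text.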
and β is the collective pitch angle. Note that each specific unmanned underwater vehicle model can be described by a different expression of the Cp function of Eq. (2). Knowing the form of the Cp expression, it is possible to calculate every point of its curve in real time during power generation. Moreover, using this relation it is possible to determine the variable inputs and outputs in the controller design and to define the Cp at a desired working point. Considering that variations of the pitch angle β modify the UUV speed, it is important to control this variable by establishing a relationship between the UUV speed and the pitch angle. Therefore, a general pitch angle expression can be obtained using a trigonometric expression taken from the general blade design procedure addressed in [8], which has the form of Eq. (4):
$$\frac{\partial}{\partial\beta}\left[\sin^2\beta\,(\cos\beta - \lambda\sin\beta)(\sin\beta + \lambda\cos\beta)\right] = 0 \tag{4}$$
After some simplifications, the optimum relative UUV angle β, also known as pitch angle for a local tip–speed ratio, has the form of Eq. (5):
$$\beta = \frac{2}{3}\tan^{-1}\left(\frac{V_w}{\omega_r R}\right) \tag{5}$$
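A short Python sketch of Eq. (5) is given below. The function name, argument names and numerical values are hypothetical; the result is returned in radians.

```python
import math


def optimum_pitch_angle(v_w, omega_r, radius):
    """Eq. (5): beta = (2/3) * atan(V_w / (omega_r * R)), in radians."""
    if omega_r <= 0 or radius <= 0:
        raise ValueError("omega_r and radius must be positive")
    return (2.0 / 3.0) * math.atan(v_w / (omega_r * radius))


# Illustrative values only: V_w = 10 m/s, omega_r = 2 rad/s, R = 40 m.
beta = optimum_pitch_angle(v_w=10.0, omega_r=2.0, radius=40.0)
print(f"beta = {math.degrees(beta):.3f} deg")
```

Since the tip–speed ratio is λ = ωr R / Vw, the argument of the arctangent is 1/λ, and the returned angle is a stationary point of the bracketed expression in Eq. (4).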
Usually, the generator speed ωr has to be constant in order to maintain constant voltage and power. However, when the TSR method is applied, ωr may also change, which means that the frequency may vary. Standalone loads do not affect the DC link, but the total amount of delivered power must be kept constant. On the other hand, if this method is used when the system is connected to a grid, the output power must be delivered through a converter. Some considerations regarding inertial forces at the generator are addressed e.g. in [19]. The main input for this algorithm is the measured power Pm of Eq. (6), which depends on the applied torque Tf, the UUV angular speed ω, and the overall efficiency η of the system from the generator input to the inverter output, which represents one key factor:
$$P_m = P_{LOAD} + T_f\,\omega + \omega J\frac{d\omega}{dt} = \frac{1}{\eta}P_{OUT} + T_f\,\omega + \omega J\frac{d\omega}{dt} \tag{6}$$
Equation (6) indicates the amount of power that is produced in terms of angular speed, and how the inertial forces J dω/dt can produce extra power to the system.
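The power balance of Eq. (6) can be written as a small function. This is only a sketch: the argument names and the numbers in the example are assumptions chosen to make the three contributions easy to read.

```python
def measured_power(p_out, eta, t_f, omega, inertia, domega_dt):
    """Eq. (6): P_m = P_OUT / eta + T_f * omega + omega * J * domega/dt."""
    if not 0.0 < eta <= 1.0:
        raise ValueError("eta must lie in (0, 1]")
    p_load = p_out / eta                    # P_LOAD seen at the generator input
    friction = t_f * omega                  # T_f * omega
    inertial = omega * inertia * domega_dt  # omega * J * domega/dt
    return p_load + friction + inertial


# Illustrative values only.
p_m = measured_power(p_out=900.0, eta=0.9, t_f=2.0, omega=10.0,
                     inertia=5.0, domega_dt=0.5)
print(p_m)
```

A negative `domega_dt` (deceleration) makes the inertial term negative, so the same expression covers both acceleration and deceleration phases.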
3 Perturb and Observe Algorithm
The MPPT using the Perturb and Observe (PO) algorithm represents a technique based on the derivative of the output power curve. The results presented in this paper have been achieved by selecting a Permanent Magnet Synchronous Generator (PMSG). This approach is similar to the hill climb algorithm proposed e.g. in [19]. The PO method is implemented using a state machine, whose block diagram is shown in Fig. 5, where the initial values are updated each cycle. In the first stage, the algorithm verifies whether the MPPT algorithm is activated by checking a bit that is set to 1 or 0. Then, in the second step, the initial voltage parameters are defined to initialise the controller. In the next stage, the algorithm calculates the power P by multiplying the measured voltage V and current I. The differences of the power and voltage, ΔP and ΔV, are then calculated, and the algorithm verifies whether the actual values need to be increased or decreased based on the reference value of the voltage Vref. These values are stored for establishing the minimum and maximum limits of the reference voltage, Vref,min and Vref,max, that the controller can generate [11].
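The stages described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the state machine used in the paper: the class and field names, the fixed perturbation step, and the toy current–voltage characteristic are all hypothetical, and the converter is assumed to track Vref perfectly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerturbAndObserve:
    v_ref: float                    # current reference voltage Vref
    v_ref_min: float                # lower limit of Vref
    v_ref_max: float                # upper limit of Vref
    step: float = 0.5               # perturbation size (assumed constant)
    enabled: bool = True            # MPPT activation bit
    direction: int = 1              # +1 raises Vref, -1 lowers it
    prev_v: Optional[float] = None
    prev_p: Optional[float] = None

    def update(self, v, i):
        """One cycle: compute P, dP and dV, then perturb and clamp Vref."""
        if not self.enabled:
            return self.v_ref
        p = v * i
        # dV and dP are taken as zero on the first cycle.
        dv = 0.0 if self.prev_v is None else v - self.prev_v
        dp = 0.0 if self.prev_p is None else p - self.prev_p
        if dp != 0.0 and dv != 0.0:
            # Same sign: keep climbing in voltage; opposite sign: go back.
            self.direction = 1 if (dp > 0) == (dv > 0) else -1
        candidate = self.v_ref + self.direction * self.step
        self.v_ref = min(self.v_ref_max, max(self.v_ref_min, candidate))
        self.prev_v, self.prev_p = v, p
        return self.v_ref


def toy_current(v):
    """Hypothetical source whose power V * I peaks at V = 30."""
    return max(0.0, (60.0 - v) / 10.0)


mppt = PerturbAndObserve(v_ref=10.0, v_ref_min=5.0, v_ref_max=55.0)
v = mppt.v_ref
for _ in range(200):
    v = mppt.update(v, toy_current(v))
print(mppt.v_ref)
```

With a fixed step the reference does not settle exactly at the maximum but oscillates within one step of it, which is the usual trade-off between tracking speed and steady-state ripple in this family of algorithms.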
The converter power system is then configured using the vector space control considering the voltage grid level. The system also synchronises the generator frequency with the grid. Figure 6 depicts the generator power output based on the UUV speed, whose profile is reported in Fig. 7. This profile was selected to provide a better understanding of inertial forces at the generator. It is worth noting that constant UUV speeds produce a more stable power coefficient Cp, since the algorithm does not require multiple iterations. In other words, the more stable the UUV speed at the blades, the more power can be obtained from the generator and extracted from the power source. Under these conditions, possible faults are also tolerated and thus compensated in a passive way.
Fig. 5. Residual signals for multiple fault isolation. [Plot: six panels showing the residuals r1(k) to r6(k); horizontal axes from 0 to 600; legend: fault free, faulty; label 'T'.]
The power coefficient Cp as a function of time is depicted in Fig. 8. In particular, Fig. 8(a) shows the Cp response with PO control in the presence of faults. It can be noted that Cp increases when the flow across the blades becomes faster. On the other hand, Cp decreases progressively when the UUV speed is reduced abruptly [12]. This effect is due to inertial forces. Figure 8(b) shows the presence of multiple oscillations due to the faults affecting the system, which are not mitigated by the PO algorithm corrections over time. It is worth noting that the initial power and voltage differences, ΔP and ΔV, are considered null. However, once the system starts to compute these values, the data are recorded and updated continuously at every sampling time. This method is also widely applied in solar applications. As an example, a similar algorithm for the MPPT control of solar panels can be found in [1].
Fig. 6. Residuals with possible fault case. [Plot: six panels showing the residuals r1(k) to r6(k); horizontal axes from 0 to 600; labels 'Faulty' and 'T'.]
Fig. 7. Third example of residual signals for single fault isolation. [Plot: six panels showing the residuals r1(k) to r6(k); horizontal axes from 0 to 600; labels 'Faulty' and 'T'.]
Finally, the comparison of Fig. 8(a) and (b) serves to highlight the passive fault tolerance features acquired by the developed control approach relying on an MPPT scheme with a PO algorithm.
Fig. 8. Case of isolation of multiple faults (two concurrent faults). [Plot: six panels showing the residuals r1(k) to r6(k); horizontal axes from 0 to 600; legend: fault free, faulty; label 'T'.]
4 Conclusion
The perturb and observe method addressed in this paper can be considered a reliable technique for implementing unmanned underwater vehicle systems with low inertia generators. The algorithm iteration speed allowed the controller to correct the actuated signal feeding the power converter. The main advantage of the proposed approach is that it can be easily implemented on conventional programmable devices and can achieve even faster responses on embedded systems. The simplicity of the algorithm could allow it to work alongside complementary techniques such as artificial neural networks, deep learning or similar. The simulation tests highlighted that this technique can be implemented in real scenarios where speed sensors are not available. Future investigations will address a more accurate analysis of the design of the proposed fault tolerant control scheme, including real implementations and online training for application to generators with high inertia.
References
1. Banu, I.V., Istrate, M.: Modeling of maximum power point tracking algorithm for photovoltaic systems. In: 2012 International Conference and Exposition on Electrical and Power Engineering, Iasi, Romania, 25–27 October 2012, pp. 953–957. IEEE (2012). https://doi.org/10.1109/ICEPE.2012.6463577
2. Castellani, F., Garinei, A., Terzi, L., Astolfi, D., Gaudiosi, M.: Improving windfarm operation practice through numerical modelling and Supervisory Control and Data Acquisition data analysis. IET Renew. Power Gener. 8(4), 367–379 (2014). https://doi.org/10.1049/iet-rpg.2013.0182
3. Garcia-Sanz, M., Houpis, C.H.: Wind Energy Systems: Control Engineering Design. CRC Press (February 2012). ISBN 978-1439821794
4. Gasch, R., Twele, J. (eds.): Wind Power Plants. Fundamentals, Design, Construction and Operation. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-22938-1
5. Heier, S.: Grid Integration of Wind Energy Conversion Systems, 3rd edn. Wiley, London (2014)
6. Heier, S.: Grid Integration of Wind Energy: Onshore and Offshore Conversion Systems. Engineering & Transportation, 3rd edn. Wiley (June 2014). ISBN 9781119962946
7. Khouloud, B., Mahieddine, A., Tahar, B., Rabah, L., Azzeddine, G.: Robust control of doubly fed induction generator for wind turbine under sub-synchronous operation mode. Energy Procedia 74(1), 886–899 (2015). https://doi.org/10.1016/j.egypro.2015.07.824
8. Kulunk, E.: MPPT control methods in wind energy conversion systems. In: Aerodynamics of Wind Turbines, Rijeka, Croatia, April 2014, pp. 3–18. IntechOpen (2014). ISBN 978-953-307-508-2. https://doi.org/10.5772/17854
9. Mahmoud, M.M., Jiang, J., Zhang, Y.: Active Fault Tolerant Control Systems: Stochastic Analysis and Synthesis. Lecture Notes in Control and Information Sciences, Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36283-5
10. Manwell, J.F., McGowan, J.G., Rogers, A.L.: Wind Energy Explained: Theory, Design, and Application. Wiley, West Sussex, England (2002)
11. MathWorks. MathWorks Wind Turbine, 29 May 2019. https://la.mathworks.com/help/physmod/sps/powersys/ref/windturbine.html
12. Mohammadi, J., Vaez-Zadeh, S., Afsharnia, S., Daryabeigi, E.: A combined vector and direct power control for DFIG-based wind turbines. IEEE Trans. Sustain. Energy 5(3), 767–775 (2014). https://doi.org/10.1109/TSTE.2014.2301675
13. Muhammad, R.: Power Electronics, Circuits, Devices, and Applications, 4th edn. IEEE, New Jersey, USA (2014)
14. Odgaard, P.F., Stoustrup, J., Kinnaert, M.: Fault-tolerant control of wind turbines: a benchmark model. IEEE Trans. Control Syst. Technol. 21(4), 1168–1182 (2013). ISSN 1063-6536. https://doi.org/10.1109/TCST.2013.2259235
15. Odgaard, P.F., Stoustrup, J.: Fault tolerant wind farm control - a benchmark model. In: Proceedings of the IEEE Multiconference on Systems and Control, MSC 2013, Hyderabad, India, 28–30 August 2013, pp. 1–6 (2013)
16. Pramod, J.: Wind Energy Engineering. McGraw–Hill (September 2010). ISBN 9780071714778
17. Simani, S., Farsoni, S.: Fault Diagnosis and Sustainable Control of Wind Turbines: Robust Data-Driven and Model-Based Strategies. Mechanical Engineering, 1st edn. Butterworth-Heinemann - Elsevier, Oxford (UK), 4 January 2018. ISBN 9780128129845
18. Thongam, J.S., Ouhrouche, M.: MPPT control methods in wind energy conversion systems. In: Fundamental and Advanced Topics in Wind Power, Rijeka, Croatia, March 2014, pp. 339–360. IntechOpen (2014). ISBN 978-953-307-508-2. https://doi.org/10.5772/21657
19. Wang, Q., Chang, L.: An intelligent maximum power extraction algorithm for inverter-based variable speed wind turbine systems. IEEE Trans. Power Electron. 19(5), 1242–1249 (2004). https://doi.org/10.1109/TPEL.2004.833459
Applying the Delphi Method to Measure Enterprise Content Management Workflow System Performance
Hisham AbouGrad1(B) and Jon Warwick2
1 University of East London (UEL), University Way, Docklands, London E16 2RD, UK
[email protected]
2 London South Bank University (LSBU), 103 Borough Road, London SE1 0AA, UK
Abstract. Organisations need to measure enterprise content management (ECM) workflow systems performance to achieve their mission and objectives. This requires an exploration of the business environment where ECM workflow systems operate using an appropriate decision-making method and business process management (BPM) values. This paper describes the Delphi method as an appropriate methodology and identifies CERT values as appropriate BPM values with the support from experts and experienced professionals to measure ECM workflow systems performance. CERT values are Customer orientation (C), Excellence (E), Responsibility (R) and Teamwork (T). The purpose of this paper is to explain how the Delphi method can be used to measure ECM workflow systems performance. Further, CERT values are described to drive the business processes through the Delphi method to measure workflow system performance. The paper examines the academic literature on Delphi studies, ECM and CERT values and the benefits of this combination of ideas are revealed. The Delphi method strengths are identified to measure ECM workflow systems performance. Overall, this study focuses on the Delphi rounds as decision-making criteria to formulate a methodology in combination with CERT values to evaluate ECM workflow systems performance.
Keywords: Enterprise content management · Business process management · Workflow systems performance · Delphi method · Decision-making criteria
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 404–419, 2022. https://doi.org/10.1007/978-3-031-10464-0_27
1 Introduction
The Delphi method is known both theoretically and practically as a systematic technique to explore complex business practice and information system (IS) problems [1–3]. The Delphi method is also known for its usability in developing business decision-making criteria, key performance indicators (KPIs) and problem-solving objectives. Indeed, Delphi results can be used to improve business process performance. In essence, the Delphi method is a structured communication technique, first developed as a systematic, interactive forecasting process based on a panel of experts [4–6]. The experts respond to a set of questions in a series of rounds, each providing their own response (controlled feedback). After each Delphi round, a facilitator delivers an anonymous summary of the experts’ predictions from the responses, along with the reasons they provided for their judgments. As a result, experts are encouraged to review their answers in light of the replies of others. It is believed that during the Delphi rounds the variation of the responses will decrease, and the group will converge towards a consensus response. The Delphi process is stopped by pre-defined ‘end’ criteria (e.g., the number of rounds, achievement of consensus, or stability of results). Usually, the mean or median scores of the final round determine the final decision criteria and results [4, 5].
The Delphi method has been applied to detect critical issues, enable prediction and provide practitioners in leadership roles with important information relating to decision-making criteria development, policy construction or improvement in their practice [7–9]. Practically, Delphi has been used in many research areas to set goals and develop new roles for professionals. Accordingly, this study explores how the Delphi method can be used to measure workflow system performance using CERT values to achieve business objectives. The Delphi method, along with business process management CERT values, can be used to develop a set of actions (BPM construct) in order to improve ECM workflow information system performance [10].
The business process management values of CERT (i.e., Customer orientation, Excellence, Responsibility, and Teamwork) are measurement concept values for an enterprise’s processes, which can be applied to improve workflow system performance. CERT values are workflow information systems (WIS) and BPM values for managing an organisation’s workflows [2]. Schmiedel et al. [10, p. 298] define CERT values as ideals that influence behavioural and organisational patterns of a group and major business process objectives.
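The round structure described in this introduction (anonymous summary after each round, pre-defined end criteria, median of the final round) can be sketched in Python as follows. The ratings, the use of the interquartile range as the consensus measure, and the thresholds are illustrative assumptions, not values from any cited Delphi study.

```python
from statistics import median, quantiles


def summarise_round(ratings):
    """Anonymous summary a facilitator might feed back to the panel."""
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return {"median": median(ratings), "iqr": q3 - q1, "n": len(ratings)}


def run_delphi(rounds, max_rounds=4, iqr_threshold=1.0):
    """Return (round number, summary) for the first round meeting an end criterion."""
    summary = None
    for number, ratings in enumerate(rounds, start=1):
        summary = summarise_round(ratings)
        # End criteria: spread small enough (consensus) or round limit reached.
        if summary["iqr"] <= iqr_threshold or number >= max_rounds:
            return number, summary
    return len(rounds), summary


# Hypothetical ratings (scale 1-9) from six experts over three rounds.
rounds = [
    [2, 5, 9, 4, 7, 3],
    [4, 5, 7, 4, 6, 5],
    [5, 5, 6, 5, 5, 4],
]
stop_round, final = run_delphi(rounds)
print(stop_round, final)
```

The median of the stopping round is then taken as the panel's result; a mean could be substituted where the rating scale justifies it.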
An enterprise content management system is a core strategic information management system for managing an enterprise’s content and handling unstructured content (e.g., digital documents, emails, application forms). ECM systems have control over the creation and distribution of business information, characteristics and functionalities [11]. In this paper, ECM is used as a workflow information system that can be assessed using the Delphi method in conjunction with CERT values, which indicate sets of workflow measurement activities to optimise the system performance. ECM can be considered an integrated approach to information systems that covers and aligns established concepts (e.g., document management system (DMS), (web) content management and records management (RM) system) at an enterprise-wide scale [12]. ECM as a workflow information system improves an enterprise’s customer services, streamlines processes, improves employee productivity, tracks information, provides assistance to comply with regulations, eliminates unnecessary digital and non-digital information and helps implement business continuity measures. Importantly, an organisation uses an ECM system to clearly identify the required type of organisational culture, data type and other enterprise resource planning (ERP) systems that the ECM system would be integrated with to ensure effective workflow performance [13].
Workflow information systems are used to understand content actions, which are a key condition for effective customisation of ECM systems [2, 12, 14]. WIS are information systems used to solve problems for an enterprise’s BPM [15]. WIS are the automation of processes involving combinations of human activities and IS applications [16]. The Delphi method can be applied to develop a deeper understanding of an enterprise’s BPM (e.g., in examining which values support an organisation’s workflow information systems). Indeed, Delphi is usually chosen because its iterative approach enhances validity compared to other data gathering methods and processes (e.g., a single cross-sectional questionnaire). In practice, the study by Looy et al. [17] has shown that Delphi results are higher in quantity and quality of ideas than those of other decision support methods. Also, Delphi examples are present in information system studies in general, and in an enterprise’s BPM and workflow information systems in particular.
The following sections present a background to the Delphi method and introduce CERT values, ECM systems and workflow information systems. Subsequently, the Delphi methodology and the concept of a Delphi study technique (framework) are discussed. Then, secondary results from different published Delphi studies are considered to describe and explain the impact of the Delphi method on workflow information systems performance, concluding with a discussion and conclusion as an outlook summary.
2 Background
The following sections provide discussions on the Delphi method, business process management, CERT values, enterprise content management performance and workflow information systems in order to illustrate the novelty and importance of the Delphi results, which can be applied to measure WIS performance using BPM values.
2.1 The Delphi Method and Business Process Management
Assessing the performance of an enterprise’s workflow system requires the application of a methodology that can be used for performance measurement. The Delphi method has been recognised as a means to compare workflow systems for producing empirical findings [18]. For example, a Delphi study could be implemented with BPM experts from different industries and geographic locations to define a general BPM job profile and specifications. The Delphi method uses expert opinions to find a reliable consensus using a sequence of question rounds coupled with controlled feedback from the experts. The Delphi method has been used to specify BPM values for developing controlled decision-making criteria to find consensus regarding workflow system issues and other related problems. The Delphi study technique was named “Project DELPHI” at The RAND Corporation when the Delphi method was first developed to elicit experts’ opinions for overall feedback [5]. Essentially, the Delphi method questions a group of chosen experts using questionnaires or interviews to improve the effectiveness and efficiency levels of an organisation’s BPM [19]. Business process management is an approach that focuses on understanding and developing an enterprise’s workflow information system based on two objectives: the effectiveness and the efficiency of the business process (workflow system). In practice, the early BPM research studies focused mainly on efficiency, through the role of information systems and information technology (IT).
Later, workflow information systems were developed as a new approach to business technical prospects, mainly to develop business processes and workflow system design. Therefore, BPM studies came to focus on workflow information systems modelling and automation, with the use of BPM approaches to implement effective information systems and IT solutions. Nowadays, BPM approaches and workflow information systems use business practice and research studies to improve WIS in both effectiveness and efficiency in order to achieve an organisation’s workflow objectives [19, 20]. Research studies on BPM values have suggested that together the Delphi method and BPM approaches can be used to conceptualise and analyse the elements of an organisation’s BPM values. The study by Schmiedel et al. [10] recognised the use of CERT values as BPM (workflow system) values in achieving an enterprise’s workflow objectives, and used the Delphi method to complete the study. Hence, CERT values as a BPM construct have been recognised as success values when they are used along with the Delphi method to measure workflow information system performance.
2.2 The Delphi Method and CERT Values
Achieving expert consensus along with stability is a key objective of Delphi, as it allows the participating experts to reach consensus on the significant aspects of a workflow system’s issues [4, 6, 7]. Consensus is achieved when an agreed percentage of the participants come to an agreement. Usually, this means that a given percentage of the responses fall within a given range (tolerance) of the predicted value. The stability of views is also an important indicator in Delphi, together with its correlation to consensus. Stability is reached when the Delphi process produces no further changes in responses. Although consensus and stability are essential and need to be obtained, the purpose of a Delphi approach is to produce critical investigation and discussion, not to force a quick agreement.
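The consensus and stability notions of Sect. 2.2 can be made concrete with a short Python sketch. The tolerance, the required share, the use of the median as the reference value, and the ratings are all assumptions made for illustration.

```python
from statistics import median


def consensus_share(ratings, tolerance=1.0):
    """Fraction of responses within +/- tolerance of the panel median."""
    centre = median(ratings)
    within = sum(1 for r in ratings if abs(r - centre) <= tolerance)
    return within / len(ratings)


def is_consensus(ratings, tolerance=1.0, required_share=0.75):
    """Consensus: the agreed percentage of responses lies within the range."""
    return consensus_share(ratings, tolerance) >= required_share


def is_stable(previous, current, max_shift=0.0):
    """Stability: no expert changed a response by more than max_shift."""
    return all(abs(c - p) <= max_shift for p, c in zip(previous, current))


# Hypothetical ratings from the same six experts in consecutive rounds.
round_2 = [4, 5, 7, 4, 6, 5]
round_3 = [5, 5, 6, 5, 5, 4]
round_4 = [5, 5, 6, 5, 5, 4]

print(consensus_share(round_2), is_consensus(round_3))
print(is_stable(round_2, round_3), is_stable(round_3, round_4))
```

Keeping the two checks separate reflects the point made above: a panel can be stable without agreeing, and a facilitator may want to report both conditions rather than stop at the first apparent agreement.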
The Delphi method is also considered a useful method for developing instruments to be used in information system studies [4, 6]. Such an instrument could be a physical questionnaire designed to collect demographic information or a survey form with questions to which the experts would respond based on their expertise. These questions would usually be generated from problems facing an area of business process or a particular organisation’s workflow information system. The BPM study by Schmiedel et al. [10] suggests that CERT values, as a workflow system measurement concept (BPM construct), and the Delphi method, as a study technique, work together well in assessing ECM workflow system performance. One approach would be to structure a series of rounds based on the Delphi method, with the measured response (controlled feedback) using CERT values. In fact, CERT values reached a high consensus rate on achieving workflow system objectives [10]. Also, CERT values include the commitment to develop workflow system objectives and the accountability to ensure stability in the workflow system decision-making process.
2.3 CERT Values and Enterprise Content Management System Performance
Enterprise content management has the capability to promote efficient, effective and flexible workflow information system performance. Practically, ECM is a solution for most contemporary information management and business process management problems.
ECM systems are the business tools, approaches, processes and skills an organisation needs to manage its information assets over its business lifecycle (workflow system) [12]. At the same time, BPM workflows are the key engine driving ECM systems, because an understanding of business activities is a crucial precondition for setting up and customising a successful ECM system for an organisation’s workflow. ECM workflow system performance refers to the information system’s capabilities, achievements and speed in running BPM workflows. ECM systems ensure workflows are completed sufficiently, as they have an impact on various performance aspects [10]. This raises questions such as: what information should enterprises know, and how do enterprises establish a workflow structure that enables them to understand their ECM workflow system? In fact, enterprises should look at their BPM values as one of the major factors for implementing an ECM system, because of their impact on WIS performance [12]. Research studies on ECM systems have established key capabilities related to business strategy development, process and deployment using workflow systems [21–23]. Also, many studies have argued that BPM analysis provides a suitable basis for identifying content and its users, along with the different systems in which content resides, as ECM systems implementation affects an organisation’s workflow activities. As a result, organisations should take CERT values as both the starting point and the target for implementing ECM systems to achieve the required workflow objectives. CERT values have been implemented to verify a BPM instrument to evaluate the extent to which an organisation adopts an information system for its BPM workflow system [2, 10, 20]. CERT values are a measurement concept, which can be applied for evaluating WIS performance. CERT values are BPM values that influence organisation structure, behaviour and patterns.
Customer orientation (C) is the proactive and responsive attitude toward the needs of process output recipients; Excellence (E) is the continuous improvement and innovation of the workflow system required to achieve high performance; Responsibility (R) is the commitment to BPM workflow system objectives and the accountability for process decision-making; Teamwork (T) is the positive attitude toward cross-functional collaboration [24].

2.4 The Delphi Method and Workflow Information Systems

Workflow information systems need a decision-making approach to ensure that business processes perform efficiently and effectively in meeting the expected objectives. The Delphi method has been used in this context as a series of rounds (Table 1), which support experts’ assessment by preparing a set of indicators to measure workflow system performance [4, 6, 24]. The Delphi method is an effective approach for exploring ideas and structuring group communication on framework development and rating, and for weighing decision criteria through multiple criteria decision-making. It can therefore be used to develop a decision-making tool that proves a business process management workflow system concept (workflow model) or a workflow-based process performance measurement system [17, 25]. A workflow is a set of activities that represents a business process and involves the coordinated execution of multiple tasks by multiple resources to achieve specific objectives [23, 26–28]. Workflow systems are used to guide organisations to
Applying the Delphi Method to Measure Enterprise Content Management
standardise their business process management mechanism to meet their industry standards. The Workflow Management Coalition (WfMC) was established in 1993 to develop interoperability standards and common terminology for use by industry workflow vendors, which share common themes across different workflow contexts [29]. WfMC has defined workflow as “The automation of a business process, in whole or part, during which documents, information or tasks are passed from one participant to another for action, according to a set of procedural rules.” [29, p. 132]. Together, workflow automation of such systematic actions and the Delphi method can enhance understanding of the most efficient and effective workflow system performance. A workflow information system is used to manage workflow technologies for information systems development and management [14, p. 1]. WIS enables the BPM workflow to generate the performance levels required to achieve the expected objectives. A workflow system also needs BPM as a comprehensive approach to understanding workflow effectiveness and efficiency levels, focusing on CERT values in workflow information system implementation and performance. The Delphi method, for its part, is a research study technique that can utilise CERT values as a comprehensive indicator to evaluate workflow system performance. It is used to discover workflow system issues and indicators by weighing decision criteria in the Delphi multiple criteria decision-making rounds [4, 17, 20, 25]. In addition, the Delphi method can be used to explore ideas and structure group communication to develop a systematic performance framework for rating and weighing decision-making criteria to measure workflow systems.
Thus, the Delphi method is a multi-criteria framework and a decision-making tool, which can be used to prove a workflow model concept by measuring system performance in order to improve workflow information system performance. The Delphi method seeks to gain the most reliable consensus from experts and/or experienced professionals [4, 17, 25]. It is commonly chosen because its iterative approach enhances validity compared to a single questionnaire. The purpose of Delphi is to produce critical investigation and discussion leading to a final agreement, rather than forcing quick agreement. The Delphi method has resulted in a higher quantity and quality of ideas than other group decision-making methods. Examples of Delphi use are present in information systems studies in general and in BPM workflow systems in particular [7, 10, 17].
3 The Delphi Method

This section explains and discusses Delphi’s framework, reliability and validity.

3.1 The Delphi Method as a Research Study Framework

A study on business processes or workflow systems requires a structured approach such as the Delphi method, which collects expert opinions to find a consensus through a series of controlled questions. The Delphi method conventionally has three rounds, each of which has a different aim [4, 10, 17]. Round 1 is Brainstorming to
identify and/or select key indicators or factors and set up an initial list of criteria. Round 2 is Narrowing Down to validate the list of criteria or key indicators and rate (rank) those key indicators by importance or feasibility. Round 3 is Weighing to reach consensus or verify the decision and validate the study results. However, there are several ways to use Delphi; for example, some researchers depend on experts to find issues and key indicators, while others use a literature review to formulate a set of indicators or categories prior to the Delphi rounds. Table 1 summarises the conventional Delphi method rounds and their inputs and outputs.

Table 1. Delphi method rounds [17].

Round | Input of the codification panel | Output of the expert panel
1. Brainstorming | Propose initial list of criteria; request missing criteria | Per initial criterion: rate its importance, give open comments. For all criteria: rate overall importance, give open comments. Propose missing criteria
2. Narrowing down | Consolidate criteria | Per criterion: rate its importance, give open comments. For all criteria: rate overall importance, give open comments
3. Weighing | Determine final criteria; request weightings | For all criteria: rate overall importance, give open comments. Weigh criteria and options
Brainstorming. This is the first stage of the Delphi method (Round 1), which can be considered a pilot study that collects a small sample of the study data to propose the initial list of criteria and indicate importance (e.g., workflow system importance levels). The initial decision criteria from the Brainstorming round become the subject of the Narrowing Down round [4, 17]. In practice, brainstorming is used to identify key measurement indicators for business processes and workflow systems. CERT values have the potential to structure the Brainstorming round in order to identify initial key performance indicators. The KPIs identified in Round 1 from the key variables of CERT values can then be used in the Narrowing Down round [10, 20].

Narrowing Down. This is the second stage of the Delphi method (Round 2), which validates the results from the Brainstorming round. Narrowing Down is used for seeking
a complete rating or ranking of the recognised key indicators by measuring their importance or feasibility to obtain a degree of consensus. This gives a list of key indicators to use in the Weighing round [4, 17]. Narrowing Down confirms the key indicators of the workflows in order to obtain the final rate of consensus for workflow system decision-making and to derive the key measurement indicators. It recognises the workflow system key variables of each CERT value and rates/ranks each key variable. Both the recognised key variables and the consensus rate/rank of each key variable can then be used in the Weighing round [10, 20].

Weighing. This is the final stage of the Delphi method (Round 3), which conducts a final evaluation to reach, and then reveal, the Delphi results. Round 3 thus determines the final criteria, e.g., a weighting system or a workflow model diagram [4, 7]. The Weighing round seeks the most reliable key indicator variables of CERT values to obtain the best workflow system performance. These key indicator variables are the workflow system running principles that determine the most appropriate information system behaviour and structure for the organisation to achieve its mission and goals [10, 20].
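The flow from Brainstorming through Narrowing Down to Weighing can be illustrated with a minimal sketch. It is not taken from the cited studies: the function names, the cut-off threshold, the use of mean ratings and all rating values are assumptions made for illustration only.

```python
from statistics import mean


def narrow_down(ratings, threshold):
    """Round 2 (sketch): keep criteria whose mean rating meets the threshold."""
    means = {criterion: mean(scores) for criterion, scores in ratings.items()}
    kept = {c: m for c, m in means.items() if m >= threshold}
    # Rank the surviving criteria from most to least important.
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


def weigh(ranked):
    """Round 3 (sketch): turn mean ratings into weights that sum to 1."""
    total = sum(score for _, score in ranked)
    return {criterion: score / total for criterion, score in ranked}


# Hypothetical Round 1 output: criteria proposed by the panel, each with
# importance ratings (1-10) from three experts. All values are made up.
round1_ratings = {
    "customer orientation": [8, 9, 7],
    "excellence": [9, 9, 9],
    "responsibility": [7, 8, 9],
    "teamwork": [6, 7, 5],
    "office layout": [2, 3, 1],
}

ranked = narrow_down(round1_ratings, threshold=5.0)
weights = weigh(ranked)
for criterion, weight in weights.items():
    print(f"{criterion}: {weight:.2f}")
```

In this sketch the low-rated criterion is dropped in Round 2 and the remaining criteria receive normalised weights in Round 3; a real study would derive both the cut-off and the weighting scheme from the panel.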
3.2 The Delphi Method Reliability

The Delphi method yields a comprehensive set of indicators, because it takes account of all important and different workflow information system aspects. It discovers workflow system issues and key performance indicators by weighing decision-making criteria, using the Delphi rounds as a multiple criteria decision-making process [4, 7, 17, 25]. This decision-making framework is used to identify and select the study experts, who participate as highly skilled professionals to provide a knowledgeable view, work experience or practice-based opinion. Knowledge of a professional field or business subject, or expertise in the BPM issues being investigated, can be used to explore ideas and structure group communication on a framework for rating or ranking the workflows. A key objective of the Delphi method is to achieve consensus and stability by allowing participants to reach consensus on the significant aspects of the workflow system [7, 30]. Consensus is achieved when a pre-arranged percentage of the participants come to an agreement on such BPM issues. Stability is reached when responses no longer change over the Delphi rounds; that is, stability is the consistency of responses between successive Delphi rounds [7]. Hence, consensus and stability are essential for ensuring that the Delphi results are obtained.

3.3 The Delphi Method Validity

Delphi is an expert analysis method that has proven appropriate and useful for studying business processes and workflow systems. Delphi has been used to construct, identify, select and validate key factors (KPIs) in a number of business studies [4, 7]. In practice, the Delphi method is used for examining the BPM workflow system validity
in the weighing round by means of a quantitative assessment of the reliability and validity of the workflow model. A quantitative measurement instrument can be used to define a specific business operations workflow using the multi-dimensional structure of CERT values [19, 20]. This gives empirical insights into the descriptive and predictive rules of the new workflow model. In a global Delphi study, Schmiedel et al. [10] had expert participants verify and evaluate the codification results to ensure the validity of the BPM study findings. Hence, the Delphi method has a key methodological role in ensuring the validity of its results, as experts can be asked to validate the study results in arriving at the final BPM workflow system findings and decision-making criteria. This gives consistent results based on realistic quantitative evidence.
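The consensus and stability criteria described in Sect. 3.2 can be made concrete with a short sketch. The definitions below are one plausible reading rather than the cited authors' procedure: the agreement level (a rating of 4 or more on a 5-point scale), the required share of 75% and the response data are all assumptions.

```python
def consensus_reached(responses, agree_at=4, required_share=0.75):
    """True when the share of ratings at or above `agree_at` meets the agreed percentage."""
    agreeing = sum(1 for r in responses if r >= agree_at)
    return agreeing / len(responses) >= required_share


def stability(previous, current):
    """Share of participants whose response did not change between two rounds."""
    unchanged = sum(1 for p, c in zip(previous, current) if p == c)
    return unchanged / len(previous)


# Hypothetical 5-point ratings from the same eight experts in two rounds.
round2 = [4, 5, 3, 4, 2, 5, 4, 3]
round3 = [4, 5, 4, 4, 3, 5, 4, 4]

print(consensus_reached(round2))
print(consensus_reached(round3))
print(stability(round2, round3))
```

A study would fix the agreement level and the required percentage before the first round, so that the stopping rule is not adjusted to fit the responses.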
4 Delphi Results

Research studies using the Delphi method have provided significant insights and findings. Delphi has achieved various improvements in finding consensus and determining the effectiveness of the criteria for measuring BPM workflow system performance [7]. The Delphi method has been used to define and examine the characteristics of business processes, which enables a better understanding of the workflow system and the measurement of effectiveness and efficiency levels. Previous Delphi studies have recognised CERT values (Table 2) as distinct key BPM values for measuring an enterprise’s workflow system [19, 20]. Therefore, CERT values and the Delphi method can be used to build a set of criteria (framework) that supports decision-making in order to measure workflow information system performance.

Table 2. CERT values to measure workflow information systems [20].

Value/CERT construct | Definition
Customer orientation | The rules, policies, and attitude to obtain the required customer relationship results
Excellence | Continuity of improvement and innovation in the enterprise’s workflow system performance
Responsibility | The courage and accountability to accomplish an enterprise’s business process objectives
Teamwork | The team members’ ability to resolve business process issues through a positive attitude
CERT business process management values are an enterprise’s decision-making principles that determine regular business behaviour and structures in the workflow system relational activities. The Schmiedel et al. studies [10, 20, 24] have reported findings that identify CERT values (Table 2) as important themes in the BPM study participants’ feedback. The results of further studies are now considered to explore the relationship between the Delphi method and CERT values, including the reliability, validity and results of the Delphi method.
4.1 The Relationship between the Delphi Method and CERT Values

CERT values have the ingredients to make a business environment accessible to workflow systems so as to improve their performance. The Delphi method can utilise CERT values as business process management core values to map the workflow procedures based on CERT classifications. Schmiedel et al. [10] report a Delphi study in which excellence (E) received the highest value ranking, followed by customer orientation (C) and responsibility (R), which were both ranked more highly than teamwork (T). These values represent core business process management elements for improving workflow information system performance [20, 24]. Thus, the Delphi method with CERT values can be used to measure and improve WIS performance.

4.2 Delphi Results

The Delphi method is able to develop key performance indicators by following its three principal rounds (Table 1). These decision-making rounds can measure workflow information system performance in a structured way: the first round finds or recognises the indicators, the second round validates the recognised indicators and ranks them based on their business practicality and objectives achievement, and the third round verifies the results so that they can be used as KPIs. In general, the Delphi method uses its rounds as a ranking decision-making model, for instance: 1) initial ranking (e.g., entity X greater than entity Y); 2) rating of the recognised entities (e.g., entity X = 9/10 while entity Y = 4/10); 3) comparison of entities X and Y based on their scale from 1 to 10 (e.g., entity X has five more identified elements than entity Y, which verifies the scale of X to Y) [4, 17, 30]. Delphi’s third round calculates the scale to confirm the rank of each entity [4, 17].
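The three-step ranking model can be sketched as follows. This is an illustration only: the function names are hypothetical, and the ratings are the example values for entities X and Y given in the text.

```python
def rank_entities(ratings):
    """Order entities from highest to lowest rating on the 1-10 scale."""
    return sorted(ratings, key=ratings.get, reverse=True)


def scale_difference(ratings, a, b):
    """Points by which entity `a` exceeds entity `b` on the rating scale."""
    return ratings[a] - ratings[b]


# Step 2 of the model: ratings out of 10 for the two example entities.
ratings = {"X": 9, "Y": 4}

order = rank_entities(ratings)              # steps 1 and 3: X ranked above Y
gap = scale_difference(ratings, "X", "Y")   # step 3: X exceeds Y by five points
print(order, gap)
```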
The completion of the Delphi method results in decision-making criteria that have been agreed by consensus over the rounds, and that can be used as a concept (model) to develop business performance by improving workflow information system performance. However, the Delphi rounds can be distributed over more than three stages of the study framework. Van Looy et al. [17] used Delphi over four rounds in a study relating to a business process maturity model (BPMM) to assess and improve business process maturity (Table 3).

Table 3. The Van Looy et al. study [17] toward a Business Process Maturity Model (BPMM).

BPMM study round | Delphi round
First | Brainstorming
Second – Third | Narrowing down
Fourth | Weighing
The Van Looy et al. Delphi study [17] concluded with five stages to develop a proof-of-concept of a BPMM decision-making tool (model): First, evaluating the scores of the collected BPMMs based on calculations and according to the achieved weightings; Second, based
on the BPMMs’ practicality and achievement, a questionnaire can be developed and tested in a pilot study; Third, the questionnaire answers are entered into a decision-making table of sample BPMMs, which navigates systematically to the most suitable BPMMs based on the final answers; Fourth, the BPMM proof-of-concept can be automated by the questionnaire and decision-making table; Fifth, the BPMM proof-of-concept is tested by case studies (e.g., managers who wish to start with a BPMM are asked to evaluate the BPMM and its output, and their satisfaction with the BPMM decision-making criteria and the selection process is then assessed).

The Schmiedel et al. Delphi study [10] has shown that CERT values are supportive of business process management success. The Delphi method’s controlled feedback contained many positive responses confirming the success of CERT values in developing workflow system performance (e.g., “I am committed to work with others to continually improve the performance of my business process to deliver excellent service/product to the customer and I take full responsibility for my actions”). The Schmiedel et al. study [19] on how business cultural values determine BPM success has confirmed that Delphi can help organisations to find unrecognised issues, which require feedback from experts and experienced professionals.

The Reliability of Delphi Results. The Delphi method has produced much empirical evidence and many insights regarding workflow information systems. Empirical studies require reliable measurement techniques that can be implemented as a comparison tool to deliver evaluation insights. The Delphi study technique, together with business process management values such as CERT values, can ensure reliability throughout the multi-stage process (Table 1), and through CERT values an organisation can ensure workflow information system reliability to achieve its objectives [20, 24].
To measure ECM workflow system performance, CERT values can be applied to evaluate the reliability of system performance in every Delphi study round (stage). For example, the Schmiedel et al. study [20] measured reliability using four sorting rounds based on the Delphi method, which delivered an average for the Kappa and Placement-Ratio measurement indexes (key indicators) in each round. This provided a testing index of reliability, which in round four reached a Kappa value > 0.6 and a placement ratio > 0.8 (see Table 4). Accordingly, appropriate agreement levels were achieved on the Kappa and Placement-Ratio key indicators, allowing the measurement mechanism to be applied in the application phase.

Table 4. The Schmiedel et al. [20] reliability levels in the testing index.

Index | Round 1 | Round 2 | Round 3 | Round 4
Kappa | 0.29 | 0.42 | 0.26 | 0.67
Placement-Ratio | 0.59 | 0.72 | 0.62 | 0.82
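The two reliability indexes in Table 4 can be illustrated with a small sketch. The text does not state how they were computed, so the sketch assumes Cohen's kappa for two judges and takes the placement ratio to be the share of item placements that match the intended construct. The sorting data are invented and do not reproduce the values in Table 4.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Agreement between two judges, corrected for chance agreement."""
    n = len(judge_a)
    observed = sum(1 for a, b in zip(judge_a, judge_b) if a == b) / n
    freq_a, freq_b = Counter(judge_a), Counter(judge_b)
    # Chance agreement: product of the judges' category proportions, summed.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def placement_ratio(all_placements, intended):
    """Share of placements, over all judges, that hit the intended construct."""
    hits = sum(
        1
        for placements in all_placements
        for placed, target in zip(placements, intended)
        if placed == target
    )
    return hits / (len(all_placements) * len(intended))


# Hypothetical sorting round: eight items, each intended for one CERT construct.
intended = ["C", "C", "E", "E", "R", "R", "T", "T"]
judge_a = ["C", "C", "E", "E", "R", "R", "T", "T"]
judge_b = ["C", "E", "E", "E", "R", "R", "T", "R"]

kappa = cohens_kappa(judge_a, judge_b)
ratio = placement_ratio([judge_a, judge_b], intended)
print(round(kappa, 2), ratio)
```

Under these assumptions, a sorting round would be accepted once both values exceed the levels quoted in the text (Kappa > 0.6 and placement ratio > 0.8).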
The Validity of Delphi Results. The Delphi method has been chosen to develop BPM workflow system studies because its iterative procedures enhance the validity of the study findings [17, 30]. The Delphi method focuses on expert feedback as an appropriate framework to construct, recognise, find and validate workflow information system key indicators or valuable factors, as in the several studies described by Quyên [4]. The most important advantage of the Delphi method is that it ensures the validity of the study findings by asking the experts to validate the responses. In Delphi, the validity of the study findings is examined in the third round. The purpose of the Delphi rounds (Table 1) is to construct validated indicators and factors (e.g., CERT business process management values, which are used to examine the validity of the study measurement and develop confirmatory factors through factor analysis) [10, 17, 20, 24, 30]. Also, comparing Delphi study findings with other current studies in the same area or subject allows an analysis of the validated Delphi results.

To measure ECM workflow system performance, CERT values can be applied to evaluate the validity of workflow system performance in each Delphi round. For example, the Schmiedel et al. study [20] validated the BPM construct of CERT values by measuring the contribution of the indicators formed for each distinct construct (C, E, R, T) to the total BPM construct values (CERT) using three criteria (Table 5). First, the CERT indicator weights were highly significant, which confirms the measurement procedures of the previous study stages. Second, the relationship between the BPM constructs was evaluated using the adequacy coefficient $R_a^2$. This showed that the formed distinct indicators match the aggregate BPM construct values.
Third, the study examined the BPM construct values for conceptual redundancy by separating the CERT influence from the BPM construct values, using a multicollinearity examination based on the variance inflation factor (VIF). The distinct indicators have VIF values below the restrictive limit of 3.30, which indicates no multicollinearity. Conversely, the BPM construct values (CERT) have VIF values between 3.66 and 5.27, which means the probability (p) of multicollinearity is increased. Overall, the study used the four options of Petter et al. [31] to assess p and ran independent-samples t-tests between the distinct indicators and the BPM construct values, comparing key demographics (e.g., industry sector and C, E, R, T report). The t-tests produced insignificant p values ranging between 0.45 and 0.71. Hence, multicollinearity has an insignificant effect, which means CERT values support BPM and workflow information systems in achieving the expected objectives.

Table 5. The Schmiedel et al. [20] validation of BPM construct values (CERT).

Indicator | Weight | Significance | VIF | Adequacy coefficient $R_a^2$
C | 0.55 | p < 0.001 | 1.74 | 0.83
E | 0.54 | p < 0.001 | 2.10 | 0.86
R | 0.55 | p < 0.001 | 1.89 | 0.84
T | 0.54 | p < 0.001 | 2.15 | 0.86
BPM construct values | | | | 0.87
C | 0.26 | p < 0.001 | 3.66 |
E | 0.27 | p < 0.001 | 4.53 |
R | 0.27 | p < 0.001 | 4.55 |
T | 0.27 | p < 0.001 | 5.27 |

4.3 Finding Key Performance Indicators

The Delphi method yields a higher quality and quantity of ideas than other BPM and workflow information system measurement methods for developing decision-making criteria. Its major advantage over other methods is the validation of the study findings, as experts are asked to validate the classification of their feedback in order to refine the final decision-making criteria [10]. The Delphi method has analytical rounds, which facilitate strategic decisions about which BPM dimensions an enterprise must improve (e.g., dimensions of CERT values whose key indicators are below average compared to the others should be considered an area of performance development in a new BPM strategic plan) [20, 24]. Researchers have found that the Delphi method has been successfully applied to obtain expert opinions, structure a group communication process and build consensus to achieve the study aim and objectives. Hence, the Delphi method is a valuable research development framework for eliciting participants’ experiences and views and, ultimately, their agreement [8]. Enterprises now need to study the future requirements of their workflow information systems in order to meet their expected objectives and required BPM workflow performance. Delphi has the capacity to capture the collective knowledge of an enterprise’s BPM workflow in order to find the key indicators to improve ECM workflow system performance [32].
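The idea that CERT dimensions scoring below average mark areas for performance development can be sketched in a few lines. The function name and the scores are hypothetical, and comparing each dimension with the simple mean is an assumption about how "below the average" would be applied.

```python
from statistics import mean


def below_average(scores):
    """Return the dimensions whose score is below the mean of all dimensions."""
    average = mean(scores.values())
    return [dimension for dimension, score in scores.items() if score < average]


# Hypothetical survey scores per CERT dimension (higher is better).
cert_scores = {"C": 6.0, "E": 6.5, "R": 5.0, "T": 4.5}

development_areas = below_average(cert_scores)
print(development_areas)
```

With these made-up scores the mean is 5.5, so responsibility (R) and teamwork (T) would be flagged for attention in a new BPM strategic plan.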
5 Conclusion

The Delphi method and CERT business process management values can be applied to evaluate how an enterprise’s workflow systems are performing. Decision-makers, specialists and experts can apply the Delphi method to measure workflow systems in practice, using CERT values to examine their BPM workflow performance [10, 24]. For example, Delphi has been applied in a study to develop organisational management information reports and a spider model diagram (Fig. 1) in order to explain BPM workflow performance in relation to CERT values [20]. As a research study technique, the Delphi method has also been implemented to develop decision-making criteria and key performance indicators, which can be used to improve workflow information system performance. However, the concentration on the Delphi study technique to measure WIS performance using CERT values does present a limitation on the use of the Delphi method.
Fig. 1. The Schmiedel et al. [20] CERT insights spider model diagram.
In practice, the Delphi method as a decision-making study technique should also be used in other study areas. It is therefore recommended to apply Delphi in further information system studies to formulate decision-making criteria or key performance indicators that improve performance and help achieve specific objectives. Further research can also apply Delphi with other organisational or BPM values. For example, the Delphi method can be used to measure other BPM dimensions such as financial performance in terms of efficacy and robustness. Further studies are also possible on dimensions such as business relations and social responsibility in the measurement and improvement of professional practice. The Delphi method has proven to be an appropriate framework for developing studies on BPM and workflow information systems. It has been used to construct, identify, select and validate KPIs for business process management. Also, CERT values allow an enterprise to evaluate its workflows by ranking business processes according to CERT classifications. Hence, the Delphi method and CERT values can be used to evaluate an ECM workflow system in order to improve system performance and ensure that the expected objectives are met [4, 10, 33]. The Delphi method is a study technique with unique characteristics, including its weighting results, which have made a significant contribution to various professional practices and research studies. Methodologically, the Delphi method is a successful framework through which to construct BPM key performance indicators. The Delphi rounds construct the initial BPM key indicators and then systematically structure them for use as measurement values to evaluate the performance of workflow information systems or any other enterprise system.
In summary, the Delphi method is suitable for measuring workflow system performance, and for BPM workflow practice and research studies in a range of areas such as: defining roles of stockholders; identifying issues and problems; finding key performance indicators; exploring critical issues; selecting a project team; forecasting an enterprise’s future and new strategies; answering research questions; developing service standards and/or
policy construction; and delivering sufficient evidence (insights) for decision-making to improve organisational performance [7, 8, 17, 24, 30, 34].
References

1. Bouaynaya, W.: Characterization of cloud computing reversibility as explored by the Delphi method. Inf. Syst. Front. 22(6), 1505–1518 (2020)
2. AbouGrad, H., Warwick, J., Desta, A.: Developing the business process management performance of an information system using the Delphi study technique. In: Reyes-Munoz, A., Zheng, P., Crawford, D., Callaghan, V. (eds.) TIE 2017. LNEE, vol. 532, pp. 195–210. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-02242-6_15
3. Ameyaw, E.E., Hu, Y., Shan, M., Chan, A.P.C., Le, Y.: Application of Delphi method in construction engineering and management research: a quantitative perspective. J. Civ. Eng. Manag. 22(8), 991–1000 (2016)
4. Quyên, Đ.T.N.: Developing university governance indicators and their weighting system using a modified Delphi method. Procedia - Soc. Behav. Sci. 141, 828–833 (2014)
5. Dalkey, N., Helmer, O.: An experimental application of the Delphi method to the use of experts. Manage. Sci. 9(3), 458–467 (1963)
6. Alarabiat, A., Ramos, I.: The Delphi method in information systems research (2004–2017). Electron. J. Bus. Res. Methods 17(2), 261–268 (2019)
7. Nworie, J.: Using the Delphi technique in educational technology research. TechTrends Link. Res. Pract. Improv. Learn. 55(5), 24–30 (2011)
8. Sitlington, H., Coetzer, A.: Using the Delphi technique to support curriculum development. Educ. + Train. 57(3), 306–321 (2015)
9. Gallego, D., Bueno, S.: Exploring the application of the Delphi method as a forecasting tool in information systems and technologies research. Technol. Anal. Strateg. Manag. (2014)
10. Schmiedel, T., vom Brocke, J., Recker, J.: Which cultural values matter to business process management? Bus. Process Manag. J. 19(2), 292–317 (2013)
11. Olusola, M., Sunday, I.: Evaluation of content management systems performance. IOSR J. Comput. Eng. 9(4), 62–69 (2013)
12. vom Brocke, J., Simons, A., Herbst, A., Derungs, R., Novotny, S.: The business drivers behind ECM initiatives: a process perspective. Bus. Process Manag. J. 17(6), 965–985 (2011)
13. Hullavarad, S., O’Hare, R., Roy, A.K.: Enterprise Content Management solutions - Roadmap strategy and implementation challenges. Int. J. Inf. Manage. (2015)
14. Guerrero-García, J.: A Methodology for Developing User Interfaces to Workflow Information Systems. Université Catholique de Louvain (UCL) (2010)
15. García, J.G., Vanderdonckt, J., González Calleros, J.M.: Developing user interfaces for community-oriented workflow information systems. In: Ubiquitous and Pervasive Computing, pp. 253–275. IGI Global (2010)
16. Guerrero-García, J., Vanderdonckt, J., Lemaige, C., Calleros, J.M.G.: How to describe workflow information systems to support business process. In: 2008 10th IEEE Conference on E-Commerce Technology and the Fifth IEEE Conference on Enterprise Computing, E-Commerce and E-Services, Jul., pp. 404–411 (2008)
17. Van Looy, A., De Backer, M., Poels, G.: Towards a decision tool for choosing a business process maturity model. In: Peffers, K., Rothenberger, M., Kuechler, B. (eds.) DESRIST 2012. LNCS, vol. 7286, pp. 78–87. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-29863-9_7
18. Müller, O., Schmiedel, T., Gorbacheva, E., vom Brocke, J.: Towards a typology of business process management professionals: identifying patterns of competences through latent semantic analysis. Enterp. Inf. Syst. 10(1), 50–80 (2016)
19. Schmiedel, T., vom Brocke, J., Recker, J.: Culture in business process management: how cultural values determine BPM success. In: vom Brocke, J., Rosemann, M. (eds.) Handbook on Business Process Management 2. IHIS, pp. 649–663. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-642-45103-4_27
20. Schmiedel, T., vom Brocke, J., Recker, J.: Development and validation of an instrument to measure organizational cultures’ support of business process management. Inf. Manag. 51(1), 43–56 (2014)
21. vom Brocke, J., Simons, A., Cleven, A.: Towards a business process-oriented approach to enterprise content management: the ECM-blueprinting framework. Inf. Syst. E-bus. Manag. 9(4), 475–496 (2011)
22. Harr, A., vom Brocke, J., Urbach, N.: Evaluating the individual and organizational impact of enterprise content management systems. Bus. Process Manag. J. 25(7), 1413–1440 (2019)
23. Jaakonmäki, R., Simons, A., Müller, O., vom Brocke, J.: ECM implementations in practice: objectives, processes, and technologies. J. Enterp. Inf. Manag. 31(5), 704–723 (2018)
24. Schmiedel, T., Recker, J., vom Brocke, J.: The relation between BPM culture, BPM methods, and process performance: evidence from quantitative field studies. Inf. Manag. 57(2), 103–175 (2020)
25. Van Looy, A., Shafagatova, A.: Business process performance measurement: a structured literature review of indicators, measures and metrics. Springerplus 5(1), 1–24 (2016). https://doi.org/10.1186/s40064-016-3498-1
26. Marutha, N.S., Ngulube, P.: Enterprise content management system implementation readiness to improve medical records management in Limpopo Province, South Africa. Libr. Philos. Pract. (2018)
27. Yousfi, A., Saidi, R., Dey, A.K.: Variability patterns for business processes in BPMN. Inf. Syst. e-Bus. Manag. 14(3), 443–467 (2015). https://doi.org/10.1007/s10257-015-0290-7
28. Vasilecas, O., Kalibatiene, D., Lavbič, D.: Rule- and context-based dynamic business process modelling and simulation. J. Syst. Softw. 122, 1–15 (2016)
29. Chang, J.F.: Business Process Management Systems: Strategy and Implementation. Auerbach Publications, Boca Raton, FL (2006)
30. Van Looy, A., Poels, G., Snoeck, M.: Evaluating business process maturity models. J. Assoc. Inf. Syst. 18(6), 461–486 (2017)
31. Petter, S., Straub, D., Rai, A.: Specifying formative constructs in information systems research. MIS Q. 31(4), 623–656 (2007)
32. Stitt-Gohdes, W.L., Crews, T.B.: The Delphi technique: a research strategy for career and technical education. J. Career Tech. Educ. 20(2), 55–67 (2004)
33. Hullavarad, S., O’Hare, R., Roy, A.K.: Enterprise content management solutions - roadmap strategy and implementation challenges. Int. J. Inf. Manage. 35(2), 260–265 (2015)
34. Schmiedel, T., Müller, O., Debortoli, S., vom Brocke, J.: Identifying and quantifying cultural factors that matter to the IT workforce: An approach based on automated content analysis. In: ECIS 2016 - Proc. 24th Eur. Conf. Inf. Syst., pp. 1–16, Istanbul, Turkey (2016)
A Fuzzy Epigenetic Model for Representing Degradation in Engineered Systems

Maria Seale1(B), R. Cody Salter1, Natàlia Garcia-Reyero2, and Alicia Ruvinsky1

1 US Army Engineer Research and Development Center, Information Technology Laboratory, Vicksburg, MS 39180, USA
{Maria.A.Seale,Richard.C.Salter,Alicia.I.Ruvinsky}@erdc.dren.mil
2 US Army Engineer Research and Development Center, Environmental Laboratory, Vicksburg, MS 39180, USA
[email protected]
Abstract. Degradation processes are implicated in a large number of system failures, and are thus crucial to understanding issues related to reliability and safety. Systems typically degrade in response to stressors, such as physical or chemical environmental conditions, which can vary widely for identical units that are deployed in different places or for different uses. This situational variance makes it difficult to develop accurate physics-based or data-driven models to assess and predict the system health status of individual components. To address this issue, we propose a fuzzy set model for representing degradation in engineered systems that is based on a bioinspired concept from the field of epigenetics. Epigenetics is concerned with the regulation of gene expression resulting from environmental or other factors, such as toxicants or diet. One of the most studied epigenetic processes is methylation, which involves the attachment of methyl groups to genomic regulatory regions. Methylation of specific genes has been implicated in numerous chronic diseases, and thus provides an excellent analog to system degradation. In this paper, we present a fuzzy set model for characterizing system degradation as a methylation process based on a set-theoretic representation for epigenetic modeling of engineered systems. This model allows us to capture the individual dynamic relationships among a system, environmental factors, and state of health. We demonstrate application of the model on a use case of corrosion of a metal specimen.

Keywords: Degradation · Fuzzy model · Prognostics and health management · System reliability · Epigenetics
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): SAI 2022, LNNS 507, pp. 420–435, 2022. https://doi.org/10.1007/978-3-031-10464-0_28

1 Introduction

Engineered systems have served critical functions in society for centuries. Some of the earliest known works of engineering include incredible structures, such as the Egyptian pyramids, the Roman Colosseum, and the Pont du Gard aqueduct [1]. History shows a parallel evolution of the complexity of engineering inventions and our increasing
dependency on them for meeting everyday needs. While the earliest engineering accomplishments, such as weapons, roads, and fortresses, were primarily for the joint good of a group of people, engineered systems today impact the daily personal lives of millions. In modern society, our basic needs of food, clothing, shelter, and water are highly dependent on complex engineered systems, while many other inventions provide non-essential but convenient services that enrich our lives. The increase in the complexity of engineered systems, as well as our dependency on them, necessitates improved means for ensuring the reliability of the systems to avoid catastrophic disruptions in critical areas of human welfare.

Prognostics and health management (PHM) is a relatively new discipline that promotes a pre-emptive approach to system reliability [2]. PHM evolved from condition-based maintenance (CBM) in which data about a component’s state of health is used to determine whether the component is functioning properly or if maintenance is required or recommended. Prognostics – predictive diagnostics – is concerned specifically with detecting degraded states and making predictions of when a system failure is likely to occur [2]. Simpler interval-based maintenance methodologies have proven inadequate for complex systems that can exhibit apparently random failures with little warning. These failures, rather than being random, can actually result from variations in workload, environmental conditions, and manufacturing [3]. Monitoring such systems can improve the prediction of failure states, aid diagnostics, and help optimize processes [3].

Goodman and colleagues [2] posit that traditional PHM methods such as model-based, physics of failure (PoF), and data-driven approaches are not accurate and are even misleading when used to predict failures.
They present an approach to prognostics that diverges from these conventional methods by using features constructed from leading failure indicators of condition-based sensor data to detect and monitor the progression of damage caused by degradation, and to estimate and detect functional failure of the component. This approach to failure prognostics is more closely aligned to our goal of representing system health as it evolves in conjunction with environmental and operational influences. For more information on topics and methods of PHM in general, we refer the reader to [4]. The overarching goal of this work is to provide a more effective way to represent engineered systems, factors, and processes that affect the state of health by leveraging concepts from epigenetics. The epigenetic engineering model will support the development of PHM algorithms that seamlessly integrate system state with individualized conditions, thereby improving prediction accuracy over population-based methods. In this paper, we focus specifically on representing the impact of environmental factors on system degradation through fuzzy sets. Representing this process is significant, as degradation processes are responsible to some extent for most system failures [5]. This paper is organized as follows. Section 2 includes background information on the bio-inspired concepts linking epigenetics to engineered systems, as well as a brief description of our previously developed set-theoretic model [6]. Section 3 follows with a description of the systems disease process as an analog to the epigenetic process of methylation. Section 4 extends our original model with fuzzy set constructs capable of representing new concepts for modeling system degradation as a methylation process.
We provide our conclusion and directions for future work in Sect. 5, followed by our acknowledgments.
2 Background

The field of natural computing involves the development of computational methods that are inspired by phenomena found in nature, along with the investigation of information processing as it occurs in system activities, such as organism growth and brain processes [7]. Natural computing bridges a multitude of physical and biological scientific disciplines with computer science, providing a mechanism for furthering comprehension in each area.

Bio-inspired computing is a subset of natural computing that is motivated specifically by biological processes. Well-known examples of these include traditional neural networks [8] and genetic algorithms [9]. However, [7] documents an incredible wealth of natural computing methods that range from DNA and molecular computing to artificial immune systems and swarm intelligence.

More recently, epigenetic processes have also been used to inspire computational models. Epigenetics provides insight into how environmental factors affect the expression (phenotype) of the state of an organism. For example, [10] generates algorithms by targeting epigenetic concepts related to both information representation and epigenetic operations, while [11] incorporates epigenetic factors with classical genetic algorithms to create an epigenetic algorithm (EGA). In another example, the integration of an epigenetic mechanism with genetic programming was shown to improve the performance of traffic signal controllers in [12].

As epigenetic processes inherently represent subtle influences of external factors over time that can eventually lead to reduced health, they provide a suitable model for processes that affect engineered systems in a similar manner, resulting in reduced performance or failure. The remainder of this section provides additional background on the analogs of epigenetics and health of engineered systems, as well as a set-theoretic model for representing these concepts.
2.1 Epigenetics and Methylation

DNA methylation, a process in which methyl groups are added to DNA molecules, is considered to be a primary interface between the genome and environment [13], and therefore provides a natural model for system degradation. Methylation is a principal mechanism of epigenetics, which is concerned with the study of chemical modifications of genes that determine how genetic information is expressed and used [14]. Methylation can occur in different genomic areas, such as intergenic regions, CpG islands within gene promoters, and the gene body [15]. Each of these areas produces different responses to methylation. For example, hypermethylation of the gene promoter region is associated with downregulation of gene expression, while hypomethylation of the same region is associated with upregulation of gene expression [16]. Gene promoter methylation is notably associated with the silencing of tumor suppressor genes and is thus of great interest in cancer research, as documented in an umbrella review [17]. The review reports a positive association between gene promoter methylation of protein-coding
genes and many types of cancer. The study also reveals that hypermethylation of specific combinations of genes is associated with certain types of cancers, such as bladder, breast, and gastric. These findings have considerable implications for cancer prognostics based on methylation patterns. The manner in which methylation of gene promoter regions depresses gene expression and leads to declining health is the basis for this work, and is discussed in more detail in Sect. 3.

2.2 Foundational Formal Model

The concept of representing system health degradation as a methylation process is presented here as an extension to a previously developed set-theoretic model for characterizing epigenetic concepts for engineered systems [6]. In this section, we offer the key ideas from that model as background for the current work. The reader is referred to [6] for more details.

We begin by asserting parallel views between the two seemingly disparate fields of epigenetic mechanisms in biological systems and prognostics for engineered systems. At the highest level, a system, S, is considered as an organism, and is described by a set of genes that represents its properties. The genes influence the observable properties of the system, understood as the phenotype. This phenotype encodes the health status of the system and is produced by considering environmental influences on the genes during periods of exposure. Examples of relevant environmental factors include humidity, temperature, and physical strain. Over time, these external elements damage system health in a way that may at first be negligible, but eventually leads to degraded performance and failure. This can be compared to the way that long-term exposure to environmental toxicants eventually leads to chronic illnesses such as cancer.

Table 1 provides the formal definitions for the elements comprising the epigenetic-based representational model for engineered systems.
Some slight changes in notation have been made from the original definitions in [6] to improve clarity. The three operators, phenotypically equivalent, epigenetically influences, and environmentally influences, referenced in the set definitions, are defined as follows.

Phenotypically equivalent ($\approx_p$): for $\langle x, y \rangle, \langle u, v \rangle \in R_S$,

\[ \langle x, y \rangle \approx_p \langle u, v \rangle \iff y = v \tag{1} \]

Epigenetically influences ($\rightarrow_e$): for $g \in G_S$, $p \in P_S$,

\[ g \rightarrow_e p \iff \Delta g \rightarrow \Delta p \tag{2} \]

Environmentally influences through $g$ ($\rightarrow_{i:g}$): for $i \in I_S$, $g \in G_S$, $p \in P_S$,

\[ i \rightarrow_{i:g} p \iff \begin{cases} \text{a. } \exists f(i, g, \ldots) \rightarrow \Delta p, \text{ or} \\ \text{b. } \Delta i \rightarrow \Delta g \text{ and } g \rightarrow_e p \end{cases} \tag{3} \]
Table 1. Formal set definitions.
For this work, we are primarily interested in Definition (3)a, which describes a change in the phenotype p resulting from a functional process involving an environmental factor i, and a gene g that influences p. The ellipsis indicates that other potential parameters may be included.
3 Disease Process Model

The model described in the previous section provides a framework to represent static concepts within the domain and general operators for environmental and epigenetic influences on the phenotypic expression of a system “genome”. However, to achieve the goal of a paradigm-shifting model for prognostics of engineered systems, we must incorporate dynamic components that capably represent the interactions among environment, system status, and resulting health. Leveraging the ideas presented, we now extend the epigenetic model to encompass concepts of disease and disease states in engineered systems.

A mechanism for explaining biological phenotypic variations that begins with environmental stress leading to methylation, followed by transcriptomic and proteomic differences is described in [13]. We adapt the general principles of this process model for our formal representation of degradation in engineered systems, beginning with an examination of failure modes in mechanical systems.
3.1 Failure Modes

The field of systems engineering has well-defined concepts for issues related to reliability and failure. A brief review of the literature uncovers a wealth of methodologies for predicting reliability at various stages of the engineering design process. Two techniques commonly used to predict reliability are load-strength interference [18] and reliability block diagramming [19]. The load-strength interference method uses probability distributions to represent the various loads experienced by a part over a given period of time and the strength of a part over a given period of time. Failure is reached when the load exceeds the strength of the part at a particular point in time, and the reliability is calculated as the probability that the load will not exceed the strength. Unfortunately, this technique can be inaccurate due to the complexity of accurately predicting the loads experienced by a part and the strength expressed by a part over its useful life. These inaccuracies may result in parts that are unnecessarily strong or parts that are too weak to meet requirements.

Reliability block diagramming, on the other hand, utilizes collections of series and parallel blocks of defined failure probability to predict the reliability of complex systems. To produce accurate reliability predictions, this technique requires accurate failure probabilities for each component within the block diagram, which can be difficult to obtain. Collectively, these two techniques summarize the inherent complexity of reliability prediction.

Engineers define reliability as “the probability that an item will perform a required function without failure under stated conditions for a stated period of time” [18]. This definition has four critical components, but the key to understanding reliability is understanding failure.
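As a rough illustration of the two techniques described above, the following sketch computes a load-strength interference reliability and combines component reliabilities for series and parallel blocks. It takes reliability to be the probability that strength exceeds load (the complement of the failure probability) and assumes, purely for simplicity, that load and strength are independent and normally distributed. The function names and all numeric inputs are hypothetical and are not taken from [18] or [19].

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interference_reliability(mu_strength, sd_strength, mu_load, sd_load):
    """P(strength > load), assuming independent normal load and strength."""
    margin_mean = mu_strength - mu_load
    margin_sd = math.sqrt(sd_strength ** 2 + sd_load ** 2)
    return normal_cdf(margin_mean / margin_sd)


def series_reliability(reliabilities):
    """A series block works only if every component works."""
    return math.prod(reliabilities)


def parallel_reliability(reliabilities):
    """A parallel block fails only if every component fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Hypothetical part: strength ~ N(500, 30) and load ~ N(400, 40).
part = interference_reliability(500.0, 30.0, 400.0, 40.0)

# Two such parts arranged in series versus in parallel.
system_series = series_reliability([part, part])
system_parallel = parallel_reliability([part, part])
```

The sketch also shows why both techniques are sensitive to their inputs: the result depends entirely on the assumed distribution parameters and component reliabilities, which the text notes are difficult to obtain accurately.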
To understand and predict failure, however, one must first understand the relationships between material properties, operating profile, environmental conditions, and geometry for each critical, unique part within an engineered system [19]. Certain combinations of these variables lead to certain classes, or modes, of failure, which are seen across the engineering discipline. As a result, the observation of certain failure modes can enable the diagnosis of problems within an engineered system.

We now consider defined failure modes for mechanical systems as an analog to biological disease. As described in [20], three accepted categories of mechanical failures include those due to (1) operating load, (2) environment, and (3) poor manufacturing quality. Each of these is considered with respect to its engineering definition, impact, and corresponding biological concept.

Operating Load. Operating load failures include those due to excessive strain on a system, with subcategorizations including tensile-strength-load failures, compressive strength failures, brittle fractures, fatigue failures, and more. These breakdowns can be responsible for secondary loss, as when the metal in a piston ring cracks from fatigue and scores the cylinder wall, resulting in a corresponding loss of cylinder pressure.

Environmental Load. Corrosive environments, oxidation, and temperature can lead to mechanical component failure. Environmental loads cause damage over time that varies in extent based on the magnitude of the condition. For example, simple oxidation of metal bolts in humid environments can reduce strength, eventually resulting in failure related to operating loads.
Poor Manufacturing Quality. Defects introduced by manufacturing processes, such as welding, heat treatment, casting, or machining, can result in products that do not meet the intended load requirements. These defective products typically fail soon after they are put into use.

These general failure modes and conditions can be compared to organismal states in a way that provides insights for definition of a mechanical process that is similar to a biological disease process. Operating loads can be viewed as stress-inducing factors, such as emotional or physical trauma, or high levels of anxiety. Just as biological stress does not always have an immediate phenotypic effect although epigenetic changes have been set in motion, so also do mechanical stressors leave invisible changes that may eventually lead to failure. The environmental load of physical systems can likewise be compared to environmental influences from diet or atmospheric toxicants on an organism. For both, the magnitude of the load and the duration of exposure are significant factors for determining whether the eventual outcome results in failure/disease or if the progression can be halted or even reversed before an adverse effect is realized. Finally, we consider manufacturing defects to be analogous to an organism’s genetic code at birth, where certain inheritable epigenetic variations are associated with higher-than-normal risks for certain diseases. Just as the inheritance of a genetic code does not guarantee the disease will be expressed during the organism’s life, possibly due to mitigating factors from diet, medications, etc., some engineered components may have quality defects but perform acceptably in the environment in which they operate.

3.2 Methylation as a Failure Model

The methylation process can be used to represent each of the three failure modes given. Research reported in [13] investigates the effects of small levels of methylation on final disease phenotype.
While [13] is concerned with determining biologically meaningful changes in the outcome, we concern ourselves with the engineering analogy of representing functionally meaningful changes in a system using a methylation model. A functional failure, as defined in [2], is the point at which a component no longer operates within given specifications due to damage. As previously mentioned, methylation is one of the primary mechanisms through which the environment causally influences phenotypic expression and is one of the most studied epigenetic processes [21]. Methylation of CpG islands, for example, has been shown to have a clear association with both the onset and the development of disease conditions [13]. Two distinct models of methylation exist that result in different timelines for expression of disease. In the first model, hypermethylation of certain regulatory regions results in sudden on/off switches to genetic activity. This has been documented in oncological studies where, for example, methylation of known tumor suppressor genes has been implicated in sudden aggressive tumor growth. In the second model, much less dramatic levels of methylation on genes result in phenotypical changes that occur long after methylation has been observed. This model is witnessed more often in chronic diseases like diabetes and is associated with comparatively long latency periods between methylation and the eventual phenotypical effect [13].
Both of these models are applicable to engineered systems. While many sudden systems failures (analogous to the hypermethylation model discussed above) are seemingly impossible to predict, there are cases where environmental factors, when considered in conjunction with a sufficient representation of system health, can provide adequate context through which such predictions are conceivable. An example is that of an abrupt structural failure, such as the breaking of a support component, where the impending failure could be anticipated if sufficient real-time conditions were monitored. In particular, PHM ideally can allow us to predict incipient faults before they progress to a failure state [22].

For this research, we consider a reduction of these two cases to a single model in which progressive, low levels of methylation result in sub-optimal health at a future time. Ongoing environmental loads contribute to additional degradation, resulting in an additive methylation model that eventually reaches a failure state. This profile more closely represents the common evolution of component status from an initial healthy state through progressively more degraded states, eventually leading to failure if no intervening measures are taken.

3.3 Model Postulates for Methylation

To extend the model described in Sect. 2 to support the representation of a methylation process, we begin with a set of assertions. First, we presume that there is a positive correlation between gene expression and system health. That is, the more a gene is expressed, the healthier the system with respect to the particular phenotype(s) influenced by that gene. Methylation of a gene occurs as a response to environmental factors and acts to repress the gene’s expression. As a result, healthy components begin their service with no methylation, while components with manufacturing defects can be represented with some initial methylation pattern.
Next, partial methylation of the gene(s) responsible for a phenotype results in a degradation of health status, while complete methylation of the gene(s) represents the phenomenon of “permanent gene silencing,” which translates to a complete/catastrophic failure event. The expression of these additional functions necessitates extensions to the model definitions, which are presented in the following section.
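The postulates above can be illustrated with a minimal sketch of additive methylation accumulation. The per-period increments, the initial methylation assigned to a defective component, and the function names are hypothetical choices made only for illustration; the paper does not prescribe a specific accumulation rule beyond methylation being additive and complete methylation corresponding to failure.

```python
def accumulate_methylation(initial, increments):
    """Return the methylation level after each exposure period.

    Levels add up over time and are capped at 1.0, which stands for
    complete methylation (permanent gene silencing) in the postulates.
    """
    levels = []
    level = initial
    for delta in increments:
        level = min(1.0, level + delta)
        levels.append(level)
    return levels


def first_failure(levels):
    """Index of the first period with complete methylation, else None."""
    for period, level in enumerate(levels):
        if level >= 1.0:
            return period
    return None


# Hypothetical environmental load: the same increment in every period.
load = [0.25, 0.25, 0.25, 0.25, 0.25]

# A healthy component starts with no methylation; a component with a
# manufacturing defect starts with some initial methylation (assumed 0.5).
healthy = accumulate_methylation(0.0, load)
defective = accumulate_methylation(0.5, load)
```

Under the same environmental load, the component that starts with an initial methylation pattern reaches the failure state earlier, which mirrors the treatment of manufacturing defects in the postulates.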
4 Fuzzy Model

The formal model defined in Sect. 2 has been extended to represent the methylation process as an analog to damaging influences exerted on engineered systems. The health status of individual phenotypes is a result of the degree of contribution of the influencing gene(s) to that phenotype plus the expression ability of the gene(s) as represented by the methylation level. The methylation level, in turn, results from specific environmental conditions. Figure 1 illustrates this process for a single system phenotype that depends on N genes. Finally, the overall health of a system (not shown) is taken as a combination of the health indicators of its constituent phenotypes.
Fig. 1. General diagram of the epigenetic-based degradation process. Environmental factors impact the health of individual genes through a methylation function (represented by the circles labeled “M”). In this work, increased methylation translates to increased suppression of the gene, and a resulting negative impact on the health of the phenotype. The contribution levels and the resulting health of each gene are combined to produce a revised individual impact score. These scores are then considered together to effect a change in the health of the system phenotype.
We begin by augmenting the original model with new definitions and functionality, then incorporate fuzzy sets to represent the relationships and the degradation process. To illustrate the model, we then provide an example based on a use case of corrosion of a cast iron specimen, first described in [6]. The set representations for the use case are provided in Table 2. In Sect. 2, we noted that hypermethylation of the gene promoter region is associated with downregulation of gene expression [16]. As a result, the impact of methylation on system health can be viewed as a three-step process where the environmental conditions first cause methylation of a gene, and the resulting suppression of gene expression results in degraded health. To represent this relationship, we must associate the three elements of environment, gene, and phenotype.
Table 2. Sets for corrosion example.

Set of properties/genes: $G_S$ = {composition, density, mass, specimen dimensions, manufacturing process, surface finish}
Set of environmental factors: $I_S$ = {exposure period, temperature, relative humidity, precipitation, time of wetness, concentration of atmospheric contaminants}
Set of phenotypes: $P_S$ = {change in color, change in texture, mass loss}
Set of gene/phenotype pairs: $R_S$ = {⟨composition, color⟩, ⟨surface finish, color⟩, ⟨composition, texture⟩, ⟨surface finish, texture⟩, ⟨composition, mass loss⟩}
Set of environmental factor/phenotype pairs: $U_S$ = {, ⟨exposure period, texture⟩, ⟨contaminants, mass loss⟩}
Epigenome: $E_S$ = {{composition, surface finish}, {composition}}
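As a small sketch of how the epigenome in Table 2 follows from the gene/phenotype pairs, the snippet below groups the pairs of $R_S$ by phenotype, which corresponds to forming the classes of the phenotypically equivalent relation in (1). The variable and function names are hypothetical. The pair (composition, color) is assumed to belong to $R_S$, consistent with the gene sets listed for $E_S$, and the environmental-influence restriction is assumed to hold for every pair rather than being checked.

```python
from collections import defaultdict

# Gene/phenotype pairs R_S for the cast iron specimen (Table 2).
# Assumption: ("composition", "color") is a member of R_S, consistent
# with the gene set {composition, surface finish} listed in E_S.
R_S = {
    ("composition", "color"),
    ("surface finish", "color"),
    ("composition", "texture"),
    ("surface finish", "texture"),
    ("composition", "mass loss"),
}


def genes_by_phenotype(pairs):
    """Group genes whose pairs share a phenotype (the classes of ≈p)."""
    classes = defaultdict(set)
    for gene, phenotype in pairs:
        classes[phenotype].add(gene)
    return {p: frozenset(genes) for p, genes in classes.items()}


by_phenotype = genes_by_phenotype(R_S)

# E_S keeps only the distinct gene sets and drops the phenotype labels,
# so color and texture collapse into a single member set.
E_S = set(by_phenotype.values())
```

Keeping the dictionary `by_phenotype` instead of the bare set `E_S` retains the link between each gene set and the phenotype it influences.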
4.1 Fuzzy Representations of Impact and Methylation

The first step is to modify the representation of the epigenome. The original definition shown in Table 1 for $E_S$ creates a set of sets of genes such that each member set contributes to an associated phenotype (health attribute). The affected phenotypes are not maintained in this set, but can be obtained through the relations in $R_S$. We modify the definition previously given for $E_S$ in Table 1 to associate the relevant phenotype with each member set through the use of tuples, yielding:

\[ EG_S = R_S / \approx_p \; = \{ \langle \{g\}, p \rangle \mid \langle g, p \rangle \approx_p \; \forall \langle g, p \rangle \in R_S \text{ and } \exists i \in I_S \text{ s.t. } i \rightarrow_{i:g} p \} \tag{4} \]

That is, we (1) restrict the elements in $R_S$ to those that meet the environmentally influences definition for the genes $g$, and (2) maintain the association between each phenotype and its influential set of genes. In this case, the function that links the effects of the environment to a phenotype through interaction with a gene is taken to represent a methylation process. As this process will vary for different phenotypes and systems, it is beyond the scope of this paper to explore specific process representations; however, we describe a generic methylation function later in this section to illustrate the model concepts.

We define the term epigene to be a gene that is influential on system health, and is thus associated with at least one phenotype of the system; therefore, genes that are members of a set in $EG_S$ are identified as epigenes. As described earlier, methylation in biological systems occurs at gene promoter regions, and the degree of methylation of the region affects the gene’s ability to express itself. The degree of methylation of an epigene and the epigene’s level of influence on the health status of a phenotype are represented as a pair of fuzzy sets, $\tilde{M}_p$ and $\tilde{L}_p$, respectively, for each phenotype $p$ over its epigenome.
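These sets are defined below, where $e$ is an epigene, $e_M$ is the methylation value assigned to $e$, and $e_L$ is the level of impact of $e$ on the health of the phenotype. The fuzzy membership functions are represented by $\mu$.

\[ \tilde{M}_p = \{ (e, \mu_{\tilde{M}_p}(e_M)) \;\; \forall e \in x \text{ s.t. } \langle \{x\}, p \rangle \in EG_S,\; 0 \le e_M, \mu_{\tilde{M}_p}(e_M) \le 1 \} \tag{5} \]

\[ \tilde{L}_p = \{ (e, \mu_{\tilde{L}_p}(e_L)) \;\; \forall e \in x \text{ s.t. } \langle \{x\}, p \rangle \in EG_S,\; 0 \le e_L, \mu_{\tilde{L}_p}(e_L) \le 1 \text{ and } \sum_{i=1}^{n} e_{L_i} = 1 \} \tag{6} \]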
The set $\tilde{M}_p$ represents the degree of methylation produced by the environment over a set of epigenes that influence a single phenotype $p$. These values are used to assess the ability of the gene to contribute to the healthy status of the phenotype. Higher values represent more methylation and therefore a reduced ability to promote health. The set $\tilde{L}_p$ represents the impact that a set of epigenes has on the phenotype, and can also be viewed as a degree of relevance of each epigene to the expression of a phenotype. Both the methylation level, $e_M$, and the contribution level, $e_L$, are represented by values between 0 and 1, and we further restrict the membership values, $\mu(e_{M/L})$, to also fall between 0 and 1. Furthermore, the levels of influence $e_L$ of all of the genes in $\tilde{L}_p$ must sum to 1. This means that the set of epigenes associated with a particular phenotype represents all the epigenes that influence that phenotype, and there is none excluded from the set.

The membership functions will vary between systems, and are identified based on known or approximated physical processes. For this paper we use generic functions that illustrate the concepts of methylation and the impact on system health. We take $\mu_{\tilde{L}_p}(e_L)$ to be the identity function, such that

\[ \mu_{\tilde{L}_p}(e_L) = e_L, \text{ for } 0 \le e_L \le 1, \tag{7} \]

indicating that the degree of membership in the fuzzy set is the same as the level of impact of a gene on a phenotype. For methylation, we use a parabolic function for illustration, where

\[ \mu_{\tilde{M}_p}(e_M) = \alpha (e_M)^2 + \beta (e_M) + \gamma, \text{ for } 0 \le e_M \le 1. \tag{8} \]

This higher order equation better represents the biological model in which methylation can build gradually over time, with less effect in the beginning, but producing more drastic results with additional accumulation. For our illustration here, we omit the second and third terms by setting $\beta, \gamma = 0$ and $\alpha = 1$. These two membership functions are shown as the red and blue lines in Fig. 2. A simple combination that represents the ability of an epigene, $e$, to express a healthy phenotype given its impact level and degree of methylation is shown as the green lines of Fig. 2, and is represented by (9).

\[ \mu_{\tilde{T}_p}(e) = \mu_{\tilde{L}_p}(e_L) - \left( \mu_{\tilde{L}_p}(e_L) \ast \mu_{\tilde{M}_p}(e_M) \right) \tag{9} \]
Fig. 2. Example membership functions for epigene level of influence on a phenotype (red lines) and environmentally induced methylation level (blue lines). The green lines represent the combination of these that yields an impact score. The green curve in the top plot represents the results when impact and methylation values are equivalent, and is intended to show a general trend. The green curve in the bottom plot represents the specific case where epigene impact is set at 0.6. These figures show how increased methylation of an epigene reduces its ability to contribute to a healthy phenotype.
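The membership functions in (7) through (9) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `mu_L`, `mu_M`, and `mu_T` are hypothetical, and the default coefficients follow the illustrative setting of $\alpha = 1$ and $\beta, \gamma = 0$ used in the text. The final lines sweep the methylation level for a fixed epigene impact of 0.6, the case described for the bottom plot of Fig. 2.

```python
def mu_L(e_L: float) -> float:
    """Identity membership for the level of impact, as in (7)."""
    if not 0.0 <= e_L <= 1.0:
        raise ValueError("e_L must lie in [0, 1]")
    return e_L


def mu_M(e_M: float, alpha: float = 1.0, beta: float = 0.0,
         gamma: float = 0.0) -> float:
    """Parabolic membership for the methylation level, as in (8)."""
    if not 0.0 <= e_M <= 1.0:
        raise ValueError("e_M must lie in [0, 1]")
    return alpha * e_M ** 2 + beta * e_M + gamma


def mu_T(e_L: float, e_M: float) -> float:
    """Ability of an epigene to express a healthy phenotype, as in (9)."""
    impact = mu_L(e_L)
    return impact - impact * mu_M(e_M)


# Sweep methylation from 0 to 1 in steps of 0.1 for a fixed impact of 0.6.
curve = [(m / 10, mu_T(0.6, m / 10)) for m in range(11)]
```

With these coefficients the combined value starts at the impact level when there is no methylation and falls to zero at complete methylation, with most of the decline occurring at higher methylation levels.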
This formulation will always result in a value between 0 and 1, and represents a membership value in a third fuzzy set, $\tilde{T}$. The second term tempers the degree of methylation by the impact level, which is then subtracted from the impact level to give a health value of the epigene. The value can be interpreted as the ability of the epigene to contribute to the health of the phenotype, based on its degradation from environmental factors. In other words, it assesses how healthy the gene is given its environmental stressors, in consideration with how important that gene is to the specific phenotype. The fuzzy set definition is shown in (10).

\[ \tilde{T}_p = \{ (e, \mu_{\tilde{T}_p}(e)) \;\; \forall e \in x \text{ s.t. } \langle \{x\}, p \rangle \in EG_S \} \tag{10} \]

Now that each epigene’s health has been calculated, we can use this information to provide an overall assessment of the health of a phenotype, $H_p$. Since the previous membership functions have accounted for both (1) the ability of each epigene to express a healthy phenotype, and (2) the degree to which this ability has been degraded through a methylation process, we achieve the final scalar health value by summing the membership values in $\tilde{T}_p$ and dividing by the number of contributing epigenes, $n$.

\[ H_p = \frac{\sum_{i=1}^{n} \mu_{\tilde{T}_p}(e_i)}{n} \tag{11} \]

4.2 Example of the Fuzzy Set Model

In this section, we provide an example of the application of the fuzzy set model. Earlier, we defined a simple “epigenome” for a cast iron specimen as

\[ EG_S = \{ \langle \{\text{composition}, \text{surface finish}\}, \text{color} \rangle, \langle \{\text{composition}, \text{surface finish}\}, \text{texture} \rangle, \langle \{\text{composition}\}, \text{mass loss} \rangle \} \]

Because there are three phenotypes represented in the epigenome, there are three pairs of fuzzy sets, shown below. For simplicity, we omit giving values for $e_M$ and $e_L$, and instead provide their fuzzy membership values as already computed.

1. $\tilde{M}_{\text{color}}$ = {(composition, .3), (surface finish, .6)}
2. $\tilde{L}_{\text{color}}$ = {(composition, .2), (surface finish, .8)}
3. $\tilde{M}_{\text{texture}}$ = {(composition, .7), (surface finish, .6)}
4. $\tilde{L}_{\text{texture}}$ = {(composition, .4), (surface finish, .6)}
5. $\tilde{M}_{\text{mass loss}}$ = {(composition, .4)}
6. $\tilde{L}_{\text{mass loss}}$ = {(composition, 1)}
Note that the same epigene(s) in different combinations can be responsible for multiple phenotypes. Moreover, the methylation and level of impact of the same epigene can be different depending on the phenotype being represented. For example, composition has a bigger impact on texture in $\tilde{L}_{\text{texture}}$ than it does on color in $\tilde{L}_{\text{color}}$. Also note that a single epigene is wholly responsible for $\tilde{L}_{\text{mass loss}}$.
A Fuzzy Epigenetic Model for Representing Degradation
We now use (9) to determine how capable the epigenes are of supporting a healthy state of each phenotype, based on their level of methylation. This results in the fuzzy sets:

$\tilde{T}_{\text{color}}$ = {(composition, .14), (surface finish, .32)}
$\tilde{T}_{\text{texture}}$ = {(composition, .12), (surface finish, .24)}
$\tilde{T}_{\text{mass loss}}$ = {(composition, .6)}

Finally, we sum the respective membership values and divide by the number of corresponding epigenes, per (11), to produce a health indicator per phenotype:

$$H_{\text{color}} = \frac{.14 + .32}{2} = .23$$

$$H_{\text{texture}} = \frac{.12 + .24}{2} = .18$$

$$H_{\text{mass loss}} = \frac{.6}{1} = .6$$
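The arithmetic of this worked example can be checked with a short script. This is a minimal sketch, not the authors' implementation: equation (9) is not reproduced in this section, so the code assumes it has the form impact minus impact times methylation, which matches the verbal description ("the second term tempers the degree of methylation by the impact level, which is then subtracted from the impact level") and reproduces the listed values. The names `M`, `L`, `epigene_health` and `phenotype_health` are illustrative.

```python
# Membership values copied from the worked example.
M = {  # methylation membership per phenotype and epigene
    "color": {"composition": 0.3, "surface finish": 0.6},
    "texture": {"composition": 0.7, "surface finish": 0.6},
    "mass loss": {"composition": 0.4},
}
L = {  # level-of-impact membership per phenotype and epigene
    "color": {"composition": 0.2, "surface finish": 0.8},
    "texture": {"composition": 0.4, "surface finish": 0.6},
    "mass loss": {"composition": 1.0},
}


def epigene_health(impact, methylation):
    # ASSUMED form of (9): impact minus methylation tempered by impact.
    return impact - impact * methylation


def phenotype_health(phenotype):
    """Return the fuzzy set T for a phenotype and its mean, as in (11)."""
    t = {
        epigene: epigene_health(L[phenotype][epigene], M[phenotype][epigene])
        for epigene in L[phenotype]
    }
    return t, sum(t.values()) / len(t)


for phenotype in L:
    t, h = phenotype_health(phenotype)
    print(phenotype, {e: round(v, 2) for e, v in t.items()}, round(h, 2))
```

Rounding to two decimals is only for display; the underlying values are ordinary floating-point numbers.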
The interpretation of these values, along with the setting of the initial model values and the membership functions, is dependent on the particular system being represented and intended to be used with the guidance of a subject matter expert. The extreme values of 0 and 1, however, always represent completely unhealthy (non-functional) and completely healthy conditions, respectively, and appropriate partitionings of health values can be defined on an individual system and phenotype basis.
5 Conclusion and Future Work

The objective of this paper is to demonstrate that degradation of engineered systems can be effectively and intuitively represented by a fuzzy set-based model that mirrors the biological process of DNA methylation. This model extends our previous work in which we developed the initial framework for representing system health through epigenetic concepts. While this work is in its early stages, the essential associations between the biological and engineered systems domains have been established, along with the necessary mechanisms for representing degradation processes, thereby providing a foundation from which to explore more complex process models. The use of fuzzy methods for representing dynamic processes allows us to leverage computational advances in areas such as gene regulatory network inferencing to account for interdependencies among system components. We intend to extend this work to include other epigenetic processes, such as chromatin modification, as well as system-level epigene interactions that must be assessed together to provide a health status for multi-level representations. The extended framework will be compared to existing methods for modeling real-world degradation processes. This research can provide a platform to enable the future of intelligent engineered systems that are able to assess their current health status and predict the potential evolution of degraded states over a future timeline.

Acknowledgments. The use of trade, product, or firm names in this document is for descriptive purposes only and does not imply endorsement by the U.S. Government. The tests described and
the resulting data presented herein, unless otherwise noted, are based upon work conducted by the US Army Engineer Research and Development Center supported under PE 601102A, Project AB2, Task 04 ‘Unique Biological Processes and Data Analytics’. Permission was granted by the Computational Science and Engineering Division Chief, Information Technology Laboratory, to publish this information. The findings of this report are not to be construed as an official Department of the Army position unless so designated by other authorized documents.
A Voting Ensemble Technique for Gas Classification

M. Jaleel¹(✉), A. Amira², and H. Malekmohamadi¹,²,³

¹ The Gateway House, Leicester, UK
[email protected], [email protected]
² Division: College of Computing and Informatics, University of Sharjah, Sharjah, UAE
[email protected]
³ Institute of Artificial Intelligence, De Montfort University, The Gateway House, Leicester, UK
Abstract. This article discusses the factors that influence gas classification results, such as data pre-processing and the type of classifier, which are two important factors in the electronic nose algorithm. Early in the data processing, machine learning algorithms such as K-Nearest Neighbours (k-NN) and Support Vector Machine (SVM) are predominantly used for classification of the gas data. A number of studies have been conducted over the past few years on the use of machine learning and neural networks for gas classification. The focus of this paper is on gas classification and identification using individual machine learning techniques (Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbours (k-NN), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF)) and ensemble techniques (Stacking and Voting). Six different gases and a 4 × 4 sensor array are used for data collection. Using data collected by the sensor array, the proposed system is shown to be more accurate than the individual classifiers. An improved accuracy of 98.04% is achieved by using the Voting Classifier.

Keywords: Artificial intelligence · Committee machine learning · Ensemble learning · Voting classifier · Stacking classifier · Sensor array · Classification
1 Introduction

Gas identification systems (GIS), also referred to as machine olfaction systems or Electronic Nose (EN) devices, have attracted many researchers' attention because of their significance in a range of fields such as health monitoring, environmental monitoring, the military, food quality, and safety [1]. Monitoring complex environments over a prolonged period of time is now possible by using wireless gas sensors that follow a system-based approach. A great deal of emphasis is placed on sensor reliability in the development of these systems because sensors are the most important component of gas or odour identification systems [2]. It is therefore reasonable to view gas classification as a crucial component of gas safety procedures. It is also important to understand how classification systems work when dealing with gases [3].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 436–444, 2022. https://doi.org/10.1007/978-3-031-10464-0_29
A GIS requires hardware with two main building blocks: sensing and processing. The sensing block may consist of a single sensor or of several sensors (temperature, pressure, atmospheric pressure, etc.). Arrays of sensors improve selectivity so markedly that they make up much of the current technology. In this work we focus on the processing of the data: algorithms analyse all the data acquired from the sensing block and determine parameters such as pressure and temperature. According to some studies, the accuracy of an electronic nose system depends on a number of factors, including the pre-processing algorithm and the classifier architecture.

A system based on Zynq system-on-chip (SoC) hardware and a MATLAB software implementation for gas classification is proposed in [3]. Feature reduction algorithms based on Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are used. The data was obtained from a 4 × 4 tin-oxide (SnO2) gas sensor array, fabricated in-house, and seven commercial Figaro gas sensors. LDA provided the best accuracy and was on average 50% faster than PCA in terms of hardware resources and computational time, with an operating frequency of 122 MHz. In addition to classical machine learning, semi-passive Radio Frequency Identification (RFID) tags are used in [4] to integrate gas temperature sensors and to reduce power consumption. The authors used a 4 × 4 in-house-fabricated tin-oxide (SnO2) array and seven commercial Figaro sensors to collect the data. DT and k-NN machine learning techniques are used with data pre-processing to enhance the classification accuracy.

Another study, related to feature selection, uses a Discrete Binary Particle Swarm Optimization (DBPSO) algorithm [5] to select features and classifies the data using Cosine Similarity (CS). DBPSO can identify a combination of features that remain insensitive to drift and that can collectively be applied to the classification of gases when drift is present. The classifier consists of two classifiers of the same kind with similar properties. DBPSO is run twice during the two-stage classification procedure to achieve the highest level of accuracy. Sensor drift, which also affects gas sensor classification on a very small scale, is addressed in [6] with a Recursive Feature Elimination (RFE) technique based on RF. The authors investigated an RF classifier coupled with the RFE algorithm as a method for compensating small-scale drift in gas sensors, in the context of classifying six gases. Continuing with sensor drift, which reduces classification accuracy, another approach in [7] develops a k-NN classifier and evaluates and compares it with three other classification algorithms. The dataset covers six different gases and was collected with Figaro sensors and 4 × 4 in-house tin-oxide sensors. Two datasets are defined and then examined by Fuzzy C-Means (FCM) with respect to classification. It was found that both classification models are very effective without removing or compressing the
features. The results show that both Quadratic Discriminant Analysis (QDA) and k-NN are generally effective in classifying data that use raw sensor data as the feature vector. Also addressing sensor drift, the approach proposed in [8] relies entirely on machine learning (ML) and does not require re-calibration of the system or replacement of the sensors. To reduce the response time, a multi-classifier tree approach is proposed, which uses the initial transient response of the sensors to train the classifier, addressing both long-term sensor drift and system response time. Moreover, that study illustrates that selecting features to compensate for gas sensor drift can yield strong discriminatory ability in the presence of long-term drift, and that a suitable combination of initial transient features is optimal for overcoming time-dependent drift.

In this paper, a gas classification approach is proposed that combines multiple ML algorithms, an approach known as ensemble learning. Researchers in predictive modelling, such as those working on regression and classification, are increasingly using ensemble learning models. A voting ensemble is a model composed of multiple ML models, aimed at achieving better performance than each of the individual models and at improving overall classification performance. Combining multiple ML models with similar performance on a prediction task may ultimately produce a model with enhanced accuracy and fewer errors. We propose a voting ensemble technique, and an experimental dataset is used to analyse the ability of the designed model to identify gases. Six classifiers (LR, NB, k-NN, SVM, DT, and RF) are used in the work, which is divided into two parts. In the first part, the ML models are used individually; in the second part, the best-performing ML models are combined in an ensemble to enhance the classification accuracy.

The remainder of this paper is organized as follows: Sect. 2 presents the experimental setup and the machine learning techniques used for gas classification. Classification results and analysis are described in Sect. 3, and finally Sect. 4 concludes the paper and outlines future work.
2 Experimental Setup

2.1 Data Collection

The dataset was acquired with a 4 × 4 in-house fabricated sensor array under controlled laboratory conditions, as described in [3]. Figure 1 shows an illustration of the experimental setup. The experiments were done to acquire data on the six target gases: Carbon Monoxide (CO), Ethanol (C2H6O), Carbon Dioxide (CO2), Propane (C3H8), Ammonia (NH3), and Hydrogen (H2). Table 1 shows the target gases with their concentrations. A Mass Flow Controller (MFC) measures and controls the amount of gas flowing from each cylinder. To obtain the desired concentration, the MFC sets the proportion of the target gas that is mixed with dry air. In each case, the sensor array is periodically exposed to the gases, which are released from the gas chamber at intervals.
Table 1. Target gases with concentration [3].

| Target gases | Concentration |
|---|---|
| CO | 30–1000 |
| C2H6O | 1–30 |
| CO2 | 25–200 |
| C3H8 | 500–10000 |
| NH3 | 30–300 |
| H2 | 25–800 |
To acquire the gas signature, we inject an unidentified target gas into the sensor array after passing dry air over the array. During this time interval, the resistance of each sensor in the wireless electronic nose is measured with the Analog to Digital Converter (ADC) integrated into the device. These values are then transferred to the connected device through RF communication. The acquired data is then used for a comparative analysis between classical ML and Committee ML for gas identification, performed under diverse environments. The use of different gases and sensor types provides diversity to the proposed approach, thereby validating the analysis results of classical ML versus Committee ML. Hence, the conclusions obtained from this research indicate the most suitable ML approach for gas identification.

[Figure 1: block diagram. Labels: Target Gases with Air Contamination; gas cylinders (Air, NH3, CO, CO2, C2H6O, C3H8), each with an MFC; Mass Flow Control Unit; Gas Chamber; Sensor Array; Data Acquisition and Control Unit; DAQ Control Unit; Classification.]
Fig. 1. Experimental setup for gas sensor data acquisition and classification [11].
2.2 Classification Algorithm

The process involves two steps. In the first step, a classification algorithm is fitted to the training data. Next, at least one test dataset is used to measure the model's performance, ensuring that the model was trained correctly. Classification involves assigning labels to unlabelled data, so that the data are categorised.

2.3 Ensemble Learning

Ensemble learning is more complex than using an individual model, because it must be decided which models perform better and which are best suited to the classification problem. In this work, six different models are first evaluated individually on the same dataset before being used in the ensemble technique. Only the models that perform better individually are selected and then used in the Voting ensemble to enhance the accuracy of the whole system. Also, because the dataset is small, no high computation power is needed. This approach, the Voting Classifier, is presented here to obtain a combined decision that is more accurate [9].

Voting Classifier (VC): Ensemble learning is a machine learning technique that combines multiple models to obtain better results. Voting is the simplest and most effective model in ensemble learning. Voting can be used for both classification and regression problems. It is built from two or more sub-models. All the sub-models in the voting classifier make predictions, which are then combined by taking the mean or the mode of the predictions to give the final outcome [10]. Moreover, the voting ensemble is encapsulated as a meta-model, that is, a model of models. The meta-model is designed to work with any existing set of trained machine learning models, and the models do not need to be aware that they are in
Fig. 2. Voting classifier
the ensemble. The goal of a voting ensemble is to use a set of fitted models to decide which predictions should be used for the predictive modelling task. It is also known to reduce the variance of the predictions relative to individual models, and hence the variance of the classification accuracy. The lower variance may, however, come with a lower mean performance for the ensemble, which raises a concern about accuracy. The voting ensemble used here is shown in Fig. 2, where six classifiers (LR, RF, SVM, DT, k-NN and NB) are combined on a single platform to enhance the classification accuracy. Furthermore, a voting ensemble is particularly useful for machine learning algorithms that are stochastic and produce a different final model each time they are trained on the same dataset. Neural networks trained with stochastic gradient descent are an example.

Stacking Classifier (SC): Stacked generalization is an ensemble machine learning algorithm that combines the predictions of more than one machine learning algorithm and learns how to combine them to make the best predictions. A stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models. By combining the strengths of several independent models on a classification task, stacking can produce predictions that are superior to those of any of the individual models [11]. An SC fitted with the base models used here and a meta-learner is illustrated in Fig. 3.
Fig. 3. Stacking classifier
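The two ways of combining sub-model predictions mentioned above, taking the mode or the mean, can be sketched in a few lines of plain Python. This is an illustrative sketch only: the paper does not state which library or voting mode was used, and the function names `hard_vote` and `soft_vote` and the example probabilities are assumptions introduced here.

```python
from collections import Counter


def hard_vote(predictions):
    """Mode of the class labels predicted by the sub-models for one sample."""
    counts = Counter(predictions)
    # Ties are resolved in favour of the label that appears first.
    return counts.most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the per-class probabilities of the sub-models, then pick the
    class with the largest mean probability."""
    classes = probabilities[0].keys()
    mean = {
        c: sum(p[c] for p in probabilities) / len(probabilities)
        for c in classes
    }
    return max(mean, key=mean.get)


# Hypothetical outputs of three sub-models for a single sample.
probs = [
    {"CO": 0.90, "NH3": 0.10},
    {"CO": 0.40, "NH3": 0.60},
    {"CO": 0.45, "NH3": 0.55},
]
labels = [max(p, key=p.get) for p in probs]  # each sub-model's own decision

print(hard_vote(labels))  # majority of labels
print(soft_vote(probs))   # largest mean probability
```

In this made-up case the two rules disagree: two of the three sub-models prefer NH3, while the averaged probabilities favour CO because one sub-model is very confident. Which rule suits a given dataset is an empirical question.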
3 Results and Discussion

The results of the experiments conducted in this research are discussed in detail in this section. To test the proposed method, we ran experiments on the Python platform. To perform gas classification with ML techniques, different dataset sizes are analysed. The data are divided into training and testing sets, and the behaviour of all the classifiers is then checked individually and in combination. Our proposed
technique and the other approaches are compared in Table 2 in terms of the accuracies obtained on different datasets.

Table 2. Comparison with existing techniques

| Ref. No | No. of gases | Target gases | Techniques used | Results % |
|---|---|---|---|---|
| [3] | 6 | CO, C2H6O, H2, C3H8, NH3, CO2 | DT + Pre-processing | 94.99 |
| [4] | 6 | CO, C2H6O, H2, C3H8, NH3, CO2 | KNN/DT + Pre-processing | 96.67 |
| [5] | 6 | C2H6O, C2H4, NH3, C2H4O, C3H6O, C7H8 | DBPSO | 86.01 |
| [6] | 6 | C2H6O, C2H4, NH3, C2H4O, C3H6O, C7H8 | RFE-RF | 90 |
| [7] | 6 | C3H8, Cl2, CO, CO2, SO2, NO2 | LDA, QDA, RDA, KNN, FCM | 95 |
| [8] | 6 | C2H6O, C2H4, NH3, C2H4O, C3H6O, C7H8 | Heuristic tree classification (SVM, KNN, DA, LR, NB, DT) | 87.34 |
| Our Proposed Technique | 6 | CO, C2H6O, H2, C3H8, NH3, CO2 | Voting classifier (LR, NB, k-NN, SVM, DT, RF) | 98.04 |

Using 3-fold cross-validation, we can improve the measurement of predictive accuracy. The process is repeated three times so that all samples are used in both training and testing. The accuracies obtained in both scenarios, individual classifiers and the ensemble method (Voting Classifier), are shown in Table 3. Our proposed technique is simple and straightforward: all the features are selected, with maximum information, to perform the classification. Table 3 shows the results of all the individual classifiers on Dataset 1 to Dataset 4; the four sets use 80%, 70%, 60% and 50% of the data for training, respectively. Among the individual classifiers, LR performs best because of regression. In Table 3, when 60% of the data is used for training (Dataset 3), it achieves 97.63% classification accuracy without any data pre-processing techniques. SVM is usually well suited to classification, but here it reaches 92.86% when 50% of the data is used for training (Dataset 4). For ensemble learning, VC is selected and applied to the same datasets, and the results can be seen in Table 3. VC gives better results than the individual classifiers. The reason for the best accuracy is that the best-performing classifiers are chosen and then combined in the VC. After a good combination, VC achieved an enhanced classification accuracy of 98.04%, which shows that ensemble techniques are
better than the individual classifiers and than data pre-processing techniques, which can increase the processing time. To validate our results, the dataset is then split into training, validation, and testing sets. Table 4 illustrates the results, which show that VC again performs better than the SC ensemble technique. With a 20% validation set, VC achieves an enhanced accuracy of 98.04%, compared with 81.00% for SC.

Table 3. Results generated by machine learning techniques on different training and testing datasets

| Classifiers | Dataset 1 (80+20) % | Dataset 2 (70+30) % | Dataset 3 (60+40) % | Dataset 4 (50+50) % |
|---|---|---|---|---|
| LR | 96.07 | 94.74 | 97.63 | 97.62 |
| NB | 74.51 | 77.63 | 77.22 | 60.32 |
| SVM | 94.12 | 92.10 | 95.05 | 92.86 |
| DT | 92.16 | 90.80 | 81.20 | 83.33 |
| RF | 90.20 | 88.16 | 89.40 | 91.23 |
| KNN | 96.08 | 89.47 | 93.10 | 91.30 |
| VC | 98.04 | 97.37 | 97.03 | 96.03 |

Table 4. Results with validation dataset

| Dataset (Train + Valid + Test) % | Voting classifier % | Stacking classifier % |
|---|---|---|
| 50+40+10 | 96.04 | 92.40 |
| 60+30+10 | 94.80 | 86.50 |
| 70+20+10 | 98.04 | 81.00 |
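The 3-fold cross-validation mentioned in this section can be illustrated with a small index splitter. This is a minimal sketch under assumptions: the paper does not give its splitting code, the helper name `k_fold_indices` is hypothetical, and in practice the sample indices would normally be shuffled (and often stratified by gas class) before splitting.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, test_idx) pairs so that every sample is used for
    testing exactly once and for training in the other k - 1 rounds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, k=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

The accuracy reported for a classifier would then be the mean of the accuracies measured on the three test folds.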
4 Conclusion

This paper presented an overview of current work in gas classification and analysis. Based on the previous work, it can be concluded that ensemble methods such as the VC and SC produced the best results. In the proposed approach, several base learners with high diversity are selected to form an ensemble classifier by adopting an ensemble learning strategy. Alternatively, a neural network is an excellent choice for large datasets because of its high accuracy. Neural networks can also be employed with small datasets and pre-training when big datasets are impossible to obtain. It is difficult to have more input variables with small datasets than those for image recognition. A small dataset is used in this work, and if a neural network is considered, special training
methods are required for it to perform better than the machine learning techniques. In contrast, a voting ensemble is used for this small dataset with respect to classification accuracy, and the performance results are listed in Table 3, where VC performed better than the others because this technique uses majority voting for the output. Additionally, when the voting classifier is used with a validation dataset, the results are better than those from the stacking classifier. Researchers have reported a wide range of techniques for gas classification, but an automated analysis that addresses all the challenges related to the small datasets encountered during classification remains necessary.

Acknowledgment. We would like to thank Prof. Amine Bermak for providing the dataset.
References

1. Marco, S., Gutierrez-Galvez, A.: Signal and data processing for machine olfaction and chemical sensing: a review. IEEE Sens. J. 12(11), 3189–3214 (2012)
2. Shi, M., Bermak, A., Chandrasekaran, S., Amira, A., Brahim-Belhouari, S.: A committee machine gas identification system based on dynamically reconfigurable FPGA. IEEE Sens. J. 8(4), 403–414 (2008)
3. Akbar, M.A., et al.: An empirical study for PCA- and LDA-based feature reduction for gas identification. IEEE Sens. J. 16(14), 5734–5746 (2016)
4. Ali, A.A.S., et al.: Embedded platform for gas applications using hardware/software co-design and RFID. IEEE Sens. J. 18(11), 4633–4642 (2018)
5. Rehman, A.U., Bermak, A.: Drift-insensitive features for learning artificial olfaction in e-nose system. IEEE Sens. J. 18(17), 7173–7182 (2018)
6. Ijaz, M., Rehman, A.U., Hamdi, M., Bermak, A.: Recursive feature elimination with random forest classifier for compensation of small scale drift in gas sensors. In: 2020 IEEE International Symposium on Circuits and Systems (ISCAS), October, pp. 1–5. IEEE (2020)
7. Rehman, A.U., Bermak, A.: Discriminant analysis of industrial gases for electronic nose applications. In: 2018 IEEE International Conference on Computational Intelligence and Virtual Environments for Measurement Systems and Applications (CIVEMSA), June, pp. 1–5. IEEE (2018)
8. Rehman, A.U., Belhaouari, S.B., Ijaz, M., Bermak, A., Hamdi, M.: Multi-classifier tree with transient features for drift compensation in electronic nose. IEEE Sens. J. 21(5), 6564–6574 (2020)
9. Shi, M., Bermak, A., Belhouari, S.B., Chan, P.C.: Gas identification based on committee machine for microelectronic gas sensor. IEEE Trans. Instrum. Meas. 55(5), 1786–1793 (2006)
10. Kabari, L.G., Onwuka, U.C.: Comparison of bagging and voting ensemble machine learning algorithm as a classifier. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 9(3), 19–23 (2019)
11. Hatami, N., Ebrahimpour, R.: Combining multiple classifiers: diversify with boosting and combining by stacking. Int. J. Comput. Sci. Netw. Security 7(1), 127–131 (2007)
Neural Networks with Superexpressive Activations and Integer Weights

Aleksandr Beknazaryan(✉)

Institute of Environmental and Agricultural Biology (X-BIO), University of Tyumen, Volodarskogo 6, 625003 Tyumen, Russia
[email protected]

Abstract. An example of an activation function $\sigma$ is given such that networks with activations $\{\sigma, \lfloor\cdot\rfloor\}$, integer weights and a fixed architecture depending only on the input dimension $d$ approximate continuous functions on $[0, 1]^d$. The range of integer weights required for $\varepsilon$-approximation of Hölder continuous functions is derived, which, together with our discrete choice of weights, allows us to obtain the number of networks needed to attain a given approximation rate. Combining this number with the obtained speed of approximation and applying an oracle inequality, we get a prediction rate $n^{-\frac{2\beta}{2\beta+d}} \log^2 n$ for neural network regression estimation of an unknown $\beta$-Hölder continuous function with given $n$ samples. Thus, up to a logarithmic factor $\log^2 n$, the attained rate coincides with the minimax estimation rate for the prediction error of $\beta$-smooth functions. As the network sizes are fixed and their weights are integers, the constructed networks are not only easily encodable but they also reduce the problem of finding the best predictor to a simple procedure of minimization over the finite set of candidates.

Keywords: Neural networks · Function approximation · Entropy · Nonparametric regression

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 445–451, 2022. https://doi.org/10.1007/978-3-031-10464-0_30

1 Introduction

A family of activation functions $\mathcal{A}$ is called superexpressive if for every input dimension $d$ there are networks of fixed architecture and with activations all belonging to $\mathcal{A}$ that arbitrarily well approximate functions from $C([0, 1]^d)$ in the uniform norm. Several examples of simple superexpressive families of activation functions are given in [1]. In particular, in [1], Theorem 3, it is shown that the families $\{\sin, \arcsin\}$ and $\{\sigma_1, \lfloor\cdot\rfloor\}$ are superexpressive, where $\sigma_1$ is a real analytic function which is non-polynomial on some interval. Although it fixes the network architecture, the superexpressiveness of activations alone does not tell much about the complexity of approximant classes of networks, which in this case is associated with possible choices of network weights. More precisely, after fixing the network sizes we still want to know how many such networks are needed to approximate any function of given smoothness. The knowledge of this number, the logarithm of which is called an entropy of a given collection of networks, is especially crucial for deriving prediction rates of an unknown regression function with neural networks. Even if we bound the weights required for attaining a given approximation error, the estimation of entropy of network classes may still need the activations to be Lipschitz continuous (see, e.g., [2,3]). Thus, in the present work, to bound the entropy of approximant classes of networks, we not only bound the network weights but also discretize them. Once this is done, we can directly count the number of networks needed to attain a given approximation error. We will therefore consider networks with weights from $\mathbb{Z}$, and otherwise our construction will be similar to the one presented in the proof of Theorem 3 of [1]. That proof is based on the density of irrational windings on the torus, which in our case is replaced by an effective Kronecker's theorem. The latter not only assures that integer multipliers are enough to densely cover the torus but also bounds the integers needed to attain a given covering radius. We will consider the family of activation functions $\mathcal{A} = \{\sigma, \lfloor\cdot\rfloor\}$, where the role of the activation $\sigma$ is to guarantee that the conditions of Kronecker's theorem are satisfied and that it gives a small range for the integer multipliers. Having this range, we then bound the entropy of approximant networks and use this bound to get, for $\beta$-Hölder continuous regression functions, a convergence rate of order $n^{-\frac{2\beta}{2\beta+d}} \log^2 n$, where $n$ is the number of observations. Note that our approach is also comparable with the one given in [4], where approximations by deep networks with weights $\{0, \pm\frac{1}{2}, \pm 1, 2\}$ are considered: in one case we fix a finite set of weights and adjust the network architecture, and in the other case we fix the network architecture and adjust the integer weights to attain a certain approximation rate.
2 An Effective Kronecker's Theorem

In this part we present an effective version of Kronecker's Theorem given in [5]. To state the theorem we will need the following definitions of absolute values, places and heights on number fields.

Definition 1. An absolute value on a number field $K$ is a function $|\cdot|_\nu : K \to \mathbb{R}_+$ satisfying
1. $|x|_\nu = 0$ if and only if $x = 0$;
2. $|xy|_\nu = |x|_\nu |y|_\nu$ for all $x, y \in K$;
3. $|x + y|_\nu \le |x|_\nu + |y|_\nu$ for all $x, y \in K$.
If the third condition above is replaced by a stronger condition $|x + y|_\nu \le \max\{|x|_\nu, |y|_\nu\}$, then the absolute value $|\cdot|_\nu$ is called non-archimedean and otherwise it is called archimedean.

Definition 2. Two absolute values $|\cdot|_1$ and $|\cdot|_2$ on $K$ are equivalent if there exists some $\lambda > 0$ such that $|\cdot|_1 = |\cdot|_2^\lambda$. An equivalence class of absolute values on $K$ is called a place of $K$. The collection of all places of $K$ is denoted by $M_K$.
Let $\overline{\mathbb{Q}}$ denote the field of algebraic numbers.

Definition 3. For $\alpha = (\alpha_1, ..., \alpha_N) \in \overline{\mathbb{Q}}^N \setminus \{0\}$ let $K$ be an extension of the field of rational numbers $\mathbb{Q}$ of degree $[K : \mathbb{Q}]$ such that $\alpha \in K^N$. The number
\[
H(\alpha) = \prod_{\nu \in M_K} \max(|\alpha_1|_\nu, ..., |\alpha_N|_\nu)^{1/[K:\mathbb{Q}]}
\]
is called an absolute height of $\alpha$.

It can be shown that the absolute height is independent of the choice of $K$ (see [5] for proof and more details regarding the above definitions). For $\alpha = (\alpha_1, ..., \alpha_N) \in \overline{\mathbb{Q}}^N$ let $r := [\mathbb{Q}(\alpha_1, ..., \alpha_N) : \mathbb{Q}]$ be the degree of the extension field over the rationals generated by $\alpha$. For $\varepsilon > 0$ denote
\[
Q(\alpha, \varepsilon) := r (N+1)^{2r} \big(H(1, \alpha_1, ..., \alpha_N)\big)^{r} \left(\frac{1}{\varepsilon}\right)^{r-1}.
\]
The following is a simplified version of Theorem 3.11 from [5]:

Theorem 1. Let $\alpha = (\alpha_1, ..., \alpha_N)$ be a vector with algebraic and rationally independent coordinates, that is, $\{z \in \mathbb{Q}^N : z \cdot \alpha \in \mathbb{Q}\} = \{0\}$. Then for every $\varepsilon > 0$ and every $(b_1, ..., b_N) \in [0, 1)^N$ there is $q \in \mathbb{Z}$ with $|q| \le Q(\alpha, \varepsilon)$ such that
\[
|\varphi(q\alpha_i) - b_i| \le \varepsilon, \quad i = 1, ..., N,
\]
where $\varphi(x) = x - \lfloor x \rfloor$.

As the choice of the activation function $\sigma$ in the next part suggests, we will be interested in application of the above theorem to the case $\alpha = (2^{1/(N+1)}, 2^{2/(N+1)}, ..., 2^{N/(N+1)})$. In this case we have that $r = N + 1$ and, therefore, there are at most $N + 1$ archimedean places on $\mathbb{Q}(\alpha)$ (see [6], Subsect. 1.3.8). Also, as the non-archimedean absolute values of integers are in $[0, 1]$ ([7], Lemma 6A), then
\[
\begin{aligned}
H\big(1, 2^{1/(N+1)}, 2^{2/(N+1)}, ..., 2^{N/(N+1)}\big)^{N+1}
&= \prod_{\nu \in M_{\mathbb{Q}(\alpha)}} \max\big(1, |2^{1/(N+1)}|_\nu, ..., |2^{N/(N+1)}|_\nu\big) \\
&= \prod_{\nu \in M_{\mathbb{Q}(\alpha)}} \max\big(1, |2|_\nu^{1/(N+1)}, ..., |2|_\nu^{N/(N+1)}\big) \\
&= \prod_{\nu \in M_{\mathbb{Q}(\alpha)}} \max\big(1, |2|_\nu^{N/(N+1)}\big) \\
&= \prod_{\substack{\nu \in M_{\mathbb{Q}(\alpha)} \\ \nu \text{ archimedean}}} \max\big(1, |2|_\nu^{N/(N+1)}\big) \le 2^N.
\end{aligned}
\]
We thus get the following

Corollary 1. For every $\varepsilon > 0$ and every $(b_1, ..., b_N) \in [0, 1)^N$ there is $q \in \mathbb{Z}$ with
\[
|q| \le (N+1)^{2N+3} \left(\frac{2}{\varepsilon}\right)^{N}
\]
such that
\[
|\varphi(q 2^{i/(N+1)}) - b_i| \le \varepsilon, \quad i = 1, ..., N.
\]
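Corollary 1 can be checked numerically for a small case. The following is a minimal sketch, assuming double-precision arithmetic is accurate enough for the multipliers involved; the helper names `frac`, `corollary_bound` and `find_multiplier`, and the targets `b`, are illustrative and not part of the paper.

```python
import math

def frac(x: float) -> float:
    """phi(x) = x - floor(x), the fractional part used in Theorem 1."""
    return x - math.floor(x)

def corollary_bound(N: int, eps: float) -> int:
    """Bound (N + 1)^(2N + 3) * (2 / eps)^N on |q| from Corollary 1 (rounded up)."""
    return math.ceil((N + 1) ** (2 * N + 3) * (2.0 / eps) ** N)

def find_multiplier(b, eps):
    """Brute-force search for q with |phi(q * 2^(i/(N+1))) - b_i| <= eps for all i."""
    N = len(b)
    alphas = [2.0 ** (i / (N + 1)) for i in range(1, N + 1)]
    for k in range(corollary_bound(N, eps) + 1):
        for q in (k, -k):  # the corollary allows both signs
            if all(abs(frac(q * a) - t) <= eps for a, t in zip(alphas, b)):
                return q
    return None

b = [0.3, 0.7]   # arbitrary targets in [0, 1)^2 (illustrative choice)
eps = 0.1
q = find_multiplier(b, eps)
print("q =", q, " bound =", corollary_bound(len(b), eps))
```

In practice the search stops far below the bound, which is a worst-case guarantee rather than a typical value.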
3 Network Selection and Approximation
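For a set of $p$ functions $g^1, ..., g^p : \mathbb{R} \to \mathbb{R}$ and two sets of $p$ real numbers $\{v_1, ..., v_p\}$ and $\{y_1, ..., y_p\}$ define
\[
\begin{pmatrix} g^1_{v_1} \\ \vdots \\ g^p_{v_p} \end{pmatrix}
\begin{pmatrix} y_1 \\ \vdots \\ y_p \end{pmatrix}
=
\begin{pmatrix} g^1(y_1 + v_1) \\ \vdots \\ g^p(y_p + v_p) \end{pmatrix}.
\]
For $K, M \in \mathbb{N}$ and $q \in \mathbb{Z}$ denote $k_M := ((M+1)^d - 1)(M+1)^d/2$ and consider a feedforward neural network $Z^d_{K,M,q}$ on $[0,1]^d$ of the form
\[
Z^d_{K,M,q}(x) = Z^d_{K,M,q}(x_1, x_2, ..., x_d)
= (2Kq, -2K)
\begin{pmatrix} \sigma_{k_M} \\ \lfloor\cdot\rfloor_0 \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & q \end{pmatrix}
\begin{pmatrix} \lfloor\cdot\rfloor_0 \\ \sigma_{k_M} \end{pmatrix}
\begin{pmatrix} 1 \\ 1 \end{pmatrix}
\lfloor\cdot\rfloor_1
\big(1, M+1, ..., (M+1)^{d-1}\big)
\begin{pmatrix} \lfloor\cdot\rfloor_0 \\ \lfloor\cdot\rfloor_0 \\ \vdots \\ \lfloor\cdot\rfloor_0 \end{pmatrix}
(M \cdot I_d)
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}
- K,
\]
where $I_d \in \mathbb{R}^{d \times d}$ is an identity matrix, $\lfloor\cdot\rfloor$ is the floor function and $\sigma : \mathbb{R} \to \mathbb{R}$ is defined as
\[
\sigma(x) =
\begin{cases}
2^{\frac{x - (m-1)m/2}{m+1}}, & x, m \in \mathbb{N},\ \frac{(m-1)m}{2} < x \le \frac{m(m+1)}{2}, \\
0, & x \in \mathbb{R} \setminus \mathbb{N}.
\end{cases}
\]
In fact, the values of $\sigma$ on $\mathbb{R} \setminus \mathbb{N}$ will not play a role and can thus be defined arbitrarily. Here are the first few nonzero values of $\sigma$: $\sigma(1) = 2^{1/2}$; $\sigma(2) = 2^{1/3}$, $\sigma(3) = 2^{2/3}$; $\sigma(4) = 2^{1/4}$, $\sigma(5) = 2^{2/4}$, $\sigma(6) = 2^{3/4}$. Note that analytically we can write the network $Z^d_{K,M,q}$ as
\[
Z^d_{K,M,q}(x) = 2K\varphi\big(q\sigma(k_M + g_M(x))\big) - K,
\]
where $\varphi(x) = x - \lfloor x \rfloor$ and
\[
g_M(x) = 1 + \sum_{k=1}^{d} (M+1)^{k-1} \lfloor M x_k \rfloor.
\]
For $Q \in \mathbb{N}$ define a set of networks
\[
Z^d_{K,M}(Q) := \{Z^d_{K,M,q} : q \in \mathbb{Z},\ |q| \le Q\}.
\]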
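The analytic form of the network is short enough to evaluate directly. The sketch below is one possible implementation, assuming the natural numbers start at 1 and setting $\sigma$ to 0 off the naturals (the text allows any value there); the function names `sigma`, `g` and `network` are illustrative.

```python
import math

def sigma(x):
    """sigma(x) = 2^((x - (m-1)m/2) / (m+1)) for natural x in ((m-1)m/2, m(m+1)/2], else 0."""
    if x != int(x) or x < 1:
        return 0.0
    x = int(x)
    m = 1
    while m * (m + 1) // 2 < x:   # smallest m with x <= m(m+1)/2
        m += 1
    return 2.0 ** ((x - (m - 1) * m // 2) / (m + 1))

def g(M, x):
    """g_M(x) = 1 + sum_{k=1..d} (M+1)^(k-1) * floor(M * x_k)."""
    return 1 + sum((M + 1) ** k * math.floor(M * xk) for k, xk in enumerate(x))

def network(K, M, q, x):
    """Z^d_{K,M,q}(x) = 2K * phi(q * sigma(k_M + g_M(x))) - K, with phi the fractional part."""
    N = (M + 1) ** len(x)
    k_M = (N - 1) * N // 2
    y = q * sigma(k_M + g(M, x))
    return 2 * K * (y - math.floor(y)) - K

print([round(sigma(t), 4) for t in range(1, 7)])
print(network(1, 2, 7, (0.3, 0.8)))
```

Only the integer `q` changes between members of the class; `K`, `M` and the architecture stay fixed.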
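For $\beta, F \in \mathbb{R}_+$ and $K \in \mathbb{N}$ define
\[
\mathcal{H}^\beta_d(F, K) = \big\{ f : [0,1]^d \to \mathbb{R} : \|f\|_\infty < K \text{ and } |f(x) - f(y)| \le F |x - y|_\infty^\beta \text{ for all } x, y \in [0,1]^d \big\}.
\]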
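We have the following

Theorem 2. For any $\varepsilon > 0$ and any $f \in \mathcal{H}^\beta_d(F, K)$ there is a network $Z^d_{K,M,q} \in Z^d_{K,M}(Q)$ with $M = \lceil (2F/\varepsilon)^{1/\beta} \rceil$ and
\[
Q = \left\lceil (N+1)^{2N+3} \left(\frac{8K}{\varepsilon}\right)^{N} \right\rceil,
\]
where $N = (M+1)^d$, such that $\|Z^d_{K,M,q}(x) - f(x)\|_\infty \le \varepsilon$.

Proof. Take any $\varepsilon > 0$ and let $M = \lceil (2F/\varepsilon)^{1/\beta} \rceil$. Following the works [8] and [1], we consider the function
\[
g_M(x) = g_M(x_1, ..., x_d) = 1 + \sum_{k=1}^{d} (M+1)^{k-1} \lfloor M x_k \rfloor
\]
from $[0,1]^d$ to $[1, (M+1)^d] \cap \mathbb{Z}$, which maps each of the $(M+1)^d$ sets
\[
I_{M,m} := \Big( \Big[\frac{m_1}{M}, \frac{m_1+1}{M}\Big) \times ... \times \Big[\frac{m_d}{M}, \frac{m_d+1}{M}\Big) \Big) \cap [0,1]^d, \quad m = (m_1, ..., m_d) \in [0, M]^d \cap \mathbb{Z}^d,
\]
to a unique integer from $[1, (M+1)^d]$. Denote $N = (M+1)^d$ and let $J_1, ..., J_N$ be the enumeration of the sets $I_{M,m}$, $m \in [0, M]^d \cap \mathbb{Z}^d$, such that $g_M(x) = i$ for $x \in J_i$, $i = 1, ..., N$. Take any set of $N$ points $y_i \in J_i$ and denote
\[
b_i = \frac{f(y_i) + K}{2K}, \quad i = 1, ..., N,
\]
so that $(b_1, ..., b_N) \in [0, 1)^N$. Let $k_M := ((M+1)^d - 1)(M+1)^d/2 = \frac{(N-1)N}{2}$. As $\sigma(k_M + i) = 2^{i/(N+1)}$, then, by Corollary 1, there exists $q \in \mathbb{Z}$ with
\[
|q| \le (N+1)^{2N+3} \left(\frac{8K}{\varepsilon}\right)^{N}
\]
such that
\[
|\varphi(q\sigma(k_M + i)) - b_i| \le \frac{\varepsilon}{4K}, \quad i = 1, ..., N.
\]
Thus,
\[
|2K\varphi(q\sigma(k_M + i)) - K - f(y_i)| \le \frac{\varepsilon}{2}, \quad i = 1, ..., N,
\]
and, therefore, for the network
\[
Z^d_{K,M,q}(x) = 2K\varphi\big(q\sigma(k_M + g_M(x))\big) - K
\]
we have that for $x \in J_i$
\[
\begin{aligned}
|Z^d_{K,M,q}(x) - f(x)| &= |2K\varphi(q\sigma(k_M + g_M(x))) - K - f(x)| \\
&\le |2K\varphi(q\sigma(k_M + i)) - K - f(y_i)| + |f(y_i) - f(x)| \\
&\le \frac{\varepsilon}{2} + F|y_i - x|_\infty^\beta \le \varepsilon.
\end{aligned}
\]
As $[0,1]^d = \cup_{i=1}^{N} J_i$, then $\|Z^d_{K,M,q}(x) - f(x)\|_\infty \le \varepsilon$.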
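The steps of the proof can be traced on a one-dimensional toy case. This is a sketch under stated assumptions: $d = 1$, the target $f(x) = x/2$ (so $F = 1/2$, $\beta = 1$, $K = 1$), the points $y_i$ taken as left endpoints of the cells, and a search cap `max_q` that is an illustrative choice far below the bound of Theorem 2. It uses $\sigma(k_M + i) = 2^{i/(N+1)}$ directly instead of evaluating $\sigma$.

```python
import math

def frac(x):
    """phi(x) = x - floor(x)."""
    return x - math.floor(x)

def fit_q(f, K, M, eps, max_q=10**6):
    """Search q with |phi(q * 2^(i/(N+1))) - b_i| <= eps / (4K) for i = 1..N (d = 1)."""
    N = M + 1
    # y_i = (i - 1) / M is the left endpoint of cell J_i; b_i = (f(y_i) + K) / (2K)
    b = [(f((i - 1) / M) + K) / (2 * K) for i in range(1, N + 1)]
    alphas = [2.0 ** (i / (N + 1)) for i in range(1, N + 1)]  # sigma(k_M + i)
    tol = eps / (4 * K)
    for k in range(max_q + 1):
        for q in (k, -k):
            if all(abs(frac(q * a) - t) <= tol for a, t in zip(alphas, b)):
                return q
    return None

def network(K, M, q, x):
    """Z^1_{K,M,q}(x) = 2K * phi(q * sigma(k_M + g_M(x))) - K with g_M(x) = 1 + floor(M x)."""
    N = M + 1
    i = 1 + math.floor(M * x)
    return 2 * K * frac(q * 2.0 ** (i / (N + 1))) - K

F, beta, K, eps = 0.5, 1.0, 1, 0.4
f = lambda x: 0.5 * x
M = math.ceil((2 * F / eps) ** (1 / beta))
q = fit_q(f, K, M, eps)
err = max(abs(network(K, M, q, j / 1000) - f(j / 1000)) for j in range(1001))
print("M =", M, " q =", q, " max error on grid =", err)
```

The resulting network is piecewise constant on the cells $J_i$, which is why the Hölder term $F|y_i - x|_\infty^\beta$ appears in the error bound.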
Remark 1. Note that if we only assume that a function $f \in C([0,1]^d)$ is continuous on $[0,1]^d$, then applying the technique presented in the proof above and using uniform continuity of $f$ we can approximate $f$ by networks with integer weights and activations $\{\sigma, \lfloor\cdot\rfloor\}$. This means that the family $\{\sigma, \lfloor\cdot\rfloor\}$ is indeed superexpressive.
4 Application to Nonparametric Regression

Let $f_0 \in \mathcal{H}^\beta_d(F, K)$ be an unknown regression function and let $(X_i, Y_i)$, $i = 1, ..., n$, be $n$ observed iid pairs following a regression model
\[
Y_i = f_0(X_i) + \epsilon_i,
\]
where the standard normal noise variables $\epsilon_i$ are assumed to be independent of $X_i$. Our goal is to choose appropriate $M_n, Q_n \in \mathbb{N}$ so that the empirical risk minimizer
\[
\hat{Z}_n \in \operatorname*{arg\,min}_{Z \in Z^d_{K,M_n}(Q_n)} \sum_{i=1}^{n} (Y_i - Z(X_i))^2
\]
can well approximate $f_0$. The accuracy of approximation of $f_0$ by the estimator $\hat{Z}_n$ is measured by the prediction error
\[
R(\hat{Z}_n, f_0) = E_{f_0}\big[(\hat{Z}_n(X) - f_0(X))^2\big],
\]
where $X \overset{D}{=} X_1$ is independent of the sample $(X_i, Y_i)$ and the subscript $f_0$ indicates that the expectation is taken over the training data generated by the regression model. Choose
\[
M_n = \big\lceil (2F)^{\frac{1}{\beta}} n^{\frac{1}{2\beta+d}} \big\rceil
\quad \text{and} \quad
Q_n = \Big\lceil (N_n+1)^{2N_n+3} \big(8K n^{\frac{\beta}{2\beta+d}}\big)^{N_n} \Big\rceil,
\]
where $N_n := (M_n+1)^d$. From [2], Lemma 4, it follows that for any $\delta \in (0, 1]$
\[
R(\hat{Z}_n, f_0) \le 4 \bigg[ \inf_{Z \in Z^d_{K,M_n}(Q_n)} E[(Z(X) - f_0(X))^2] + K^2\, \frac{18 \log_2 \mathcal{N}(\delta, Z^d_{K,M_n}(Q_n), \|\cdot\|_\infty) + 72}{n} + 32\delta K \bigg], \tag{1}
\]
where $\mathcal{N}(\delta, Z^d_{K,M_n}(Q_n), \|\cdot\|_\infty)$ is the covering number of $Z^d_{K,M_n}(Q_n)$ of radius $\delta$ taken with respect to the $\|\cdot\|_\infty$ distance of functions on $[0,1]^d$. As there are only $2Q_n + 1$ networks in $Z^d_{K,M_n}(Q_n)$, then for any $\delta > 0$
\[
\log_2 \mathcal{N}(\delta, Z^d_{K,M_n}(Q_n), \|\cdot\|_\infty) \le \log_2 \mathcal{N}(0, Z^d_{K,M_n}(Q_n), \|\cdot\|_\infty) \le \log_2(2Q_n + 1) \le C' n^{\frac{d}{2\beta+d}} \log_2 n,
\]
for some constant $C' = C'(\beta, d, F, K)$. As $f_0 \in \mathcal{H}^\beta_d(F, K)$, then, applying Theorem 2 with $\varepsilon = n^{-\frac{\beta}{2\beta+d}}$, we get that
\[
\inf_{Z \in Z^d_{K,M_n}(Q_n)} E[(Z(X) - f_0(X))^2] \le n^{\frac{-2\beta}{2\beta+d}}.
\]
Thus, from (1) we get the existence of a constant $C = C(\beta, d, F, K)$ such that
\[
R(\hat{Z}_n, f_0) \le C n^{\frac{-2\beta}{2\beta+d}} \log_2 n,
\]
which coincides, up to a logarithmic factor, with the minimax estimation rate $n^{\frac{-2\beta}{2\beta+d}}$ of the prediction error for $\beta$-smooth functions.
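To get a feel for the sizes involved, the choices of $M_n$ and $Q_n$ can be evaluated for concrete values. The sketch below is illustrative only: the function name `class_size` and the sample values of $n$, $\beta$, $d$, $F$, $K$ are assumptions, the rounding of $Q_n$ is ignored, and $\log_2(2Q_n + 1)$ is approximated by $\log_2 Q_n + 1$, which is accurate for large $Q_n$.

```python
import math

def class_size(n, beta, d, F, K):
    """Return (M_n, N_n, approx. log2(2 Q_n + 1)) for the choices made in Section 4."""
    M = math.ceil((2 * F) ** (1 / beta) * n ** (1 / (2 * beta + d)))
    N = (M + 1) ** d
    # log2 Q_n = (2 N_n + 3) log2(N_n + 1) + N_n log2(8 K n^(beta / (2 beta + d)))
    log2_Q = (2 * N + 3) * math.log2(N + 1) \
        + N * math.log2(8 * K * n ** (beta / (2 * beta + d)))
    return M, N, log2_Q + 1.0   # log2(2Q + 1) ~ log2(Q) + 1 for large Q

n, beta, d, F, K = 500, 1.0, 1, 1.0, 1
M_n, N_n, entropy = class_size(n, beta, d, F, K)
scale = n ** (d / (2 * beta + d)) * math.log2(n)   # n^(d/(2 beta + d)) * log2(n)
print(M_n, N_n, round(entropy, 1), round(entropy / scale, 2))
```

The ratio printed last is an empirical stand-in for the constant $C'$ at this particular $n$; it is not a value derived in the paper.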
5 Conclusion

We presented an approximation of continuous functions by networks with fixed architecture and integer weights and derived the range of weights required for approximating Hölder continuous functions. While fixing the architecture and choosing only integer weights makes the networks easier to implement, bounding their range also allows us to count the exact number of networks required to attain a given approximation rate. Our construction is based on a purely number-theoretic effective Kronecker's theorem. Remarkably, when applied to the statistical problem of regression estimation, bounds obtained from Kronecker's theorem lead, up to a logarithmic factor, to the minimax convergence rate of the prediction error.
References
1. Yarotsky, D.: Elementary superexpressive activations. In: Proceedings of the 38th International Conference on Machine Learning, PMLR, vol. 139, pp. 11932–11940 (2021)
2. Schmidt-Hieber, J.: Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48(4), 1875–1897 (2020)
3. Taheri, M., Xie, F., Lederer, J.: Statistical guarantees for regularized neural networks. Neural Netw. 142, 148–161 (2021)
4. Beknazaryan, A.: Function approximation by deep neural networks with parameters {0, ±1/2, ±1, 2}. J. Stat. Theor. Pract. 16, 1–14 (2022). https://doi.org/10.1007/s42519-021-00229-5
5. Vorselen, T.: On Kronecker's theorem over the adèles. Master's thesis, Universiteit Leiden (2010)
6. Bombieri, E., Gubler, W.: Heights in Diophantine Geometry. New Mathematical Monographs (4). Cambridge University Press (2006)
7. Schmidt, W.M.: Diophantine Approximations and Diophantine Equations. Lecture Notes in Mathematics, Springer, Heidelberg (1991). https://doi.org/10.1007/BFb0098246
8. Shen, Z., Yang, H., Zhang, S.: Neural network approximation: three hidden layers are enough. Neural Netw. 141, 160–173 (2021)
Mask Compliance Detection on Facial Images Lorenzo Garbagna(B) , Holly Burrows, Lakshmi Babu-Saheer, and Javad Zarrin Anglia Ruskin University, Cambridge, UK [email protected]
Abstract. The COVID-19 pandemic has significantly changed our ways of living. Government authorities around the world have come up with safety regulations to help reduce the spread of this deadly virus. Covering the mouth and nose using facial masks is identified as an effective step to suppress the transmission of the infected droplets from one human to the other. While the usage of facial masks has been a common practice in several Asian societies, this practice is fairly new to the rest of the world including modern western societies. Hence, it can be noticed that the facial masks are either worn incorrectly (or sometimes not worn) by a significant number of people. Given the fact that the majority of the world population is only getting accustomed to this practice, it would be essential for surveillance systems to monitor if the general population is abiding by the regulatory standards of correctly wearing a facial mask. This paper uses deep learning algorithms to track and classify face masks. The research proposes a mask detection model based on Convolutional Neural Networks to discern between a correct usage of facial masks and its incorrect usages or even lack of it. Different architectures have been tested (even on real-time video streams) to obtain the best accuracy of 98.9% over four classes. These four classes include correctly worn, incorrectly worn on the chin, incorrectly worn on mouth and chin, and not wearing a mask at all. The novelty of this work is in the detection of the type of inaccuracy in wearing the face mask rather than just detecting the presence or absence of the same.
Keywords: Deep learning · Computer vision · Face masks detection

1 Introduction
The COVID-19 pandemic continues to challenge countries and governments to control the spread of the coronavirus. As one of the important control measures, the general public is advised to cover their mouth and nose using face masks to stop the spread of infectious droplets. The specified places where this is now a legal requirement may vary between nations. However, the general consensus in most of Europe is that face masks should be worn in smaller public spaces, any crowded outdoor settings, and all indoor public buildings [7]. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 452–468, 2022. https://doi.org/10.1007/978-3-031-10464-0_31
A few nations around the globe are more accustomed to this practice; the use of facial masks in Chinese and Japanese populations can be traced back to the beginning of the 20th century [18], implemented as protection from seasonal flu and as a societal ethical component. However, in countries where this practice is in its infancy, challenges arise where the masks are worn incorrectly, thus reducing their efficacy. This brings about further challenges of ensuring face masks are worn correctly where it is deemed necessary to effectively reduce the spread of the disease. Therefore, a solution is required to identify incorrect usage of masks in public spaces. It would be inefficient, if not impossible, for security professionals alone to monitor this, especially among large crowds or where doing so might pave the way for confrontations. Considering this, technical solutions for automatically monitoring the correct usage of masks would be advantageous in reducing the spread of coronavirus. The goal of this study is to develop a system that is able to detect and classify whether a person is wearing a mask correctly. The system will also detect the type of wrong usage if the mask is not worn correctly. The system is implemented using Convolutional Neural Networks (CNN) to identify the states in which a mask is found on a person's face. To demonstrate the application of this system in a real-world scenario, the model was also tested on a live video stream to detect the changes in mask states in real-time. This automatic real-time monitoring system could be practically implemented in airports, supermarkets, workplaces and schools. The main novelty of this research lies in the ability to classify images into different categories of incorrectly worn masks, as opposed to a binary classification of a mask being present or not.
2 Related Work
This section will look into the related research in the domain of face and face mask detection. Loey et al. [11] aimed to annotate medical face masks using real-life images. The research implements a transfer learning approach, developing a model comprising two parts. The initial phase employs feature extraction based upon ResNet-50, and the second detects the presence of a medical face mask, based on the state-of-the-art object detection system YOLO-v2. The dataset used a total of 1415 images, as an amalgamation of Medical Masks dataset (MMD) and Face Mask dataset (FMD), obtained from a Kaggle challenge. The size of the dataset is relatively small, mainly due to the scarcity of quality data in this new domain. This model was only able to achieve an accuracy of 81% and struggled to discern between surgical face masks and other types of masks. The Face Mask dataset from Kaggle was also utilised by Das et al. [5] and this model was able to achieve a 94.8% accuracy on validation by using a cascade classifier to identify faces from the images and load them into the model. Another study was conducted by Loey et al. [12], using feature extraction on a much larger dataset (11570 images) to feed different models and compare performances: decision trees, SVM and an ensemble method constituting k-NN, linear regression and logistic regression. This study was able to reach an accuracy of 99% on a simulated face mask dataset and on a real-world face mask dataset, with around 5000 images.
Comparatively, Mohan et al. [13] built a CNN model that achieves 99.81% accuracy on a binary classification task, where the classes represent wearing a mask, and not wearing a mask. The results obtained here outperformed the SqueezeNet model. The aim of this work entailed building a model that could be deployed at resource constrained endpoints. The final model had 128,193 trainable parameters and images were passed into the model at a size of 32 × 32. Intricate data augmentation was used to increase the size of the dataset to 131,055 images. This work is demonstrative of CNN capabilities even in application areas where resource management is crucial to deployment. Nagrath et al. [14] proposed a deep learning model that makes use of Single Shot multibox Detector (SSD) for face detection in real-life images of people's faces with and without face masks. Transfer learning is applied, where the MobileNetV2 architecture is used for the classifier framework, alongside the pretrained weights from ImageNet. This model is suitable for real-time classification in the application domain due to its lightweight architecture. Due to the scarcity of suitably sized datasets in this area, the work involved using a combination of freely available datasets, such as those from Kaggle challenges. The authors of the work express their reluctance to use a dataset where masks are artificially added in the image, so 5521 images of real-life people wearing, and not wearing, masks were created. The work experimented with various pretrained models on the augmented dataset. Results showed that the proposed model outperforms LeNet-5 and AlexNet in accuracy, in addition to achieving the highest F1 score compared to LeNet-5, AlexNet, VGG16 and ResNet-50. Chavda et al. [4] present a dual stage CNN architecture for detecting facial masks. Firstly, a face detector identifies multiple faces in the same image as Regions of Interest (ROI). These are grouped and forwarded to stage 2 of the architecture, where the CNN classifies into a binary separation of masked or not masked. The output is the input images, where the faces are highlighted with a bounding box and their classification label. The CNN was trained with three popular classification architectures, namely DenseNet121, MobileNetV2 and NASNetMobile. The average inference speed of the three models was also measured, showing that DenseNet121 was the slowest at 0.353 s. It was concluded that NASNetMobile is the most suitable for applications operating in real-time. Kayali et al. [9] explore deep learning methods to accurately detect and classify face masks. To obtain a dataset suitable for the task, the researchers used images from the Labeled Faces in the Wild (LFW) database, and added face masks to the images of people's faces. Three classes were created: correct wearing, incorrect wearing, and no mask present (classes 0, 1, 2). Transfer Learning was utilised, whereby the performance of NASNetMobile and ResNet50 were compared. These pretrained models were chosen due to their contrast in depth of parameters; ResNet50 represents the performance of a deep network for this task, whereas NASNetMobile demonstrates lighter weight network potential. The images were sized at 128 × 128; models were trained for 200 epochs; and a small learning rate of 0.0000001 was used for Adam. Interestingly, it took 80 min to train NASNetMobile, and just 60 min for ResNet50, even though the former is the
lighter weight network. NASNetMobile showed poor performance for this task, accurately classifying just 33/499 for class 0 and 35/485 for class 1, but with 100% accuracy for class 2. This indicates that its architecture is not suitable for this problem domain. On the other hand, ResNet50 demonstrated 92.38%, 89.48% and 93.61% for classes 0, 1, 2 respectively. The work concluded that this network has an overall classification accuracy of 92%. Fasfous et al. [6] present a low-power Binary Neural-Network (BNN) to classify face-mask wearing, and the position of the mask on the face. Furthermore, the classifier was deployed to Edge Devices to mitigate chances of data exploitation, and maintain data privacy; the research also describes how using a BNN alongside this deployment method reduces the memory footprint of the network, as parameters are represented in the binary domain. The work used the MaskedFace-Net dataset, and reports that in its original form, a large class imbalance exists, where 51% of the dataset is dominated by correct wearing of face masks. To combat this, the larger classes were sampled randomly in order to increase the contribution of the smaller classes. Heavy data augmentation techniques were then applied to the now balanced data, resulting in 110k images for training and validation, with a large test set of 28k images. With the images sized at 32 × 32, the network was able to achieve a classification accuracy of 98% for the four mask-wearing positions on the face. The work boasts good model generalisability, and therefore reliability when presented with varying facial structures, hair types, skin tones, as well as age groups. Bhuiyan et al. [3] develop an assistive system with Deep Learning which is used to classify the presence of face masks. The work employs a binary classification problem, where each face in an image receives a prediction of Mask or No Mask. 
The project involved extensive data analysis and preprocessing, where a web-scraping tool was used to pinpoint 650 images of people wearing, and not wearing, facial masks. The data was preprocessed to remove any images considered irrelevant to the task, resulting in 600 for training; there was an even distribution of 50% between the binary input classes. It was necessary to then label the acquired data to ensure its suitability to the task: the use of LabelIMG annotated all data samples. The authors describe the process of drawing bounding boxes in each image, where some contain multiple bounding boxes due to the need to identify any objects detected, and the presence of multiple people. With regard to model development and training, 4000 epochs of training facilitated by Google Colab achieved 96% accuracy and 0.073 loss. With this performance level, the research was able to progress to deploying the model to classify video captured in real-time, achieving on average 17 frames-per-second. Although the results show good promise, the authors conclude the paper with the fact that the dataset used to train is not highly varied. This is likely to be disadvantageous if the application is used in-the-wild, potentially meaning that people are entering crowded, or indoor spaces without wearing a face covering, which is detrimental to public health. Further work for this study is outlined as experimenting with varying object detection, such as RCNN, and YOLOv4 when available for public use.
Singh et al. [16] begin their work by describing that current methods to address this problem domain mainly revolve around using simple CNN networks for binary classification of mask or no mask. However, the work advocates that the first step in the method should always be object detection, whereby bounding boxes are placed around faces in images. The classification of mask wearing should be the second step in the method, so that analysis of compliance can take place. This work focuses on loss values to judge the performance of each network; Transfer Learning with YOLOv3 achieved a validation loss of 0.25, compared with a Faster RCNN validation loss of 0.15. The authors conclude that although the latter network has a better performance, for real-world deployment, YOLOv3 should be preferred due to its reduced inference time. Koklu et al. [10] experiment with Transfer Learning, Long-Short Term Memory networks (LSTM), and bi-directional LSTM networks for face mask determination. The work involved creating a dataset of 2000 images, where the same person is captured three times to create enough data for four classes: masked, nonmasked, masked but with the nose exposed, and mask under the chin. A total of six experiments were carried out using two pretrained models: AlexNet and VGG16. The first approach was simple Transfer Learning, where the pretrained models are trained on the new dataset; the second involved removing the classification header for both pretrained models and replacing it with an LSTM structure; the third, replacing the classification layers with a bi-directional LSTM architecture. All experimental results achieved accuracy scores above 90%, with the most modest result coming from transfer learning with AlexNet at 90.33%, and the best, 95.67%, with VGG16 using bi-directional LSTM as the classification layer. The best recall was for no mask present, and the worst was for mask under the chin.
3 3.1
Data Description
The dataset used, at the time of writing, is the largest available containing images of people wearing face masks in real life. There are 250,000 images available in total, comprising 28,000 different people, alongside showing four varying types of face mask. The data is distributed between seven separate folders, available to download from Kaggle [15]. The data is spread across four classes: 0- No mask, 1- Mask but nose and mouth exposed, 2- Mask but nose exposed, 3- Mask is worn correctly. See Fig. 1 for examples of each class.
Fig. 1. Kaggle dataset training examples
The research is limited by resource constraints, thus unable to utilise the available 250,000 images, and so makes use of parts 1–4 of the available data. This totals 160k images. 30,000 of these images were reserved for the final testing dataset. Allowing 20% of the training dataset for validation resulted in 104k for training. The original images were of varying sizes, and very large, most exceeding 1024 for height and width. Inspection of the dataset demonstrated an even distribution between the four classes, with 32,438 images per class. This will be advantageous to the performance of each model; in work by [8] it is explained that for imbalanced datasets, networks have a tendency to over-classify samples consistently to the class with the most samples. In such circumstances, the minority class is frequently classified incorrectly, resulting in poor performance on unseen data. However, the class balance shown in this dataset mitigates the chance of this occurrence. For neural networks to show the best performance when deployed as an application, good generalisation to unseen data is imperative. This can be improved when the training data has a large variation of samples. A variable in this dataset is the gender of the person shown in each image; men and women present different facial characteristics, thus providing the classifier with some variation. The majority of images in this dataset were labelled with the gender of the person, however 25.55% were marked as None. The dataset is heavily dominated by images of males at 51.43%, and only 23.01% are of women. Refer to Fig. 2(a) for the distribution. An additional variable observed within the dataset is the age of the person in the image. Figure 2(b) shows the distribution of age groups in the training data. It shows that images of people aged 20–30 years old dominate the data, but the range spans 18–79 years. 
On initial inspection, analysis showed that some images contained incorrect values for age, such as 2020. This inaccurately skewed the analysis, so a Python script was used to find images where the age value exceeded 100, and the person's age was simply estimated.
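A filter of the kind described above might look as follows. This is a minimal sketch, assuming the metadata is available as a list of records with an `age` field; the record layout, file names and the function name `flag_implausible_ages` are hypothetical, since the dataset's metadata format is not specified here.

```python
def flag_implausible_ages(records, max_age=100):
    """Return the records whose 'age' exceeds max_age (e.g. a year such as 2020)."""
    return [r for r in records if r["age"] > max_age]

# Hypothetical metadata records, for illustration only
records = [
    {"file": "a.jpg", "age": 27},
    {"file": "b.jpg", "age": 2020},   # a year entered in the age field
    {"file": "c.jpg", "age": 64},
]
flagged = flag_implausible_ages(records)
print([r["file"] for r in flagged])
```

The flagged records can then be reviewed by hand and given an estimated age.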
Fig. 2. Data analytics: (a) gender distribution; (b) age distribution
3.2 MaskedFace-Net
MaskedFace-Net is a dataset comprising 137,016 images of people’s faces, all of which have had a surgical mask photoshopped onto them. There are 4 possible classes at this stage: (1) the person in the image is wearing the face mask correctly, with chin, mouth and nose covered; (2) the mask covers the chin only; (3) the mask covers the mouth and chin only; (4) the mask covers the nose and mouth, leaving the chin exposed. The correctly masked class dominates this dataset at 49%. Figure 3 shows some sample images.
Fig. 3. MaskedFace-Net image samples
3.3 Preprocessing
This section describes the preprocessing applied to the datasets before model development and training could commence. Firstly, all images were resized to a uniform 300 × 300; this value was chosen so that significant experimentation could be carried out with respect to imposed hardware constraints, whilst maintaining the significant features of the data. Second, for ease of implementation, the images were organised into folders corresponding to their class. This was achieved through creating a script that extracted the class label from the filename, and using the os library to iterate files and move to a specified directory.
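The folder organisation step can be sketched with the standard library alone. The naming scheme `<id>_<class>.jpg` used below is a hypothetical assumption made for illustration (the actual filename format is not given in the text), as are the function names; resizing to 300 × 300 is not shown.

```python
import os
import shutil
import tempfile

def label_from_filename(filename):
    """Hypothetical naming scheme '<id>_<class>.jpg', e.g. '000123_2.jpg' -> '2'."""
    stem, _ = os.path.splitext(filename)
    return stem.rsplit("_", 1)[-1]

def organise_by_class(src_dir, dst_dir):
    """Move every file in src_dir into dst_dir/<class label>/."""
    for name in sorted(os.listdir(src_dir)):
        path = os.path.join(src_dir, name)
        if not os.path.isfile(path):
            continue
        target = os.path.join(dst_dir, label_from_filename(name))
        os.makedirs(target, exist_ok=True)
        shutil.move(path, os.path.join(target, name))

# Small self-contained demonstration on empty placeholder files
with tempfile.TemporaryDirectory() as root:
    src, dst = os.path.join(root, "raw"), os.path.join(root, "by_class")
    os.makedirs(src)
    for name in ("000001_0.jpg", "000002_3.jpg", "000003_3.jpg"):
        open(os.path.join(src, name), "w").close()
    organise_by_class(src, dst)
    layout = {c: sorted(os.listdir(os.path.join(dst, c))) for c in sorted(os.listdir(dst))}
print(layout)
```

A one-folder-per-class layout is convenient because many image-loading utilities infer labels from directory names.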
4 Model Experimentation
The work experiments with varied implementations of Convolutional Neural Networks to classify input images into one of four classes. The model demonstrating the best performance during testing is used to classify input captured from real-time video. Each model and its performance in terms of accuracy and loss on the training and validation sets are explained and analysed.
4.1 Training and Validation
Three CNN models were trained to test for the highest accuracy. The total number of images used for training is 130k, 20% of which are used in the validation set. Graphs plotting the train accuracy against the validation accuracy are recorded, along with the loss. All models consisted of 2D-Convolutional layers, doubling the number of filters at every layer: the activation function used is ReLU and padding has been set to 'same'. This padding setting enables the video-stream mask detection application to work correctly, as a difference in padding would result in the methods implemented with OpenCV sending inaccurate image shapes to the model for classification. MaxPooling2D with a size of 2 × 2 was implemented after every convolutional layer. Inputs are flattened after the filters have been applied, and the data is passed into a Dense layer before classification, a Dense layer with 5 neurons using the SoftMax activation.

4.2 Model A
The first model used input images in the greyscale colour space and had a total of 2,827,205 trainable parameters. The model architecture is shown in Table 1.

Table 1. Model-A layers

| Layer | Type | Configuration |
|---|---|---|
| 1 | Convolutional | Conv2D (filters = 16) |
| 2 | Pooling | MaxPool2D (2 × 2) |
| 3 | Convolutional | Conv2D (filters = 32) |
| 4 | Pooling | MaxPool2D (2 × 2) |
| 6 | Convolutional | Conv2D (filters = 64) |
| 7 | Pooling | MaxPool2D (2 × 2) |
| 8 | Flatten | |
| 9 | Dense | Neurons = 32 |
| 10 | Dense (SoftMax) | Neurons = 4 |
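The size of these networks can be recomputed from the layer lists alone. The sketch below is a worked check, not the authors' code. It assumes 3 × 3 convolution kernels (the kernel size is not stated in the text), a 300 × 300 single-channel input, and pooling that rounds the feature-map size down. Under these assumptions the four-neuron output layer of Table 1 gives 2,827,172 parameters, the total listed for Model-A in Table 4, and a five-neuron output layer as mentioned in Sect. 4.1 gives the 2,827,205 quoted above.

```python
def count_params(input_size, channels, conv_filters, dense_units, n_classes, kernel=3):
    """Count weights and biases of a Conv -> MaxPool stack followed by two Dense layers."""
    total = 0
    size = input_size
    for filters in conv_filters:
        # 'same' padding keeps the spatial size; each filter has kernel*kernel*channels weights + 1 bias
        total += kernel * kernel * channels * filters + filters
        channels = filters
        size //= 2  # 2 x 2 max pooling halves the feature map (rounding down, an assumption)
    flat = size * size * channels
    total += flat * dense_units + dense_units    # hidden Dense layer
    total += dense_units * n_classes + n_classes  # SoftMax output layer
    return total


model_a = count_params(300, 1, [16, 32, 64], dense_units=32, n_classes=4)
model_b = count_params(300, 1, [16, 32, 64, 128], dense_units=32, n_classes=4)
print(model_a, model_b)  # prints: 2827172 1424420
```

Almost all of the parameters sit in the first Dense layer, which is why Model-B, with one more pooling stage before flattening, ends up smaller than Model-A.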
Figure 4 shows the training accuracy and loss against validation accuracy and loss respectively. The accuracy of the model on the training data grows over the specified number of epochs. The accuracy on the validation set is steadier but lower on average, which could indicate that the model is over-fitting slightly. The training loss decreases steadily over time, while the validation loss decreases only for the first three epochs and then increases constantly without improving.
4.3 Model B
The second model also used input images in the greyscale colour-space and had a total of 1,424,453 trainable parameters. The model architecture is shown in Table 2.
L. Garbagna et al.
Fig. 4. Model-A training: (a) Training vs Validation Accuracy, (b) Training vs Validation Loss

Table 2. Model-B layers

| Layer | Type | Configuration |
|---|---|---|
| 1 | Convolutional | Conv2D (filters = 16) |
| 2 | Pooling | MaxPool2D (2 × 2) |
| 3 | Convolutional | Conv2D (filters = 32) |
| 4 | Pooling | MaxPool2D (2 × 2) |
| 6 | Convolutional | Conv2D (filters = 64) |
| 7 | Pooling | MaxPool2D (2 × 2) |
| 8 | Convolutional | Conv2D (filters = 128) |
| 9 | Pooling | MaxPool2D (2 × 2) |
| 10 | Flatten | |
| 11 | Dense | Neurons = 32 |
| 12 | Dense (SoftMax) | Neurons = 4 |
Figure 5 shows the training accuracy against validation accuracy and training loss against validation loss respectively. This model is an improvement, as it over-fits less than Model-A. As with Model-A, the validation loss stops decreasing after epoch 3, but it reaches a lower value and then increases by a smaller amount than in Model-A.
4.4 Model C
Table 3 shows the architecture of Model-C, which takes RGB input images and has a total of 3,047,589 trainable parameters. Note that Batch Normalisation was implemented after each Convolutional layer in the network. Figure 6 shows the training accuracy against validation accuracy and training loss against validation loss respectively. Performance is superior to the previous two models, and both the accuracy and loss values for training and validation present a smaller gap than in the previous architectures. This model showed good performance during validation: 0.9681 and 0.1101 for validation accuracy and loss respectively. The model was saved as a JSON file along with the weights. MobileNetV2 and ResNet50 have also been trained using Transfer Learning. Table 4 compares the performance of all five experimental methods.
Fig. 5. Model-B training: (a) Training vs Validation Accuracy, (b) Training vs Validation Loss

Table 3. Model-C layers

| Layer | Type | Configuration |
|---|---|---|
| 1 | Convolutional | Conv2D (filters = 16) |
| 2 | Pooling | MaxPool2D (2 × 2) |
| 3 | Convolutional | Conv2D (filters = 32) |
| 4 | Pooling | MaxPool2D (2 × 2) |
| 6 | Convolutional | Conv2D (filters = 64) |
| 7 | Pooling | MaxPool2D (2 × 2) |
| 8 | Convolutional | Conv2D (filters = 128) |
| 9 | Pooling | MaxPool2D (2 × 2) |
| 10 | Convolutional | Conv2D (filters = 256) |
| 11 | Pooling | MaxPool2D (2 × 2) |
| 12 | Flatten | |
| 13 | Dense | Neurons = 128 |
| 14 | Dense (SoftMax) | Neurons = 4 |

Table 4. Model comparison

| | Model-A | Model-B | Model-C | MobileNet-V2 | ResNet50 |
|---|---|---|---|---|---|
| Color mode | Greyscale | Greyscale | RGB | RGB | RGB |
| Total params | 2,827,172 | 1,424,420 | 3,049,444 | 10,294,788 | 24,406,916 |
| Time/Epoch | 187 s | 160 s | 277 s | 171 s | 750 s |
| Train accuracy | 0.8797 | 0.9462 | 0.9833 | 0.7433 | 0.9502 |
| Validation accuracy | 0.8488 | 0.9250 | 0.9681 | 0.7277 | 0.9261 |
| Train loss | 0.3188 | 0.1498 | 0.0521 | 0.6514 | 1.924 |
| Validation loss | 0.4101 | 0.2179 | 0.1101 | 0.6855 | 4.0052 |

Fig. 6. Model-C training: (a) Training vs Validation Accuracy, (b) Training vs Validation Loss

5 Application
This section describes the use of the proposed end application as a proof of concept. A script takes the JSON and weights files of each model, and loads a prediction method that returns the state of the mask. The argmax function from the NumPy library (np.argmax) is used to obtain the ID of the class (0 to 3), instead of the probability for each class. OpenCV [2] is then used to import the Haar classifier [17]: when the video stream from the webcam is activated, or an image file is presented, the classifier detects any faces present. The Region of Interest (ROI), the face inside the bounding box, is resized to 300 × 300 and fed to the prediction method; the state is classified, and the text associated with the ID of the class is shown on top of the bounding box around the face. The models were tested on unseen images and webcam footage, and the results are described in a later section of the paper. Deployment of the classifier on unseen images and a webcam feed is representative of how it might behave in the wild. Figure 7 shows an example of the concept used for the application, with four images showing four different mask states.
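The step from class probabilities to the text drawn above the bounding box can be sketched in a few lines. This is an illustration only: the order in which the four labels map to IDs 0 to 3 is not given in the text, so the `CLASS_NAMES` list below is a hypothetical ordering, and plain Python is used in place of np.argmax so that the example has no dependencies.

```python
# Hypothetical ID order; the labels themselves are those shown in Fig. 7.
CLASS_NAMES = ["Mask", "No Mask", "Nose", "Mouth Nose"]


def predict_label(probabilities):
    """Return (class ID, label) for one SoftMax output vector.

    Picks the index of the largest probability, which is what np.argmax
    does for a one-dimensional array (first index wins on ties).
    """
    class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
    return class_id, CLASS_NAMES[class_id]


class_id, label = predict_label([0.05, 0.90, 0.03, 0.02])
print(class_id, label)  # prints: 1 No Mask
```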
Fig. 7. Example of application by image classification: (a) Mask, (b) No Mask, (c) Nose, (d) Mouth Nose
6 Results
Table 5 shows the results of the CNN models on the unseen test set, which contains 7,500 images per class. Among the three custom models, Model-C, the only one using RGB images as input, reaches the highest accuracy score of 0.9663 and the lowest loss value of 0.1272. Comparing the two models that used greyscale images, Model-B outperformed Model-A, reaching an accuracy of 0.9241 against 0.8488: even with fewer parameters, the additional Convolutional layer allowed the network to learn more significant features, thus generalising better.
Table 5. CNN models test results

| Model | Color mode | Accuracy | Loss |
|---|---|---|---|
| Model-A | Greyscale | 0.8488 | 0.4104 |
| Model-B | Greyscale | 0.9241 | 0.2323 |
| Model-C | RGB | 0.9663 | 0.1272 |
| MobileNet-V2 | RGB | 0.7353 | 0.6772 |
| ResNet50 | RGB | 0.9261 | 4.2984 |
Because previous publications have applied pretrained models to new datasets, this research also implemented two of these architectures: MobileNetV2 and ResNet50. The first under-performed and ranked last: the smaller image size (224 × 224) and the lower number of parameters compared to other models influenced its poor accuracy score of 0.7353. On the other hand, ResNet50 was the second-best model for accuracy at 0.9261, but presented a significant loss of 4.2984: the model could predict the class correctly most of the time, but it displayed high uncertainty about the decision. To give a better understanding of these results, confusion matrices for each architecture are plotted in Fig. 8 and their numerical values are given in Tables 6, 7, 8, 9 and 10.
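Accuracy only records whether the top-scoring class is correct, whereas categorical cross-entropy also depends on how much probability is given to the true class, so the two measures can diverge as they do for ResNet50. The sketch below illustrates this with two made-up classifiers that have the same accuracy; the probabilities are hypothetical and are not taken from the experiments.

```python
import math


def mean_cross_entropy(true_class_probs):
    """Mean categorical cross-entropy, given the probability assigned to the true class."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


# Both hypothetical classifiers are correct on 93 of 100 samples.
# Each list holds the probability given to the TRUE class of every sample.
cautious = [0.90] * 93 + [0.30] * 7         # wrong answers made with moderate confidence
overconfident = [0.999] * 93 + [1e-6] * 7   # wrong answers made with near-certainty

loss_cautious = mean_cross_entropy(cautious)
loss_overconfident = mean_cross_entropy(overconfident)

# Reference point: always predicting 0.25 for each of four classes gives ln(4).
loss_uniform = mean_cross_entropy([0.25])
```

With these inputs the second classifier has the larger loss even though its accuracy is identical, because a handful of samples whose true class receives almost no probability dominate the mean. The uniform reference value, ln 4 ≈ 1.386, gives a sense of scale for four-class losses.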
Fig. 8. Confusion matrices: (a) Model-A, (b) Model-B, (c) Model-C, (d) MobileNet, (e) ResNet50

Table 6. Model-A confusion matrix

| | Mask | No mask | Mouth nose | Nose |
|---|---|---|---|---|
| Mask | 6701 | 184 | 61 | 554 |
| No mask | 105 | 6228 | 955 | 212 |
| Mouth nose | 99 | 979 | 6100 | 322 |
| Nose | 609 | 285 | 171 | 6435 |

Table 7. Model-B confusion matrix

| | Mask | No mask | Mouth nose | Nose |
|---|---|---|---|---|
| Mask | 7167 | 91 | 18 | 224 |
| No mask | 40 | 7077 | 325 | 58 |
| Mouth nose | 44 | 1060 | 6286 | 110 |
| Nose | 135 | 103 | 69 | 7193 |

Table 8. Model-C confusion matrix

| | Mask | No mask | Mouth nose | Nose |
|---|---|---|---|---|
| Mask | 7367 | 34 | 4 | 95 |
| No mask | 23 | 7320 | 133 | 24 |
| Mouth nose | 18 | 500 | 6906 | 76 |
| Nose | 69 | 23 | 13 | 7395 |

Table 9. MobileNet confusion matrix

| | Mask | No mask | Mouth nose | Nose |
|---|---|---|---|---|
| Mask | 6057 | 396 | 219 | 828 |
| No mask | 575 | 5329 | 1325 | 271 |
| Mouth nose | 363 | 920 | 5796 | 421 |
| Nose | 1555 | 454 | 613 | 4878 |

Table 10. ResNet50 confusion matrix

| | Mask | No mask | Mouth nose | Nose |
|---|---|---|---|---|
| Mask | 6900 | 10 | 13 | 577 |
| No mask | 29 | 7007 | 414 | 50 |
| Mouth nose | 18 | 496 | 6777 | 210 |
| Nose | 349 | 13 | 38 | 7100 |
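The headline figures for Model-C can be recomputed directly from its confusion matrix. The sketch below assumes that each row holds the true class and each column the predicted class; the tables do not label the axes, but this reading is the one under which every row sums to the 7,500 test images per class.

```python
CLASSES = ["Mask", "No mask", "Mouth nose", "Nose"]

# Table 8, rows assumed to be the true class and columns the predicted class.
MODEL_C = [
    [7367, 34, 4, 95],
    [23, 7320, 133, 24],
    [18, 500, 6906, 76],
    [69, 23, 13, 7395],
]


def summarise(matrix):
    """Return overall accuracy and the number of misclassified images per true class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    errors = [sum(row) - row[i] for i, row in enumerate(matrix)]
    return correct / total, errors


accuracy, errors = summarise(MODEL_C)
print(round(accuracy, 4), dict(zip(CLASSES, errors)))
# prints: 0.9663 {'Mask': 133, 'No mask': 180, 'Mouth nose': 594, 'Nose': 105}
```

The accuracy agrees with the 0.9663 reported for Model-C in Table 5.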
Tables 6, 7, 8, 9 and 10 confirm what is shown by the evaluation on the test sets: Model-C shows very little confusion between classes. The model identifies the Mask and Nose classes nearly perfectly, classifying 7367 and 7395 images correctly out of the 7500 images per class. Although it misclassified 180 images for the No Mask class and 594 for the Mouth Nose class, its generalisation on test and unseen images was highly acceptable. The only other model with acceptable performance is ResNet50: although it performs worse, especially in the Mouth Nose class where it misclassified 723 images, it confirms the high accuracy scored on the dataset, even considering its low confidence in classification. The other models present a lot of confusion between the classes, classifying many of the images incorrectly, particularly in the Mouth Nose class. This is most likely due to the additional features needed to thoroughly map the contours of the masks and to detect the edges of the chin.

Table 11 shows the results of the same three models tested on the MaskedFace-Net test set. At first glance, Model-C achieves an even higher accuracy and a lower loss than on the second dataset, scoring an accuracy of 0.9896 against 0.9706 and a loss of 0.0469 against 0.1115. Model-A and Model-B also performed better on this dataset: the lowest accuracy, achieved by Model-A, was 0.9777. Although these models are more accurate, the generalisation performance of the MaskedFace-Net models was poor, as described in the Discussion.

Table 11. CNN models test results for MaskedFace-Net

| Model | Color mode | Accuracy | Loss |
|---|---|---|---|
| Model-A | Greyscale | 0.9777 | 0.1239 |
| Model-B | Greyscale | 0.9833 | 0.0892 |
| Model-C | RGB | 0.9896 | 0.0469 |

6.1 Real-Time Classifier
The web framework Flask [1] was used to show the feed captured by the webcam to the internet browser, so the output can be shown to the user. The output consists of a bounding box around the face(s), and the class prediction from the model in real-time, as demonstrated in Fig. 9.
Fig. 9. Real-time mask classification examples: (a) Mask, (b) Chin, (c) Mouth Chin
Although the application works correctly, some limitations have been observed. For the video stream, a greyscale model trained on MaskedFace-Net was used; this model was trained on four classes, without the No Mask label. The Haar classifier from OpenCV used here also had some constraints: in conditions of poor or excessive lighting, the classifier had difficulty detecting both the face and the mask of the subject.
7 Discussion
This work contributes a novel perspective to the application domain of Face Mask Classification. It uses a large dataset of images in which people are wearing physical face masks, in contrast to other relevant work that uses masks added with Photoshop. This work is thus more representative of how a system such as the one proposed would behave if deployed in a real-life setting, such as entrances to public buildings and transport. The paper provides an in-depth analysis of the performance of various CNNs in the application domain. Custom architectures are presented for the task; the performance of several pretrained models with transfer learning is investigated to understand the capabilities of shallow and deeper networks; and a comparison between available datasets for the task is provided.

The best performing network, Model-C, was trained using RGB images to increase the number of trainable parameters in order to improve training. With RGB inputs, time per epoch increases slightly, but the architecture is able to learn more features by mapping the colours of the person's skin and the mask. Another advantage is that the network can learn different mask colours in relation to skin tones that might closely resemble the mask, whereas a greyscale model might be confused by the closer intensity of the pixel values. Although the model trained on MaskedFace-Net outperformed its results on the second dataset, some major drawbacks were found. Due to the nature of MaskedFace-Net, it was observed during different tests that the model often struggles to correctly identify images where the person has a darker skin tone, especially when combined with a darker face mask. Furthermore, as the images contained only surgical masks applied with Photoshop, the model struggles to generalise to other types of masks. Because of these problems, Model-C generalises best when trained on the real masks dataset, as it achieves a high accuracy score whilst also being able to categorise different types of masks worn by people of various ethnicities.

This area of research is relatively new, given the short time that has passed since the beginning of the pandemic. The available data therefore varies in size and suitability. The MaskedFace-Net dataset provides a large pool of images and, to the best of the authors' knowledge, was the first to cover all variations of incorrect mask-wearing behaviour. However, due to the photoshopped masks, it lacks the degree of realism that the main dataset used in this paper provides. MaskedFace-Net also has a large imbalance between classes: 51% of its images show incorrectly masked faces, but only 10% of these are nose-exposure images. Without addressing the issue, this will inevitably lead to under-performing models. Using the other dataset somewhat mitigates this issue, as shown by the improved generalisability.
8 Conclusion and Further Work
Multiple neural networks were trained using a large dataset of people wearing real masks in one of four ways, to detect the variations in mask-wearing behaviour. The experimentation makes use of three models with a simple CNN architecture, two of which use greyscale images while the third uses RGB input. The models were compared and their performance evaluated against that of existing pretrained models with transfer learning, namely MobileNet and ResNet50. These specific networks were chosen to compare and contrast the abilities of shallow networks and those using more layers for this specific problem domain. The work concludes that a simpler CNN architecture taking RGB images as input yields the best performance. This particular model was able to classify images in the test dataset with an accuracy of 96.63%. The OpenCV implementation demonstrated the system's capability to classify mask state accurately in real time, supporting the proof of concept. Taking the limitations of both the model and the application into consideration, further improvements can be applied to both. The dataset can be expanded by introducing more pictures with different kinds of masks, apart from surgical ones. Data augmentation could be applied if the number of new pictures is not enough to bring any significant improvement to the model. On the application side, a new face detection system could be applied to reduce the limitations imposed by the Haar classifier. The work could also benefit from testing the performance of other pretrained models, such as NASNetLarge, provided that suitable resources are available for such heavy computation.
References
1. Flask (2010)
2. OpenCV (2021)
3. Bhuiyan, R., Khushbu, S.A., Islam, S.: A deep learning based assistive system to classify COVID-19 face mask for human safety with YOLOv3. In: 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–5 (2020)
4. Chavda, A., Damani, A., Dsouza, J., Badgujar, S.: Multi-stage CNN architecture for face mask detection. In: 2021 IEEE 6th International Conference for Convergence in Technology (I2CT), Pune, India, pp. 1–8 (2021)
5. Das, A., Ansari, M., Basak, R.: Covid-19 face mask detection using TensorFlow, Keras and OpenCV. In: 2020 IEEE 17th India Council International Conference (INDICON), New Delhi, India, pp. 1–5 (2020)
6. Fasfous, N., et al.: BinaryCoP: binary neural network-based COVID-19 face-mask wear and positioning predictor on edge devices. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 108–115 (2021)
7. European Centre for Disease Prevention and Control: Using face masks in the community: first update (2021)
8. Johnson, J.M., Khoshgoftaar, T.M.: Survey on deep learning with class imbalance. J. Big Data 6(1), 1–54 (2019)
9. Kayali, D., Dimililer, K., Sekeroglu, B.: Face mask detection and classification for COVID-19 using deep learning. In: 2021 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), pp. 1–6 (2021)
10. Koklu, M., Cinar, I., Taspinar, Y.S.: CNN-based bi-directional and directional long-short term memory network for determination of face mask. Biomed. Sign. Process. Control 71, 103216 (2022)
11. Loey, M., Manogaran, G., Taha, M.H.N., Khalifa, N.E.M.: Fighting against COVID-19: a novel deep learning model based on YOLO-v2 with ResNet-50 for medical face mask detection. Sustain. Urban Areas 65, 102600 (2021)
12. Loey, M., Manogaran, G., Taha, M.H.N., Khalifa, N.E.M.: A hybrid deep transfer learning model with machine learning methods for face mask detection in the era of the COVID-19 pandemic. Measurement 167, 108288 (2021)
13. Mohan, P., Paul, A.J., Chirania, A.: A tiny CNN architecture for medical face mask detection for resource-constrained endpoints. In: Mekhilef, S., Favorskaya, M., Pandey, R.K., Shaw, R.N. (eds.) Innovations in Electrical and Electronic Engineering. LNEE, vol. 756, pp. 657–670. Springer, Singapore (2021). https://doi.org/10.1007/978-981-16-0749-3_52
14. Nagrath, P., Jain, R., Madan, A., Arora, R., Kataria, P., Hemanth, J.: A real time DNN-based face mask detection system using single shot multibox detector and MobileNetV2. Sustain. Urban Areas 66, 102692 (2021)
15. Roman, K.: 500 GB of images with people wearing masks. Part 1 (2021)
16. Singh, S., Ahuja, U., Kumar, M., Kumar, K., Sachdeva, M.: Face mask detection using YOLOv3 and faster R-CNN models: COVID-19 environment. Multimedia Tools Appl. 80(13), 19753–19768 (2021)
17. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, pp. 511–518 (December 2001)
18. Yang, J.: A quick history of why Asians wear surgical masks in public (2014)
Urban Tree Detection and Species Classification Using Aerial Imagery
Mahdi Maktab Dar Oghaz, Lakshmi Babu Saheer, and Javad Zarrin
Faculty of Science and Engineering, Anglia Ruskin University, Cambridge, UK
[email protected]
Abstract. Trees are essential for climate change adaptation and, to some extent, even mitigation. To leverage their potential, effective forest and urban tree management is required. Automated tree detection, localisation, and species classification are crucial to any forest and urban tree management plan. Over the last decade, many studies have aimed at tree species classification using aerial imagery, yet due to several environmental challenges the results were sub-optimal. This study aims to contribute to this domain by first generating a labelled tree species dataset, using the Google Maps static API to supply aerial images and the Trees In Camden inventory to supply species information, GPS coordinates (latitude and longitude), and tree diameter. Furthermore, this study investigates how state-of-the-art deep Convolutional Neural Network models including VGG19, ResNet50, DenseNet121, and InceptionV3 can handle the species classification problem for urban trees using aerial images. Experimental results show that our best model, InceptionV3, achieves an average accuracy of 73.54 over 6 tree species.
Keywords: Urban tree detection · Convolutional Neural Network · Aerial imagery
1 Introduction
Trees are well recognised for their importance to the climate and human life. Environmentally, trees slow surface runoff from rainfall, reducing the risk of flood, water pollution and soil erosion [4]. In urban areas, trees improve overall air quality by absorbing particulate matter and create a cooling effect which helps in adapting to the “heat island” effect [17]. Moreover, urban trees play a key role in climate change adaptation or even mitigation by reducing CO2 levels, the main contributor to climate change. Urban trees also improve the perception of an area by blocking noise, dust, wind and glare [7]. Studies indicate that urban trees can reduce indoor heating and cooling expenses by blocking the wind and weather and casting shade around the housing area [33]. In order to exploit this potential, effective forest and urban tree monitoring and management is essential. This requires information about the composition, species, age, health, and location of trees, which helps in better planning of plantation programs, growth monitoring and pruning. It also facilitates biodiversity of the vegetation and promotes a robust ecosystem with greater resilience to disease and pests and better productivity [1,9]. Such a management system demands a reliable yet economically viable platform that automatically detects, classifies and monitors forests and urban trees, to guide policy makers in devising better long-term management strategies and to ensure long-term sustainability.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 469–483, 2022. https://doi.org/10.1007/978-3-031-10464-0_32

Historically, experts and volunteers on the ground were in charge of this laborious and time-consuming management work. However, the advent of aerial, satellite and LiDAR imagery has added a new dimension to these practices [25]. LiDAR technology has repeatedly been used to estimate the number of trees in an area [32] and to categorise their species [12,13,22]. Despite numerous advantages, LiDAR surveys are costly due to the specialist equipment and skilled analysts required to interpret the data [24]. An alternative technology for tree management and monitoring is the use of hyperspectral and remote sensing satellite images. These techniques have advanced significantly over the last couple of decades and are now able to produce high-resolution images, which facilitate individual tree crown detection and species classification [6,8,18]. A limited number of studies have looked into urban tree classification using RGB aerial images [27,30]. Wegner and Branson [30] proposed a CNN-based system to catalogue and classify urban trees using publicly available Google satellite images. Their model was trained and tested on tree crown images from the Pasadena region of California. In other research, Nezami [21] achieved 97% accuracy in tree classification using aerial images and Convolutional Neural Networks, which testifies to the effectiveness of these approaches. Despite its high accuracy, that study focused on only three species, which is limited and less practical in the real world.
The aim of this study is first to generate a labelled tree species dataset of aerial images to facilitate detection, classification and localisation of urban trees, using publicly available Google Maps aerial images and the Trees In Camden inventory to supply GPS locations (latitude and longitude), diameter and species information for the trees. This study also aims to assess the performance of various state-of-the-art pre-trained deep convolutional neural networks, including VGG19, ResNet50, DenseNet121 and InceptionV3, in tree species classification under various training scenarios and parameters. The next section of this paper outlines the dataset preparation process.
2 Dataset Generator Framework
This study proposes a Dataset Generator framework, designed to generate a labelled dataset of tree species using aerial RGB images and any given tree inventory that supplies species information, GPS coordinates (latitude and longitude) and tree diameter. This study uses the Google Maps static API to supply aerial RGB images, which is a quick and cost-effective way of collecting image data. This method is especially useful for urban trees, as Google offers aerial images of significantly high quality in urban areas. To supply species information, GPS coordinates and tree diameter, this study uses the Trees In Camden inventory, which contains over 23,000 locations of Council-owned trees on highways and in parks and open spaces in the London Borough of Camden. Each data point contains tree species, height, spread, diameter at breast height (DBH), and maturity [5]. While this inventory contains hundreds of different tree species, this study only investigates the six most frequent species: Ash, Silver Birch, Common Lime, London Plane, Norway Maple and Sycamore. The data is split into subsets, with 70% for training, 20% for validation and 10% reserved as unseen test data. The proportional representation of each species is preserved across the subsets, so that any class imbalance is retained at each stage. The latitude and longitude coordinates of each tree were used as the centre point of each aerial image. A patch of 600 × 600 at zoom level 20 covers a large enough area to contain any tree in the Trees In Camden inventory. A 2D Gaussian kernel, centred on the tree's GPS coordinates and extending across the tree's diameter, was used to generate the ground-truth density maps. Tree images along with their density maps will be used to train deep models for tree species classification and localisation; however, this study focuses only on classification. Figure 1 shows some examples of urban tree images with their corresponding ground-truth density maps. The number of data samples in the training set is fairly limited, whereas convolutional neural networks traditionally require a very large number of samples for effective training. To tackle this issue and increase the size and variation of the training data, image augmentation techniques including rotation, width and height shift, horizontal flip, zoom and brightness are used, thus artificially expanding the size of the training set with new, plausible examples, as shown in Fig. 2 [14].
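The ground-truth density map described above can be sketched as a normalised 2D Gaussian on the image grid. This is an illustration under stated assumptions, not the authors' implementation: the tree diameter is taken as already converted to pixels, the relation between diameter and the Gaussian's standard deviation (`sigma_ratio`) is a hypothetical choice, and the map is normalised to sum to one.

```python
import math


def gaussian_density_map(size, centre, diameter_px, sigma_ratio=0.25):
    """Return a size x size grid (list of rows) holding a normalised 2D Gaussian.

    centre      -- (x, y) pixel position of the tree
    diameter_px -- tree diameter in pixels
    sigma_ratio -- hypothetical factor linking diameter to the standard deviation
    """
    cx, cy = centre
    sigma = max(diameter_px * sigma_ratio, 1e-6)
    grid = [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)) for x in range(size)]
        for y in range(size)
    ]
    total = sum(map(sum, grid))
    return [[value / total for value in row] for row in grid]


# Small demonstration grid; a 600 x 600 patch would use size=600, centre=(300, 300).
demo = gaussian_density_map(61, (30, 30), diameter_px=20)
```

Because the image is centred on the tree's coordinates, the peak of the map sits at the centre of the patch and falls off with distance at a rate set by the tree's diameter.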
Fig. 1. Examples of urban tree images with their corresponding ground-truth density maps.
3 Tree Species Classification
Fig. 2. Examples of the augmentation applied to images in the training data subset

The project explores the performance of various state-of-the-art deep convolutional neural networks, including VGG19, ResNet50, DenseNet121 and InceptionV3, in tree species classification. Each model was trained with three different training configurations: fully pre-trained, fine-tuning and training from scratch. In all cases, the top fully connected classification layers are modified to accommodate the 6 tree species of our dataset. All the models in this study are trained and tested on the training, validation and testing sets shown in Fig. 3. This study also investigates the possibility of reliable tree localisation through class activation mapping, which highlights the discriminative region of the image that influenced the deep learning model's decision [34]. Other training parameters, including learning rate, learning rate decay, loss function, batch size and optimiser, are held constant across all models and training configurations. Training samples are shuffled prior to training to avoid possible skew toward a certain class, to ensure a uniform distribution of classes (tree species) across batches, and to maximise the information gain per iteration. We used the categorical cross-entropy loss function across all models in this study, with Adam as the optimiser. The maximum number of epochs is set to 200, and callbacks are implemented to monitor the validation loss and stop training if the loss fails to improve for 10 consecutive epochs. During training, the model with the lowest loss is saved and used as a benchmark for comparison with other models.
3.1 VGG19 Model
VGG19 [28] is arguably one of the most popular CNN models in image classification. This was the chosen model in similar tree species classification studies by Branson, et al. [2] and Lang [15]. Hence, this study adopted the VGG19 model as it is likely to yield desirable results. Three different training configurations including pre-trained, fine-tuning and training-from-scratch are used to train the VGG19 model. In pre-trained (ImageNet) configuration, weights and biases across all convolutional blocks (feature extractors) are frozen while fully connected layers have been reshaped and retrained to accommodate the six tree species of our dataset. In fine-tuning configuration, we have adopted two methods, the first of which unfreezes and retrains 4th and 5th convolutional blocks while keeping the first three blocks frozen. The second method only unfreezes and retrains the last (5th) convolutional block. Similar to pre-trained configuration, fully connected layers have been reshaped and retrained to accommodate the 6 tree species of our dataset. In training-from-scratch configuration, weights and biases across all convolutional and fully connected layers are initialised using Glorot uniform algorithm.
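The four VGG19 configurations differ only in which convolutional blocks are retrained. The sketch below encodes that as a small lookup. The configuration names are hypothetical labels introduced for this example, and the `blockN_` layer-name prefix follows the naming used by the Keras VGG19 implementation, which is assumed here rather than stated in the text.

```python
# Convolutional blocks that are retrained under each configuration.
# The dictionary keys are hypothetical labels for the four set-ups described above.
TRAINABLE_BLOCKS = {
    "pre_trained": set(),             # all five blocks frozen
    "fine_tune_block5": {5},          # only the last block unfrozen
    "fine_tune_blocks4_5": {4, 5},    # first three blocks frozen
    "from_scratch": {1, 2, 3, 4, 5},  # everything trained from a fresh initialisation
}


def is_trainable(layer_name, configuration):
    """Decide whether a layer named like 'block4_conv2' is retrained."""
    if not layer_name.startswith("block"):
        return True  # the fully connected head is retrained in every configuration
    block = int(layer_name[len("block")])
    return block in TRAINABLE_BLOCKS[configuration]
```

In a Keras workflow, a plan like this would typically be applied by setting each layer's `trainable` attribute before compiling the model.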
Fig. 3. Training, validation and testing set counts across the top 6 species in the Camden dataset
Regardless of training configuration, the categorical cross-entropy loss function and Adam optimiser are used to train the VGG19 model.
3.2 ResNet50 Model
ResNet, short for Residual Networks, is a classic neural network used as a backbone for many computer vision tasks. Its fundamental breakthrough was that it made it possible to train extremely deep neural networks, with 150+ layers, without facing problems such as vanishing gradients. ResNet uses skip connections that allow gradients to flow easily from layer to layer and help even the deepest layer receive activations from the top layers. The ResNet50 model was chosen in many similar tree classification studies, including [3,19]. As with VGG19, pre-trained, fine-tuning, and training-from-scratch configurations were used to train the ResNet50 model. In the fine-tuning configuration, all layers prior to conv5_block2_add remained frozen, while the subsequent layers were unfrozen and fine-tuned. The last fully connected dense layer was also reshaped to accommodate the 6 tree species of our dataset. In the training-from-scratch configuration, weights and biases across all convolutional and dense layers are initialised using the Glorot uniform algorithm. In the pre-trained (ImageNet) configuration, we only reshaped and retrained the last dense layer to accommodate the six tree species of our dataset. The categorical cross-entropy loss function and Adam optimiser were used to train the ResNet50 model across all training configurations.
3.3 DenseNet121 Model
The DenseNet model is similar to ResNet, with some structural differences. ResNet uses addition (+) to merge the previous layer (identity) with the future layer, whereas DenseNet concatenates (.) the output of the previous layer with the future layer. DenseNet connects all layers with matching feature-map sizes directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes on its own feature maps to all subsequent layers [10]. DenseNet aims to address vanishing gradients with significantly fewer parameters than ResNet. The DenseNet model was employed in many similar tree classification studies [16,20]. As with the VGG19 and ResNet50 models, pre-trained, fine-tuning, and training-from-scratch configurations are used to train the DenseNet121 model. The pre-trained and training-from-scratch configurations use a similar training strategy and parameters to the previous models, while the fine-tuning configuration unfreezes and fine-tunes the layers subsequent to conv5_block15_concat and reshapes the final dense layer to accommodate the six tree species of our dataset. As with the previous models, the categorical cross-entropy loss function and Adam optimiser are used to train the DenseNet121 model across all training configurations.
3.4 InceptionV3 Model
InceptionV3 is the third evolution of the Inception family of architectures by Google. InceptionV3 mainly focuses on lowering computational cost by modifying the previous Inception architectures. The InceptionV3 model features techniques such as factorised convolutions, regularisation, dimension reduction, and parallelised computations, which set it apart from the competition [29]. Several tree classification studies, including [23,26], employed the InceptionV3 model, which prompted us to explore its efficiency in this research. Similar to the previous models, pre-trained, fine-tuning, and training-from-scratch configurations are used to train InceptionV3. In the fine-tuning configuration, all layers prior to mixed9 remain frozen while the subsequent layers are unfrozen and fine-tuned. In other words, we have attempted to retrain the last Inception module. Moreover, the top fully connected dense layer has been reshaped to accommodate the six tree species of our dataset. The pre-trained and training-from-scratch configurations use the same training strategy and parameters as the previous models. Similar to the previous models, the categorical cross-entropy loss function and the Adam optimiser are used to train the InceptionV3 model across all training configurations. Also, unlike the other models in this study, InceptionV3 has been trained on an input image size of 299 × 299 × 3. Hence, we made the necessary adjustments to the pre-processing and training process to accommodate this input size.
4 Results and Discussion
The training, validation and testing processes are performed using the six tree species, namely Ash, Silver Birch, Common Lime, London Plane, Norway Maple and Sycamore. The VGG19, ResNet50, DenseNet121, and InceptionV3 models have been trained in three different configurations: pre-trained, fine-tuning (FT), and training-from-scratch (TFS). Categorical cross-entropy has been employed as the loss function across all experiments in this study. All models in this study use the Adam optimiser with an initial learning rate of 1e-2 and a scheduled exponential decay to lower the learning rate as the training progresses. A batch size of 32 has been used across all experiments. VGG19, with over 140 million parameters, is the most computationally expensive model in this study and consequently took the longest to train and fine-tune. The VGG19 model with fine-tuned 4th and 5th convolutional blocks achieved an accuracy of 71.84% and an F1-score of 0.626, outperforming the other training configurations of the VGG19 model by a reasonable margin. Freezing the 4th convolutional block led to a slight reduction across the majority of the evaluation metrics. Freezing all convolutional blocks (pre-trained configuration) further reduced the model performance. This indicates that the textural and geometric features of Google Maps aerial images of urban trees are slightly different to those of ImageNet and that fine-tuning can positively impact the model performance. Experimental results also show that training-from-scratch (TFS) consistently underperformed the other configurations, possibly due to the lack of training samples. The performance measures obtained by the VGG19 model are recorded in Table 1. Figure 4 shows the confusion matrices of the different VGG19 training configurations. It appears that, regardless of the training configuration, VGG19 struggles to identify the Ash tree species.

Table 1. Evaluation metrics for different VGG19 training configurations

| Model | Loss | Accuracy (%) | Avg class precision | Avg class recall | F1-score |
|---|---|---|---|---|---|
| VGG-19 (4th, 5th FT) | 1.08 | 71.84 | 0.628 | 0.633 | 0.626 |
| VGG-19 (5th FT) | 1.12 | 69.42 | 0.595 | 0.600 | 0.596 |
| VGG-19 (Pre-trained) | 1.14 | 68.08 | 0.583 | 0.595 | 0.585 |
| VGG-19 (TFS) | 1.21 | 66.99 | 0.565 | 0.583 | 0.570 |
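The metrics reported in the tables can be derived from a confusion matrix along the following lines. This is a minimal sketch: the confusion matrix is a made-up three-class toy example, not data from the study, and averaging the per-class F1 values (macro F1) is an assumption about how the F1-score column was obtained.

```python
def macro_metrics(confusion):
    """Accuracy and macro-averaged precision, recall and F1.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        true_pos = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column sum
        actual = sum(confusion[k])                          # row sum
        p = true_pos / predicted if predicted else 0.0
        r = true_pos / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1_scores.append(f1)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1_scores) / n


# Made-up confusion matrix for three classes (rows: true, columns: predicted).
toy_confusion = [[8, 2, 0],
                 [1, 6, 3],
                 [0, 2, 8]]
accuracy, avg_precision, avg_recall, macro_f1 = macro_metrics(toy_confusion)
print(accuracy, avg_precision, avg_recall, macro_f1)
```

Macro averaging gives every class the same weight regardless of its size, which matters on an imbalanced dataset such as this one.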
Fig. 4. Confusion matrices for different VGG19 training configurations

The ResNet50 model, with just over 25 million parameters, is significantly faster in training and inference. In general, ResNet50 consistently underperformed VGG19 regardless of the training configuration. Just like VGG19, ResNet50's performance peaked in the fine-tuned training configuration, where a maximum accuracy of 68.93% and an F1-score of 0.583 were registered. We believe ResNet50's performance could be further improved by investigating different fine-tuning (freezing/unfreezing) possibilities. Experiments show a significant drop in ResNet50 performance in both the pre-trained and training-from-scratch (TFS) configurations. This is consistent with what has been observed in the VGG19 experiments. We believe insufficient training samples are the main reason for this behaviour. The performance measures obtained by the ResNet50 model are recorded in Table 2. Figure 5 shows the confusion matrices of the different ResNet50 training configurations. It appears that ResNet50 not only struggles with the identification of the Ash tree species but also performs poorly at Silver Birch classification.

Table 2. Evaluation metrics for different ResNet50 training configurations

| Model | Loss | Accuracy (%) | Avg class precision | Avg class recall | F1-score |
|---|---|---|---|---|---|
| ResNet50 (FT) | 1.12 | 68.93 | 0.576 | 0.601 | 0.583 |
| ResNet50 (Pre-trained) | 1.44 | 62.01 | 0.486 | 0.502 | 0.491 |
| ResNet50 (TFS) | 1.49 | 61.77 | 0.486 | 0.503 | 0.493 |

Fig. 5. Confusion matrices for different ResNet50 training configurations
DenseNet121, with just over 8 million parameters, is the lightest and fastest model to train in this research. However, this comes at the cost of performance. At its peak, DenseNet121 achieved an accuracy of 66.55% and an F1-score of 0.56, which is considerably lower than its counterparts in this study. Just like ResNet50 and VGG19, the highest performance was observed under the fine-tuning configuration. DenseNet121 under the training-from-scratch (TFS) configuration registered the lowest accuracy (59.95%) and F1-score (0.461) across all the experiments in this study. The performance measures obtained by the DenseNet121 model are shown in Table 3. Figure 6 shows the confusion matrices of the different DenseNet121 training configurations. Similar to ResNet50, DenseNet121 struggles with the classification of the Ash and Silver Birch tree species.

Table 3. Evaluation metrics for different DenseNet121 training configurations

| Model | Loss | Accuracy (%) | Avg class precision | Avg class recall | F1-score |
|---|---|---|---|---|---|
| DenseNet121 (FT) | 1.24 | 66.55 | 0.548 | 0.570 | 0.560 |
| DenseNet121 (Pre-trained) | 1.59 | 60.32 | 0.470 | 0.473 | 0.468 |
| DenseNet121 (TFS) | 1.67 | 59.95 | 0.465 | 0.463 | 0.461 |
Last but not least, the InceptionV3 model, with over 23 million parameters, is the second fastest model in this study. However, unlike DenseNet121 (the fastest model in this study), its speed comes with no performance penalty. The InceptionV3 model achieved an accuracy of 73.54% and an F1-score of 0.646, the highest recorded across all the experiments in this study. We believe InceptionV3's performance could be further improved by investigating different fine-tuning (freezing/unfreezing) possibilities. The performance measures obtained by the InceptionV3 model are shown in Table 4. Figure 7 shows the confusion matrices of the different InceptionV3 training configurations. A noticeable improvement in classification can be observed across all tree species but, similar to the other models in this study, InceptionV3 struggles to separate the Ash and London Plane species.

Table 4. Evaluation metrics for different InceptionV3 training configurations

| Model | Loss | Accuracy (%) | Avg class precision | Avg class recall | F1-score |
|---|---|---|---|---|---|
| InceptionV3 (FT) | 0.96 | 73.54 | 0.650 | 0.650 | 0.646 |
| InceptionV3 (Pre-trained) | 1.09 | 69.90 | 0.596 | 0.595 | 0.595 |
| InceptionV3 (TFS) | 1.20 | 66.14 | 0.540 | 0.550 | 0.548 |
Fig. 6. Confusion matrices for different DenseNet121 training configurations
Fig. 7. Confusion matrices for different InceptionV3 training configurations

A deeper investigation into the models' training behaviour shows that VGG19 suffers from a considerable amount of overfitting. Because VGG19 has a huge number of parameters (140 million), training with small datasets like Trees In Camden leads to issues such as overfitting. This can be slightly mitigated by introducing regularisation and dropout into the model. In general, we have found that the reported performance measures across all experiments were affected by insufficient training samples. One possible solution is to combine images from other repositories such as Pasadena Urban Trees [31]. An in-depth investigation into other possible fine-tuning configurations could also mitigate this issue. It is worth mentioning that some images in our dataset may contain more than one tree if they are situated close together and, depending upon the accuracy of the location data, the label assigned to a tree may not necessarily be accurate. Although the Camden tree inventory contains a large amount of detailed data, improvements could be made to ensure that common name species labels are correct. For example, Maple - Crimson King Norway is a separate category to Maple - Norway. Combining these would increase the number of images in the Maple - Norway class by 9%. Similarly, Wier-maple and Maple-Silver are distinct categories; however, the Wier-maple is a type of silver maple, so grouping these together would triple the size of this class. Also, we have found that the imbalanced nature of our dataset adversely impacted the results. Adding class weights to account for the data imbalance could be a possible mitigation. Further research into handling imbalanced data could be conducted to reduce the bias towards the largest class (London Plane). One such method would be to over- or under-sample the training images to balance the dataset [11].
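As an illustration of the class-weighting idea mentioned above, the sketch below computes inverse-frequency weights with the common heuristic weight = N / (K · n_c), where N is the total number of samples, K the number of classes and n_c the size of class c (the same rule as scikit-learn's "balanced" mode). The per-class image counts are made up for the example and are not the counts of the Camden dataset.

```python
def balanced_class_weights(counts):
    """Inverse-frequency class weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


# Hypothetical image counts per species (illustrative only).
counts = {"London Plane": 600, "Silver Birch": 300, "Ash": 100}
weights = balanced_class_weights(counts)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

With these weights, each class contributes the same total weight to the loss, so errors on rare species are penalised more heavily than errors on the dominant class.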
5 Conclusion
Tree detection and species classification using aerial or satellite imagery has been an inherently expensive and time-consuming task. This research examined the possibility of urban-tree detection and species classification using Google Maps aerial images and publicly available tree inventories such as Trees In Camden to supply GPS coordinates and tree species information. This can significantly reduce the cost of surveying and data collection and, overall, helps to enable effective forest and urban tree management. The work involved investigating several state-of-the-art deep convolutional neural network models, including VGG19, ResNet50, DenseNet121 and InceptionV3, in three different training configurations: fully pre-trained, fine-tuning and training from scratch. Results show that a fine-tuned InceptionV3 model is able to classify up to six different species with over 73% accuracy and a 0.646 F1-score. While this is far from an ideal solution, this study shows the possibility of urban-tree species classification using free and publicly available Google Maps images. Future work, such as investigating other popular models such as AlexNet, InceptionResNetV2 and Xception, or other possible fine-tuning configurations, is likely to improve the metrics.
References

1. Baeten, L., Bruelheide, H.: Identifying the tree species compositions that maximize ecosystem functioning in European forests. J. Appl. Ecol. 56(3), 733–744 (2018)
2. Branson, S., Wegner, J.D., Hall, D., Lang, N., Schindler, K., Perona, P.: From Google Maps to a fine-grained catalog of street trees. ISPRS J. Photogramm. Remote. Sens. 135, 13–30 (2018)
3. Cao, K., Zhang, X.: An improved Res-UNet model for tree species classification using airborne high-resolution images. Remote Sens. 12(7), 1128 (2020)
4. Chandler, K., Stevens, C., Binley, A., Keith, A.: Influence of tree species and forest land use on soil hydraulic conductivity and implications for surface runoff generation. Geoderma 310, 120–127 (2017)
5. London Borough of Camden Council. Trees in Camden: open data portal (May 2021)
6. Dalponte, M., Frizzera, L., Gianelle, D.: Individual tree crown delineation and tree species classification with hyperspectral and lidar data. PeerJ 6, e6227 (2019)
7. Donovan, G.H., Landry, S., Winter, C.: Urban trees, house price, and redevelopment pressure in Tampa, Florida. Urban Forest. Urban Greening 38, 330–336 (2019)
8. Fricker, G.A., Ventura, J.D., Wolf, J.A., North, M.P., Davis, F.W., Franklin, J.: A convolutional neural network classifier identifies tree species in mixed-conifer forest from hyperspectral imagery. Remote Sens. 11(19), 2326 (2019)
9. Gamfeldt, L., et al.: Higher levels of multiple ecosystem services are found in forests with more tree species. Nat. Commun. 4(1), 1340 (2013). https://doi.org/10.1038/ncomms2328
10. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
11. Van Hulse, J., Khoshgoftaar, T.M., Napolitano, A.: Experimental perspectives on learning from imbalanced data. In: Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pp. 935–942 (2007). https://doi.org/10.1145/1273496.1273614
12. Kim, S., Schreuder, G., Mcgaughey, R., Andersen, H.E.: Individual tree species identification using LiDAR intensity data. In: ASPRS 2008 Annual Conference, Portland (2008)
13. Koch, B., Heyder, U., Weinacker, H.: Detection of individual tree crowns in airborne LiDAR data. Photogram. Eng. Remote Sens. 72(4), 357–363 (2006)
14. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
15. Lang, N.: Deep learning and google maps for tree monitoring (2020)
16. Li, H., Hu, B., Li, Q., Jing, L.: CNN-based tree species classification using airborne LiDAR data and high-resolution satellite image. In: 2020 IEEE International Geoscience and Remote Sensing Symposium, IGARSS 2020, pp. 2679–2682. IEEE (2020)
17. Manickathan, L., Defraeye, T., Allegrini, J., Derome, D., Carmeliet, J.: Parametric study of the influence of environmental factors and tree properties on the transpirative cooling effect of trees. Agric. For. Meteorol. 248, 259–274 (2017)
18. Maschler, J., Atzberger, C., Immitzer, M.: Individual tree crown segmentation and classification of 13 tree species using airborne hyperspectral data. Remote Sens. 10(8), 1218 (2018)
19. Natesan, S., Armenakis, C., Vepakomma, U.: ResNet-based tree species classification using UAV images. Int. Arch. Photogramm. Remote. Sens. Spat. Inf. Sci. XLII–2/W13, 475–481 (2019)
20. Natesan, S., Armenakis, C., Vepakomma, U.: Individual tree species identification using dense convolutional network (DenseNet) on multitemporal RGB images from UAV. J. Unmanned Veh. Syst. 8(4), 310–333 (2020)
21. Nezami, S., Khoramshahi, E., Nevalainen, O., Pölönen, I., Honkavaara, E.: Tree species classification of drone hyperspectral and RGB imagery with deep learning convolutional neural networks. Remote Sens. 12(7), 1070 (2020)
22. Nilsson, M.: Estimation of tree heights and stand volume using an airborne LiDAR system. Remote Sens. Environ. 56(1), 1–7 (1996)
23. Onishi, M., Ise, T.: Automatic classification of trees using a UAV onboard camera and deep learning. arXiv preprint arXiv:1804.10390 (2018)
24. Rezatec: Satellites vs. Lidar for forestry management? (2020)
25. Rust, S.: Tree inventory, risk assessment and management. In: Roloff, A. (ed.) Urban Tree Management: For the Sustainable Development of Green Cities, pp. 178–210. Wiley, Gottingen (2016)
26. Safonova, A., Tabik, S., Alcaraz-Segura, D., Rubtsov, A., Maglinets, Y., Herrera, F.: Detection of fir trees (Abies sibirica) damaged by the bark beetle in unmanned aerial vehicle images with deep learning. Remote Sens. 11(6), 643 (2019)
27. Saheer, L.B., Shahawy, M.: Self-supervised approach for urban tree recognition on aerial images. In: Proceedings of the 17th International Conference on Artificial Intelligence Applications and Innovations (2021)
28. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014)
29. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
30. Wegner, J.D.: Cataloging public objects using aerial and street-level images - urban trees. Accessed 1 May 2020
31. Wegner, J.D., Branson, S., Hall, D., Schindler, K., Perona, P.: Cataloging public objects using aerial and street-level images - urban trees. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6014–6023 (2016)
32. Wilkes, P., Disney, M., Vicari, M.B., Calders, K., Burt, A.: Estimating urban above ground biomass with multi-scale LiDAR. Carbon Balance Manage. 13(1), 1–20 (2018). https://doi.org/10.1186/s13021-018-0098-0
33. Wolf, K.L.: Business district streetscapes, trees, and consumer response. J. Forest. 103(8), 396–400 (2005)
34. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016)
Rectifying Homographies for Stereo Vision: Analytical Solution for Minimal Distortion

Pasquale Lafiosca1(B) and Marta Ceccaroni2

1 Integrated Vehicle Health Management Centre, Cranfield University, Bedfordshire, UK
[email protected]
2 School of Aerospace, Cranfield University, Bedfordshire, UK
Abstract. Stereo rectification is the determination of two image transformations (or homographies) that map corresponding points on the two images, projections of the same point in the 3D space, onto the same horizontal line in the transformed images. Rectification is used to simplify the subsequent stereo correspondence problem and speed up the matching process. Rectifying transformations, in general, introduce perspective distortion on the obtained images, which should be minimised to improve the accuracy of the subsequent algorithm dealing with the stereo correspondence problem. The search for the optimal transformations is usually carried out relying on numerical optimisation. This work proposes a closed-form solution for the rectifying homographies that minimise perspective distortion. The experimental comparison confirms its capability to solve the convergence issues of the previous formulation. Its Python implementation is provided.

Keywords: Stereo vision · Stereo image processing · Image rectification · Epipolar geometry

1 Introduction
Stereo vision has gained a prominent role among Computer Vision technologies as it allows machines to perceive depth. Thanks to the pinhole camera model, calibration and epipolar geometry, a considerable simplification can be performed before attempting to solve the stereo correspondence problem. Among these simplifications, rectification is almost always conducted to obtain horizontal and aligned epipolar lines, so that the subsequent stereo matching algorithm can work along the x-axis only, with a significant performance increase. However, rectification transformations generally introduce distortion, which, in turn, impairs the performance of the subsequent stereo matching algorithm. For this reason, minimal-distortion algorithms are proposed in the literature. In this paper the closed-form solution of the rectifying homographies that minimise perspective distortion is derived by means of the metric introduced by Loop and Zhang [12]. This is applied to a calibrated stereo rig, where all the intrinsic and extrinsic parameters are known. The formulation of the minimising solution presented here is enabled by a new geometrical interpretation of the problem, which simplifies the expression of the distortion metric. The solution found is general and, therefore, valid for every relative position of a given stereo rig, even in extreme configurations and when the previous algorithm [12] fails to provide an initial guess for minimisation (see Sect. 5). Furthermore, the proposed method removes the need for optimisation libraries and is thus computationally more efficient, as it avoids numerical minimisation loops.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 484–503, 2022. https://doi.org/10.1007/978-3-031-10464-0_33

1.1 Background
Emulating the human vision system, stereo vision derives 3D information by comparison of 2D digital images taken from two distinct camera positions, usually known from calibration [22]. Given a stereo image pair, the third coordinate of each object in the scene is extracted by first solving the stereo correspondence problem [8,10], namely the problem of locating the projections of that object in the two images, and then applying triangulation [2] to recover the position of the object in the 3D world. Taking advantage of the geometric relations between the 3D points and their projections onto the 2D images, the so-called epipolar constraint, the stereo correspondence problem can be reduced to one dimension. This means that the search for each pair of corresponding points is carried out along one, usually oblique, epipolar line in each picture. To further simplify the problem, image rectifications can be applied, transforming the stereo correspondence problem into a search along a horizontal line, with a consequent, significant improvement in efficiency. Image rectifications are a family of 2D transformations that can be applied to a pair of non-coplanar stereo images to re-project them onto a rectifying plane such that, on the transformed images, corresponding points lie on the same horizontal line. The drawback of rectification is that it introduces a certain amount of distortion in the resulting images, which decreases the accuracy of the subsequent stereo matching algorithms. Hence the importance of identifying, among the rectifying transformations in the family, the ones that minimise distortion. Hereafter, in continuity with the previous algorithm [12], we will refer to perspective distortion simply as distortion. This paper makes the rectifications minimising distortion explicit, finding their analytic formulation via a geometrical derivation. The basic notation and metric are given in Sect. 2. Our novelties, the geometrical interpretation and derivation of the closed-form solution, are reported in Sect. 3. The main steps of the algorithm are listed in Sect. 4, followed by a discussion and examples in Sect. 5. Conclusions follow in Sect. 6. The Python 3 implementation of the algorithm is also provided.
1.2 Previous Work
Several algorithms for stereo image rectification are available in the literature. Among the most widely accepted and applied, the one by Fusiello et al. [5] essentially fixes the rectifying plane using the sole orientation of one of the two cameras. Despite being simple and compact, this algorithm is based on an arbitrary, non-optimal choice of one rectifying plane (see Sect. 5.1). For highly skewed configurations of the cameras, this choice introduces a massive amount of distortion in the transformed images, which worsens the performance of the algorithm dealing with the stereo correspondence. The importance of minimising the distortion introduced by the rectifying transformations is highlighted by Loop and Zhang [12]. In their work, the authors decompose rectifying homographies as a combination of a perspective, a similarity and a shearing sub-transformation, and minimise the global distortion. A metric for measuring perspective distortion is thus presented. Finally, using an iterative, numerical procedure, the pair of homographies introducing the minimum perspective distortion is found. However, the initial guess suggested there cannot always be found [17]. With the aim of minimising distortion at pixel level, a different distortion metric was proposed by Gluckman and Nayar [7], making use of the Jacobian of the transformation to track changes in local image areas. This method proposes an approximate solution or requires a complex iterative procedure to obtain the optimal one. A first attempt at a closed-form solution was proposed by Sun [18]. Here the rectification transformations are readily found by estimating the fundamental matrix. However, it leads to a particular solution, without the criterion of reducing perspective distortion. Lately the attention of researchers has been focused on strategies for rectifying uncalibrated stereo rigs [3,4,11,15,16]. These methods do not require a calibration process and try to estimate camera parameters using the scene itself. For example, a handy solution for uncalibrated dual lens cameras [16] relies on key-points that have to be matched in the images. This approach is based on the very limiting assumption of a small drift between the camera poses. Obviously this does not apply to general camera poses, like verging cameras, and calibration is still important to reach high accuracy and efficiency levels. Different approaches focus on rectifying fisheye images [1,21,23]. Uncalibrated and fisheye rectification algorithms are outside the scope of this article. To date, the vast majority of stereo vision systems employ calibration, which allows for a metric reconstruction with the subsequent triangulation.
2 Setting the Problem
This Section summarises the work of Loop and Zhang [12] to provide the basis for the following closed-form solution.
2.1 Pinhole Cameras and Epipolar Geometry
Let us consider a calibrated stereo rig composed of two pinhole cameras, where distortion caused by lens imperfections has already been corrected. As most definitions hereafter will be analogous for both cameras, we will define them only for one camera (definitions for the second camera will be obtained by replacing the subscript with 2). Let $A_1 \in \mathbb{R}^{3\times3}$ be the intrinsic matrix of the left camera, with $R_1 \in \mathbb{R}^{3\times3}$ and $t_1 \in \mathbb{R}^3$ its rotation matrix and translation vector, respectively, describing the position of the first Camera Coordinate System (CCS) with respect to the World Coordinate System (WCS), as represented in Fig. 1, with a slight abuse of notation for axes names to ease visualisation.
Fig. 1. Representation of coordinate systems. CCSs are in red, dot-dashed. ICSs are in blue. WCS is in grey and the baseline is dashed black.
Call $o_1 = -R_1^{-1} \cdot t_1$ the position of the optical center in the WCS. Hereafter, unless otherwise stated, elements are expressed in WCS. The baseline is the vector $b = o_2 - o_1$ going from the first to the second camera center. Additionally, an Image Coordinate System (ICS) is defined on the image plane of each camera, where the left image $I_1$ forms (located at $z = 1$ in CCS), with the x and y axes parallel to the respective CCS ones and origin in the upper left corner of the image, corresponding to $\left[-\frac{[A_1]_{13}}{[A_1]_{11}} \;\; -\frac{[A_1]_{23}}{[A_1]_{22}} \;\; 1\right]$ in CCS. The notation $[A_1]_{ij}$ indicates the $(i, j)$ element of $A_1$. The ICS system is shown in Fig. 1 as well. Given two corresponding image points $p_1 \in I_1$ and $p_2 \in I_2$, each one in its ICS and expressed in homogeneous coordinates (i.e. $p_1 \in \mathbb{R}^3$, with unitary 3rd coordinate), the epipolar geometry constraint is defined as:

$$p_2^T \cdot F \cdot p_1 = 0 \tag{1}$$

where $F$ is the fundamental matrix, a $3 \times 3$ matrix of rank 2 that can only be determined up to a scale factor [8,14], assumed as known. $F$ maps a point on $I_1$ to a line on $I_2$, and vice versa (using its transpose). Given that we work in a projective space, all the points are defined up to an arbitrary scaling factor. On each image all the epipolar lines intersect in a single point called the epipole. Let $e_1 \in I_1$ and $e_2 \in I_2$ be the two epipoles in homogeneous ICS; they can be defined as the left and right kernels of $F$, so that:

$$F \cdot e_1 = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}^T = F^T \cdot e_2 \tag{2}$$
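Equation (2) can be checked numerically with a small sketch. The matrix below is a made-up rank-2 example, not a fundamental matrix from the paper, and the helper names are hypothetical. For a rank-2 matrix the null vector is parallel to the cross product of any two independent rows, which avoids the need for a linear algebra library.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def null_vector(M):
    """Null vector of a rank-2 3x3 matrix, scaled to unit third coordinate.

    Every row of M is orthogonal to the null vector, so the cross product of
    two independent rows is parallel to it; the longest candidate is kept.
    Assumes a finite epipole (non-zero third coordinate).
    """
    candidates = [cross(M[0], M[1]), cross(M[0], M[2]), cross(M[1], M[2])]
    best = max(candidates, key=lambda v: sum(c * c for c in v))
    return [c / best[2] for c in best]


# Made-up rank-2 matrix standing in for a fundamental matrix.
F = [[0.0, -1.0, 1.0],
     [1.0, 0.0, 1.0],
     [-2.0, 1.0, -3.0]]

e1 = null_vector(F)             # F . e1 = 0
e2 = null_vector(transpose(F))  # F^T . e2 = 0

# The epipolar line of any point p1 passes through the epipole e2.
p1 = [3.0, 4.0, 1.0]
line2 = matvec(F, p1)
through_epipole = sum(l * e for l, e in zip(line2, e2))
print(e1, e2, through_epipole)
```

The last check illustrates why all epipolar lines of an image meet in the epipole: $e_2^T \cdot F \cdot p_1 = 0$ for every $p_1$.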
Geometrically, $e_1$ is the projection of $o_2$ on the image $I_1$, and similarly for $e_2$. It is worth noticing that each pair of corresponding epipolar lines lies on the same plane together with the baseline.

2.2 Rectification
Rectification is the problem of determining two homographies that map corresponding epipolar lines onto parallel horizontal lines sharing the same y-coordinate (i.e. sending the epipoles to ∞). Thus, rectified images must have the new fundamental matrix in the form [8]:

$$\bar{F} = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & -1 \\ 0 & 1 & 0 \end{bmatrix} \tag{3}$$

Defining the rectified points $\bar{p}_1 = H_1 \cdot p_1$ and $\bar{p}_2 = H_2 \cdot p_2$, then, from Eq. (1), we get:

$$p_2^T \cdot F \cdot p_1 = \bar{p}_2^T \cdot H_2^{-T} \cdot F \cdot H_1^{-1} \cdot \bar{p}_1 = \bar{p}_2^T \cdot \bar{F} \cdot \bar{p}_1 = 0 \tag{4}$$

where $H_1$ and $H_2$ are the left and right sought-after homographies, respectively.

2.3 Perspective Distortion
Following the original procedure [12], given a generic homography $H_1$, we scale it by its last element and decompose it as:

$$H_1 = \begin{bmatrix} u_{1a} & u_{1b} & u_{1c} \\ v_{1a} & v_{1b} & v_{1c} \\ w_{1a} & w_{1b} & 1 \end{bmatrix} = H_{1a} \cdot H_{1p} \tag{5}$$

where $H_{1a}$ is an affine transformation and $H_{1p}$ is a purely perspective transformation in the form:

$$H_{1p} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ w_{1a} & w_{1b} & 1 \end{bmatrix} \tag{6}$$
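As a numerical check of the decomposition in Eq. (5), the sketch below recovers the affine factor as $H_{1a} = H_1 \cdot H_{1p}^{-1}$ and verifies that its last row is $[0\ 0\ 1]$. The homography values are arbitrary and the function names are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def decompose(H):
    """Split H (scaled so that H[2][2] = 1) into affine and perspective parts."""
    H = [[h / H[2][2] for h in row] for row in H]
    wa, wb = H[2][0], H[2][1]
    Hp = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [wa, wb, 1.0]]
    # Inverse of the purely perspective factor: the signs of wa, wb flip.
    Hp_inv = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-wa, -wb, 1.0]]
    Ha = matmul(H, Hp_inv)
    return Ha, Hp


# Arbitrary homography used only for illustration.
H = [[2.0, 0.5, 4.0],
     [0.25, 1.5, -2.0],
     [0.5, 0.25, 1.0]]
Ha, Hp = decompose(H)
recomposed = matmul(Ha, Hp)
print(Ha)
```

Since the last row of the affine factor carries no perspective terms, all perspective distortion is concentrated in the two parameters of the perspective factor.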
$H_{1p}$ is solely responsible for introducing perspective distortion. The affine transformation will then be:

$$H_{1a} = \begin{bmatrix} u_{1a} - u_{1c} w_{1a} & u_{1b} - u_{1c} w_{1b} & u_{1c} \\ v_{1a} - v_{1c} w_{1a} & v_{1b} - v_{1c} w_{1b} & v_{1c} \\ 0 & 0 & 1 \end{bmatrix} \tag{7}$$

$H_{1a}$ can be decomposed further into a shearing transformation followed by a similarity transformation [12]. The same decomposition is derived for $H_2$. Let $p_1 = [x_1 \;\; y_1 \;\; 1]^T$ be a generic point on $I_1$, in homogeneous ICS; then the perspective transformed point $\tilde{p}_1$ is:

$$\tilde{p}_1 = H_{1p} \cdot p_1 = \begin{bmatrix} x_1 \\ y_1 \\ w_{1a} x_1 + w_{1b} y_1 + 1 \end{bmatrix} = \begin{bmatrix} x_1 \\ y_1 \\ w_1^T \cdot p_1 \end{bmatrix} \propto \begin{bmatrix} \frac{x_1}{w_1^T \cdot p_1} \\ \frac{y_1}{w_1^T \cdot p_1} \\ 1 \end{bmatrix} \tag{8}$$

where $w_1 = [w_{1a} \;\; w_{1b} \;\; 1]^T$. The codomain of $I_1$ through $H_{1p}$ must be understood as including the point at infinity, as per the hyperplane model defined to handle perspective transformations [6]. Notice that if $w_{1a} = w_{1b} = 0$ there is no perspective component in the rectifying transformation, and $H_1$ is a purely affine transformation. If this is also true for $H_2$, then the stereo rig is perfectly fronto-parallel and the image pair is already rectified. However, this does not happen in general.

2.4 Distortion Metric
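We refer to the distortion metric introduced by Loop and Zhang [12]. The aim is to select $H_1$ and $H_2$ to be "as affine as possible", meaning that the elements $w_{1a}$, $w_{1b}$, $w_{2a}$ and $w_{2b}$ should be chosen so as to introduce less distortion. Taking the average point $p_{1c} = \frac{1}{n_1} \sum_{i=1}^{n_1} p_{1i}$ as reference, with $n_1$ the total number of pixels of $I_1$, the amount of distortion on $I_1$ is defined as:

$$\sum_{i=1}^{n_1} \left[ \frac{w_1^T \cdot (p_{1i} - p_{1c})}{w_1^T \cdot p_{1c}} \right]^2 \tag{9}$$

The goal is to find the global minimum of the sum of the distortion of both images:

$$\sum_{i=1}^{n_1} \left[ \frac{w_1^T \cdot (p_{1i} - p_{1c})}{w_1^T \cdot p_{1c}} \right]^2 + \sum_{i=1}^{n_2} \left[ \frac{w_2^T \cdot (p_{2i} - p_{2c})}{w_2^T \cdot p_{2c}} \right]^2 \tag{10}$$

Then (10) can be rewritten in matrix form as:

$$\frac{w_1^T \cdot P_1 \cdot P_1^T \cdot w_1}{w_1^T \cdot P_{c1} \cdot P_{c1}^T \cdot w_1} + \frac{w_2^T \cdot P_2 \cdot P_2^T \cdot w_2}{w_2^T \cdot P_{c2} \cdot P_{c2}^T \cdot w_2} \tag{11}$$

where:

$$P_1 \cdot P_1^T = \frac{wh}{12} \begin{bmatrix} w^2 - 1 & 0 & 0 \\ 0 & h^2 - 1 & 0 \\ 0 & 0 & 0 \end{bmatrix} \tag{12}$$

and:

$$P_{c1} \cdot P_{c1}^T = \begin{bmatrix} \frac{(w-1)^2}{4} & \frac{(w-1)(h-1)}{4} & \frac{(w-1)}{2} \\ \frac{(w-1)(h-1)}{4} & \frac{(h-1)^2}{4} & \frac{(h-1)}{2} \\ \frac{(w-1)}{2} & \frac{(h-1)}{2} & 1 \end{bmatrix} \tag{13}$$

where $w$ and $h$ are the pixel width and height of $I_1$. The same applies to $I_2$.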
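The equivalence between the pixel-wise sum of Eq. (9) and the matrix form of Eqs. (11) to (13) can be checked on a small image. The sketch below assumes pixel coordinates running from 0 to w−1 and 0 to h−1 (consistent with the average point implied by Eq. (13)); the image size and the perspective parameters are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def distortion_bruteforce(wv, width, height):
    """Eq. (9): sum over all pixels of the squared relative deviation."""
    xc, yc = (width - 1) / 2, (height - 1) / 2  # average point p_c
    denominator = (wv[0] * xc + wv[1] * yc + wv[2]) ** 2
    numerator = sum((wv[0] * (x - xc) + wv[1] * (y - yc)) ** 2
                    for x in range(width) for y in range(height))
    return numerator / denominator


def distortion_closed_form(wv, width, height):
    """Single-image term of Eq. (11), using the matrices of Eqs. (12)-(13)."""
    wa, wb, wc = wv
    numerator = width * height / 12 * ((width ** 2 - 1) * wa ** 2
                                       + (height ** 2 - 1) * wb ** 2)
    denominator = ((width - 1) / 2 * wa + (height - 1) / 2 * wb + wc) ** 2
    return numerator / denominator


# Arbitrary perspective parameters (w_a, w_b, 1) and a tiny 8 x 6 image.
wv = (0.001, 0.002, 1.0)
brute = distortion_bruteforce(wv, 8, 6)
closed = distortion_closed_form(wv, 8, 6)
print(brute, closed)
```

The closed form replaces a sum over all pixels with a handful of arithmetic operations, which is what makes an analytical treatment of the minimisation tractable.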
3 Geometric Interpretation and Analytical Derivation of the Minimum
A geometric interpretation of the family of rectifying homographies is introduced hereafter, which paves the way to finding the analytical formulation for the global minimum of Eq. (10) and to building the corresponding rectifying homographies that minimise perspective distortion.

3.1 Geometric Interpretation
It is known that finding rectifying transformations can alternatively be seen as determining a new common orientation for a pair of novel virtual cameras projecting the images on the same principal plane, such that epipolar lines become horizontal (Fig. 2). In this subsection we will demonstrate that this can only be achieved if the $\hat{x}$-axis of the new common orientation is chosen parallel to the baseline, while the $\hat{z}$-axis can be arbitrarily chosen, thus generating the full rectifying family. The new common orientation will thus be defined as:

$$R_{new} = \begin{bmatrix} \hat{x}^T \\ \hat{y}^T \\ \hat{z}^T \end{bmatrix} \tag{14}$$

where $\hat{x} = \frac{b}{\lVert b \rVert}$ (the versor of the baseline) while $\hat{z}$ and $\hat{y}$ can be chosen accordingly to form a Cartesian coordinate system.
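One way to build an orientation of the form of Eq. (14) is sketched below. The baseline values and the `z_hint` argument (a preferred viewing direction that is projected orthogonally to the baseline) are hypothetical choices made for illustration; the paper's own criterion for choosing the $\hat{z}$-axis is not reproduced here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def new_orientation(baseline, z_hint):
    """Rows are x_hat, y_hat, z_hat, with x_hat parallel to the baseline.

    z_hint must not be parallel to the baseline; z_hat is the normalised
    component of z_hint orthogonal to the baseline.
    """
    x_hat = unit(baseline)
    y_hat = unit(cross(z_hint, x_hat))
    z_hat = cross(x_hat, y_hat)
    return [x_hat, y_hat, z_hat]


b = [1.0, 0.2, -0.1]  # arbitrary baseline
R_new = new_orientation(b, [0.0, 0.0, 1.0])
rotated_b = [dot(row, b) for row in R_new]
print(rotated_b)
```

In the new frame the baseline has coordinates $(\lVert b \rVert, 0, 0)$, which is the property required by the geometric interpretation.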
Rectifying Homographies: Analytical Solution
Fig. 2. Common orientation of the virtual camera pair (red), projecting on a common plane (gray). $\hat{x}$ is parallel to the baseline $b$ (black, dashed). The corresponding epipolar lines (blue, green) and $\hat{z}$ are identified by $\mathbf{y}_1$ (magenta). Rectified epipolar lines are green and blue, dashed.
Each homography will have to cancel the corresponding intrinsic camera matrix and camera rotation, re-orient the camera with the new chosen orientation and apply another affine transformation, namely:

$$\begin{aligned} H_1 &= K_1 \cdot R_{new} \cdot (A_1 \cdot R_1)^{-1} \\ H_2 &= K_2 \cdot R_{new} \cdot (A_2 \cdot R_2)^{-1} \end{aligned} \qquad (15)$$

where $K_1, K_2 \in \mathbb{R}^{3\times 3}$ are arbitrary affine transformations. The geometric interpretation above is based on the following Theorem.
Theorem 1. The epipolar lines of two cameras are corresponding horizontal lines if and only if the two cameras share the same orientation, with the $\hat{x}$-axis parallel to their baseline.

To prove this theorem we first introduce two Lemmas.

Lemma 1. The fundamental matrix $F$, with an abuse of notation, can be written as:

$$F \propto G \times H = \begin{bmatrix} [G]_2 [H]_{3*} - [G]_3 [H]_{2*} \\ [G]_3 [H]_{1*} - [G]_1 [H]_{3*} \\ [G]_1 [H]_{2*} - [G]_2 [H]_{1*} \end{bmatrix} \qquad (16)$$

where $G = A_2 \cdot R_2 \cdot b \in \mathbb{R}^3$ and $H = A_2 \cdot R_2 \cdot (A_1 \cdot R_1)^{-1} \in \mathbb{R}^{3\times 3}$. The $i$th row of a matrix is denoted as $[\ ]_{i*}$.
Proof (Lemma 1): let $X \in \mathbb{R}^3$ be a point in WCS and $x_1 \in I_1$ and $x_2 \in I_2$ its projections in ICS, then $X$ is a non-trivial solution of the linear system [19]:

$$\begin{cases} A_1 \cdot (R_1 \cdot X + t_1) = \lambda x_1 \\ A_2 \cdot (R_2 \cdot X + t_2) = \mu x_2 \end{cases} \qquad (17)$$

where $\lambda, \mu \in \mathbb{R}$. Setting $\tilde{X} = A_1 \cdot (R_1 \cdot X + t_1)$, yields:

$$\begin{cases} I_3 \cdot \tilde{X} = \lambda x_1 \\ H \cdot \tilde{X} - G = \mu x_2 \end{cases} \qquad (18)$$

where $I_3 \in \mathbb{R}^{3\times 3}$ is the identity matrix. Therefore:

$$\mathrm{Det} \begin{bmatrix} I_3 & 0 & x_1 & 0 \\ H & G & 0 & x_2 \end{bmatrix} = 0 \qquad (19)$$

which is linear in both $x_1$ and $x_2$ and can thus be rewritten as:

$$x_2^T \cdot F \cdot x_1 = 0 \qquad (20)$$

with:

$$[F]_{ij} = (-1)^{i+j}\, \mathrm{Det}(\hat{Q}_{i,j}), \quad \forall i, j = 1, 2, 3 \qquad (21)$$

where $\hat{Q}_{i,j} \in \mathbb{R}^{4\times 4}$ is equal to the matrix $Q = \begin{bmatrix} I_3 & 0 \\ H & G \end{bmatrix} \in \mathbb{R}^{6\times 4}$ with the $i$th and $(3+j)$th rows dropped. Furthermore:

$$[F]_{ij} = (-1)^{i+1}\, \mathrm{Det}(\bar{S}_{i,j}) \qquad (22)$$

with the first column of $\bar{S}_{i,j} \in \mathbb{R}^{2\times 2}$ equal to the $j$th column of the matrix $H$ with the $i$th row dropped, and the second column equal to $G$ with the $i$th element dropped. Then Eq. (22) is equivalent to Eq. (16). ∎

Remark: it can be shown that the matrix $H$ is the matrix representing the axes of the first CCS as seen from the second CCS and that $e_2 = \frac{G}{\|G\|}$. Therefore the fundamental matrix can be expressed as $F \propto e_2 \times H$, similarly to Eq. (16).

Lemma 2. The epipolar lines of the second camera are horizontal if and only if the x-axis of the second camera is parallel to the baseline.

Proof (Lemma 2): by the definition of the fundamental matrix (see Sect. 2.1), the epipolar lines of the second camera are horizontal if and only if the first row of the fundamental matrix $F$ is null. Using Lemma 1 the first row of $F$ can be expressed as:

$$[G]_2 [H]_{3*} - [G]_3 [H]_{2*} \qquad (23)$$

From the remark above, the rows of $H$ form a basis of linearly independent vectors, therefore the only linear combination that can set the formula (23) to zero is the trivial one, with $[G]_2 = [G]_3 = 0$, which is equivalent to stating that the x-axis of the second camera is parallel to the baseline. ∎
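Lemma 1 and Lemma 2 can be checked numerically with a short sketch. The code below assumes NumPy, the projection model of Eq. (17), and a baseline taken as the vector between the two optical centres; the helper names (`skew`, `fundamental_from_cameras`, `project`) are hypothetical. The cross product of Eq. (16) is written as a skew-symmetric matrix multiplying $H$, which gives the same three rows.

```python
import numpy as np

def skew(v):
    """Matrix [v]x such that skew(v) @ u equals np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_cameras(A1, R1, t1, A2, R2, t2):
    """F proportional to G x H (Eq. 16) for cameras projecting as A (R X + t)."""
    o1 = -np.linalg.solve(R1, t1)   # optical centres in WCS
    o2 = -np.linalg.solve(R2, t2)
    b = o2 - o1                     # baseline; F is defined up to scale
    G = A2 @ R2 @ b
    H = A2 @ R2 @ np.linalg.inv(A1 @ R1)
    return skew(G) @ H              # row i matches row i of Eq. (16)

def project(A, R, t, X):
    """Homogeneous image point of the world point X, as in Eq. (17)."""
    return A @ (R @ X + t)
```

For any world point, `project(...)` on both cameras should satisfy Eq. (20) up to rounding, and when the second camera's x-axis is aligned with the baseline the first row of the returned matrix vanishes, as stated by Lemma 2.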
We can now demonstrate Theorem 1.

Proof (Theorem 1): by Lemma 1 and Lemma 2, if the epipolar lines of the second camera are horizontal, the fundamental matrix can be reduced to the form:

$$F \propto \begin{bmatrix} 0 \\ -[H]_{3*} \\ [H]_{2*} \end{bmatrix} \qquad (24)$$

Using Eq. (3) we impose:

$$\begin{aligned} [H]_{2*} &= \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \\ [H]_{3*} &= \begin{bmatrix} 0 & 0 & 1 \end{bmatrix} \end{aligned} \qquad (25)$$

That is equivalent to requiring that the y and z axes of the first CCS, as seen from the second CCS system of reference (respectively $[H]_{2*}$ and $[H]_{3*}$), must be parallel to the corresponding axes of the second camera. ∎
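The statement of Theorem 1 and the structure of Eq. (15) can be illustrated with a minimal sketch. It assumes NumPy, identity affine components when none are given, and a caller-supplied $R_{new}$ whose first row is parallel to the baseline; the function names are illustrative only.

```python
import numpy as np

def rectifying_homographies(A1, R1, A2, R2, R_new, K1=None, K2=None):
    """Eq. (15): H_i = K_i R_new (A_i R_i)^-1, with K_i defaulting to identity."""
    K1 = np.eye(3) if K1 is None else K1
    K2 = np.eye(3) if K2 is None else K2
    H1 = K1 @ R_new @ np.linalg.inv(A1 @ R1)
    H2 = K2 @ R_new @ np.linalg.inv(A2 @ R2)
    return H1, H2

def apply_homography(H, x):
    """Map a homogeneous image point and return inhomogeneous (x, y)."""
    y = H @ x
    return y[:2] / y[2]
```

With both cameras re-oriented to the same $R_{new}$, the two projections of a world point differ only along the baseline direction, so after rectification they share the same y coordinate while their x coordinates differ by the disparity.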
3.2 Analytic Derivation of the Minimising Rectification
By Eqs. (14) and (15), finding the global minimum of Eq. (10) is equivalent to finding the optimal common orientation $R_{new}$ of the virtual cameras. Since $\hat{x}$ is imposed by the baseline, the problem is reduced to finding the $\hat{z}$ that minimises distortion. Setting $\hat{y} = \hat{z} \times \hat{x}$ will then determine $R_{new}$. Choosing the minimising $\hat{z}$, in turn, is equivalent, by construction, to finding the pair of corresponding epipolar lines that become the new horizon lines (see Fig. 2), which will lie at the intersection of the plane $\hat{y} = 0$ and the rectified images. Actually only one horizon line must be determined, as the corresponding one is readily found by means of $F$. Finally, the minimisation problem is reduced to a single-parameter problem, noticing that an epipolar line is uniquely determined by its y-intercept $(0, y_1) \in I_1$ in ICS. Such y-intercept, in WCS, takes the form:

$$\mathbf{y}_1 = R_1^{-1} \cdot \left( A_1^{-1} \cdot \begin{bmatrix} 0 & y_1 & 1 \end{bmatrix}^T - t_1 \right) \qquad (26)$$
The $\hat{z}$ axes of the virtual cameras will thus be the direction of the line $z$ perpendicular to $b$, passing through $\mathbf{y}_1$, calculated as the difference between the vector going from $o_1$ to $\mathbf{y}_1$ and its projection on the baseline (its derivation is done by means of the outer product $\otimes$ [13]):

$$\begin{aligned} z &= (\mathbf{y}_1 - o_1) - [(\mathbf{y}_1 - o_1)^T \cdot \hat{x}]\,\hat{x} \\ &= (\mathbf{y}_1 - o_1) - \hat{x} \otimes \hat{x} \cdot (\mathbf{y}_1 - o_1) \end{aligned} \qquad (27)$$

Then follows:

$$\hat{z} = \frac{z}{\|z\|} \qquad (28)$$
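Equations (26) to (28), together with $\hat{y} = \hat{z} \times \hat{x}$, can be sketched as follows. This is a minimal illustration assuming NumPy and the projection model $A_1 (R_1 X + t_1)$; the function name and argument order are hypothetical, and the degenerate case in which the back-projected ray is parallel to the baseline (so that $z = 0$) is not handled.

```python
import numpy as np

def orientation_from_y_intercept(A1, R1, t1, o2, y1):
    """Build R_new (Eq. 14) from the y-intercept y1 of the horizon epipolar line."""
    o1 = -np.linalg.solve(R1, t1)                  # first optical centre in WCS
    x_hat = (o2 - o1) / np.linalg.norm(o2 - o1)    # versor of the baseline
    # Eq. (26): the image point (0, y1) of I1 expressed in WCS
    y1_wcs = np.linalg.solve(R1, np.linalg.solve(A1, np.array([0.0, y1, 1.0])) - t1)
    v = y1_wcs - o1
    z = v - np.outer(x_hat, x_hat) @ v             # Eq. (27): remove the baseline component
    z_hat = z / np.linalg.norm(z)                  # Eq. (28)
    y_hat = np.cross(z_hat, x_hat)                 # completes the right-handed frame
    return np.vstack([x_hat, y_hat, z_hat])
```

The returned matrix is a proper rotation whose first row is aligned with the baseline, so it can be used directly as $R_{new}$ in Eq. (15).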
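Using Eqs. (14), (15) and (27), yields:

$$\begin{aligned} w_1^T &= [H_1]_{3*} = [K_1 \cdot R_{new} \cdot (A_1 \cdot R_1)^{-1}]_{3*} \\ &= \hat{z}^T \cdot (A_1 \cdot R_1)^{-1} = \frac{z^T}{\|z\|} \cdot (A_1 \cdot R_1)^{-1} \\ &= \left( R_1^{-1} \cdot A_1^{-1} \cdot \begin{bmatrix} 0 \\ y_1 \\ 1 \end{bmatrix} - \hat{x} \otimes \hat{x} \cdot R_1^{-1} \cdot A_1^{-1} \cdot \begin{bmatrix} 0 \\ y_1 \\ 1 \end{bmatrix} \right)^T \cdot (A_1 \cdot R_1)^{-1} \\ &= \begin{bmatrix} 0 & y_1 & 1 \end{bmatrix} \cdot (A_1 \cdot R_1)^{-T} \cdot (I_3 - \hat{x} \otimes \hat{x}) \cdot (A_1 \cdot R_1)^{-1} \end{aligned} \qquad (29)$$

where $K_1$ has been omitted as it does not affect the 3rd row of what follows (being an affine transformation, its last row is $\begin{bmatrix} 0 & 0 & 1 \end{bmatrix}$), and $\|z\|$ has been discarded noticing that in Eq. (11) it appears both at numerator and denominator. Rearranging:

$$w_1 = (A_1 \cdot R_1)^{-T} \cdot (I_3 - \hat{x} \otimes \hat{x}) \cdot (A_1 \cdot R_1)^{-1} \cdot \begin{bmatrix} 0 \\ y_1 \\ 1 \end{bmatrix} = L_1 \cdot \begin{bmatrix} 0 \\ y_1 \\ 1 \end{bmatrix} \qquad (30)$$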
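where the matrix:

$$L_1 = (A_1 \cdot R_1)^{-T} \cdot (I_3 - \hat{x} \otimes \hat{x}) \cdot (A_1 \cdot R_1)^{-1} \qquad (31)$$

Similarly for $w_2$:

$$w_2 = L_2 \cdot \begin{bmatrix} 0 \\ y_1 \\ 1 \end{bmatrix} \qquad (32)$$

where:

$$L_2 = (A_2 \cdot R_2)^{-T} \cdot (I_3 - \hat{x} \otimes \hat{x}) \cdot (A_1 \cdot R_1)^{-1} \qquad (33)$$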
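Our goal is to minimise the distortion from Eq. (11), which can now be written as:

$$\frac{\begin{bmatrix} 0 & y_1 & 1 \end{bmatrix} \cdot M_1 \cdot \begin{bmatrix} 0 & y_1 & 1 \end{bmatrix}^T}{\begin{bmatrix} 0 & y_1 & 1 \end{bmatrix} \cdot C_1 \cdot \begin{bmatrix} 0 & y_1 & 1 \end{bmatrix}^T} + \frac{\begin{bmatrix} 0 & y_1 & 1 \end{bmatrix} \cdot M_2 \cdot \begin{bmatrix} 0 & y_1 & 1 \end{bmatrix}^T}{\begin{bmatrix} 0 & y_1 & 1 \end{bmatrix} \cdot C_2 \cdot \begin{bmatrix} 0 & y_1 & 1 \end{bmatrix}^T} \qquad (34)$$

where:

$$\begin{aligned} M_1 &= L_1^T \cdot P_1 \cdot P_1^T \cdot L_1 \\ M_2 &= L_2^T \cdot P_2 \cdot P_2^T \cdot L_2 \\ C_1 &= L_1^T \cdot P_{c1} \cdot P_{c1}^T \cdot L_1 \\ C_2 &= L_2^T \cdot P_{c2} \cdot P_{c2}^T \cdot L_2 \end{aligned} \qquad (35)$$

The terms of Eq. (34) can now be expanded as polynomial expressions in $y_1$.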
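The auxiliary matrices of Eqs. (31), (33) and (35) and the distortion of Eq. (34) translate almost line by line into code. The sketch below assumes NumPy and takes the image-dependent matrices $P_i P_i^T$ and $P_{ci} P_{ci}^T$ as inputs; all names are illustrative rather than the authors' own.

```python
import numpy as np

def auxiliary_matrices(A1, R1, A2, R2, x_hat, PPt1, PcPct1, PPt2, PcPct2):
    """Return L1, L2, M1, M2, C1, C2 as defined in Eqs. (31), (33) and (35)."""
    proj = np.eye(3) - np.outer(x_hat, x_hat)   # I3 minus the outer product of x_hat
    B1 = np.linalg.inv(A1 @ R1)
    B2 = np.linalg.inv(A2 @ R2)
    L1 = B1.T @ proj @ B1
    L2 = B2.T @ proj @ B1
    M1 = L1.T @ PPt1 @ L1
    M2 = L2.T @ PPt2 @ L2
    C1 = L1.T @ PcPct1 @ L1
    C2 = L2.T @ PcPct2 @ L2
    return L1, L2, M1, M2, C1, C2

def distortion(y1, M1, M2, C1, C2):
    """Eq. (34) evaluated at the y-intercept y1."""
    v = np.array([0.0, y1, 1.0])
    return (v @ M1 @ v) / (v @ C1 @ v) + (v @ M2 @ v) / (v @ C2 @ v)
```

A useful sanity check is that $w_1 = L_1 [0\ y_1\ 1]^T$ and $w_2 = L_2 [0\ y_1\ 1]^T$ are both orthogonal to the baseline direction once mapped back through $A_i R_i$, since they both derive from a $\hat{z}$ perpendicular to $\hat{x}$.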
Fig. 3. Possible trend of the distortion in Eq. (34) as a function of y1 . Global minimum is identified by dashed lines.
It can be verified that Eq. (34) takes the form:

$$\frac{f_1(y_1)}{(g_1(y_1))^2} + \frac{f_2(y_1)}{(g_2(y_1))^2} \qquad (36)$$

with $f_1 = f_1(y_1)$, $f_2 = f_2(y_1)$ second degree polynomials and $g_1 = g_1(y_1)$, $g_2 = g_2(y_1)$ first degree polynomials. Differentiating to find the extreme points yields:

$$\left( \frac{f_1}{g_1^2} + \frac{f_2}{g_2^2} \right)' = \frac{g_1 g_2^3 f_1' + g_1^3 g_2 f_2' - 2 f_1 g_2^3 g_1' - 2 f_2 g_1^3 g_2'}{g_1^3 g_2^3} \qquad (37)$$
where the terms of 5th degree cancel out. Discarding the denominator¹, the extreme points are thus found as the solution of a 4th degree polynomial:

$$a y_1^4 + b y_1^3 + c y_1^2 + d y_1 + e = 0 \qquad (38)$$

with $a, b, c, d, e \in \mathbb{R}$. The solutions of Eq. (38) can be found using any closed-form solution formula for polynomials of 4th degree. Full calculations are reported in Appendix A. Figure 3 shows a possible behavior of Eq. (34). Among the acceptable real solutions of Eq. (38), the one representing the global minimum depends on the initial position of the cameras, and therefore can only be determined by directly comparing the value of the distortion in Eq. (34) for each solution. It must be remarked that, in the very peculiar (unrealistic) case in which the two cameras are identical (i.e. $A_1 = A_2$, $P_1 = P_2$ and $P_{c1} = P_{c2}$) and share the exact same orientation $R_1 = R_2$ (not necessarily frontoparallel), then (36) takes the form:

$$2\,\frac{f(y_1)}{(g(y_1))^2} \qquad (39)$$

leading, once differentiated, to a first degree polynomial, thus to a single minimum. The analytical expression of $w_1$ and $w_2$ minimising perspective distortion is therefore found, so that $H_{1p}$ and $H_{2p}$ are directly determined. In order to find the complete expression of the rectifying homographies $H_1$ and $H_2$, the respective affine components, such as shearing and similarity transformations, can be easily calculated [12].
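The root-finding and comparison step can be sketched as follows. The code assumes NumPy, symmetric 3×3 matrices $M_i$ and $C_i$ as in Eq. (35), and the coefficient definitions listed in Appendix A (written here with 0-based indices, so that $[M]_{23}$ becomes `M[1, 2]`). It uses the numerical solver `np.roots` in place of a closed-form quartic formula, which is a substitution made for brevity and not the authors' implementation.

```python
import numpy as np

def quartic_coefficients(M1, C1, M2, C2):
    """Coefficients a..e of Eq. (38), following the definitions of Appendix A."""
    m1 = M1[1, 2] * C1[1, 2] - M1[2, 2] * C1[1, 1]
    m2 = M1[1, 1] * C1[1, 2] - M1[1, 2] * C1[1, 1]
    m3 = C2[1, 2] / C2[1, 1]
    m4 = C2[1, 1] / C1[1, 1]
    m5 = M2[1, 2] * C2[1, 2] - M2[2, 2] * C2[1, 1]
    m6 = M2[1, 1] * C2[1, 2] - M2[1, 2] * C2[1, 1]
    m7 = C1[1, 2] / C1[1, 1]
    m8 = C1[1, 1] / C2[1, 1]
    a = m2 * m4 + m6 * m8
    b = m1 * m4 + 3 * m2 * m3 * m4 + m5 * m8 + 3 * m6 * m7 * m8
    c = 3 * m1 * m3 * m4 + 3 * m2 * m3**2 * m4 + 3 * m5 * m7 * m8 + 3 * m6 * m7**2 * m8
    d = 3 * m1 * m3**2 * m4 + m2 * m3**3 * m4 + 3 * m5 * m7**2 * m8 + m6 * m7**3 * m8
    e = m1 * m3**3 * m4 + m5 * m7**3 * m8
    return a, b, c, d, e

def distortion(y1, M1, C1, M2, C2):
    """Eq. (34) evaluated at the y-intercept y1."""
    v = np.array([0.0, y1, 1.0])
    return (v @ M1 @ v) / (v @ C1 @ v) + (v @ M2 @ v) / (v @ C2 @ v)

def minimising_y_intercept(M1, C1, M2, C2, tol=1e-9):
    """Real stationary point of Eq. (34) with the lowest distortion."""
    roots = np.roots(quartic_coefficients(M1, C1, M2, C2))
    real = [r.real for r in roots if abs(r.imag) <= tol * max(1.0, abs(r))]
    if not real:
        raise ValueError("no real stationary point found")
    # Direct comparison of the distortion at each acceptable solution
    return min(real, key=lambda y: distortion(y, M1, C1, M2, C2))
```

Only the lower-right 2×2 blocks of $M_i$ and $C_i$ enter the computation, because the first component of $[0\ y_1\ 1]^T$ is zero.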
4 Algorithm Summary
The steps of the direct rectifying algorithm explained above are summarised as follows:

1. The weights $w_1$ and $w_2$ of the perspective components of the rectifying homographies are written as polynomial expressions in $y_1$, the y-intercept of the horizon epipolar line on $I_1$.
2. Distortion is written as a function of $y_1$ and auxiliary matrices are defined to calculate the coefficients of the quartic polynomial.
3. The acceptable solutions (either 2 or 4) are calculated.
4. The global minimum is determined by direct comparison of the distortion values obtained.
5. The minimising $w_1$ and $w_2$ are calculated as the last row of the matrices in Eq. (15).
6. The similarity and shearing transformations are calculated [12].
7. The rectification transformations $H_1$ and $H_2$ are fully determined.

The full Python 3 code is made available at https://github.com/decadenza/DirectStereoRectification.

¹ It can be shown that approaching the roots of the denominator of the distortion function both from the upper and lower limit, the function always goes to $+\infty$, therefore the global minimum never reaches $-\infty$, as in Fig. 3.
5 Discussion
Rectifying homographies are transformations that dramatically reduce the search space for correspondences between a pair of stereo images, thus making the stereo matching problem computationally affordable. The optimal transformations are traditionally found making use of numerical minimisation (see [12]), which prevents the use of rectification when low computational power is available (e.g. space assets). Furthermore, the application of the traditional method is limited by a convergence issue, due to the matrix decomposition used therein and thus not avoidable, that will be discussed in the remainder of this Section. For both these reasons, most present applications [9,20] prefer to use the arbitrary, non-optimal solution proposed by Fusiello et al. [5], which, for peculiar configurations, introduces high distortion levels (Sect. 5.1), thus impairing the performance of the following stereo matching algorithm.

This paper makes explicit and demonstrates the formula for the optimal rectifying homographies. The formula is valid for every pair of stereo images, independently of the configuration parameters. Being an exact formulation, it eliminates the need for minimisation, while still providing the optimal transformation, thus enabling the use of rectification in scenarios with very limited computational capabilities, for example autonomous navigation of space satellites. Future work will concentrate on such applications.

The remainder of this section discusses a limiting convergence issue of the traditional method proposed by Loop and Zhang [12]. Indeed, the convergence of this algorithm cannot be guaranteed as it depends upon finding a suitable initial guess, derived assuming A and A′ to be positive-definite, which cannot be guaranteed for all configurations [17].
Consider, for example, the case in which the first CCS is coincident with the WCS and an identical second camera is placed in $\begin{bmatrix} 1 & a & b \end{bmatrix}^T$ (with $a, b \in \mathbb{R}$), oriented as the first one but rotated around the x-axis by an angle $\theta_x$. For all cases in which $b = a \tan\theta_x$, the first element of A will be null, thus, as a straightforward consequence of Sylvester’s Theorem, A will not be positive-definite. Then, Loop and Zhang’s algorithm fails. Moreover, numerical simulations confirm that the configurations for which the positive-definite assumption is violated are numerous and therefore relevant in limiting the applicability of the traditional algorithm. Indeed, considering a setup with fixed intrinsic parameters and randomly generating one million sets of extrinsic parameters (i.e. relative position and rotation between the two cameras), it was found that Loop and Zhang’s algorithm failed in over 50% of the cases, in spite of all the configurations being perfectly legitimate. This happened mostly because of numerical errors in computing the Cholesky decomposition of A and A′. The proposed analytic method, instead, does not need an initial guess and directly provides the optimal solution. To the best of the authors’ knowledge, for the cases mentioned above, all algorithms present in the literature fail in providing the optimal rectifying homography (introducing minimum distortion), which is found instead by the formula derived in this work and is not affected by the configuration settings.
Furthermore, Fig. 4 shows the optimal solution found by our algorithm for an extremely skewed camera configuration, for which Loop and Zhang’s initial guess cannot be retrieved. Finally it must be mentioned that the direct algorithm presented here is capable of providing the optimal solution also in the case of one camera entering the field of view of the other, i.e. when the epipoles lie within image boundaries, which causes convergence issues when employing numerical algorithms.
Fig. 4. Left and right image of a synthetic scene with extremely skewed camera positions (top-left and top-right, respectively). In this case the numerical minimisation of [12] fails. The corresponding rectified image pair using our algorithm is shown (bottom). Horizontal lines are drawn for reference on the rectified image.
In summary, the formula presented here guarantees convergence for every configuration, avoids the need for external libraries and saves computational time as it does not make use of minimisation. Indeed, comparing the computational cost, in our implementation on an Intel i7-7500U CPU 2.70 GHz, the original algorithm by Loop and Zhang requires 8.899 ms on average to calculate the rectifying homographies, while the proposed method takes 7.655 ms.

5.1 Rectification Examples
A synthetic example of rectification is shown in Fig. 5. The original left and right images are first shown. The images, as rectified following our direct algorithm, are then listed, where arbitrary horizontal lines have been drawn as a reference. Here the effects of rectification are clearly visible, as corresponding points are aligned. This example is included in the Python code.
The pair of images in the third row of Fig. 5 shows the same image pair rectified following the algorithm in [5]. As expected, while the value of the minimum distortion introduced in the second row is 46252, the homographies found by Fusiello’s algorithm and generating the third pair of images are non-optimal and introduce a distortion of 48207, about 4% higher than the minimum.
Fig. 5. Left and right image of a synthetic scene (top-left and top-right, respectively) and corresponding rectified image pair as rectified using the direct algorithm (center) and the algorithm in [5] (bottom). Horizontal lines are drawn for reference on the rectified images.
In Fig. 6 a real scene stereo pair is rectified. Although the cameras are placed almost frontoparallel, corresponding points still lie on different scan-lines, so rectification is needed. In real applications, calibration [22] is required to accurately fit the stereo rig to the pinhole camera model. Apart from lens distortion correction, the algorithm finds no difference between real-case and synthetic images, since it starts from the same premises (i.e. both camera projection matrices).
Fig. 6. Left and right image of a real scene (top-left and top-right, respectively) and corresponding rectified image pair (bottom). Horizontal lines are drawn for reference on both images, showing effects of rectification. Lens distortion correction is also noticeable.
6 Conclusions
A direct rectification algorithm for calibrated stereo rigs has been proposed. Our method improves the well-known approach by Loop and Zhang to calculate the optimal rectifying homographies. Thanks to an alternative geometrical interpretation of the problem, the proposed algorithm is able to find the formula for the rectifying homographies that introduce the minimal perspective distortion for any camera configuration, even in extremely skewed relative positions, and without depending on minimisation libraries. The Python 3 code has been made publicly available at https://github.com/decadenza/DirectStereoRectification. Because of the lower computational cost, the value of having an analytic solution may be particularly relevant for hardware-limited applications (e.g. space applications, miniaturised devices), where each change of camera extrinsic or intrinsic parameters would require the computation of new rectifying homographies. Future work might formulate the analytic solution for a distortion metric that includes pixel resampling effects and address applications to scenarios with very limited computational capabilities.
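Appendix A

The coefficients of the 4th order polynomial expression in $y_1$ of Eq. (38) are:

$$\begin{aligned} a &= m_2 m_4 + m_6 m_8 \\ b &= m_1 m_4 + 3 m_2 m_3 m_4 + m_5 m_8 + 3 m_6 m_7 m_8 \\ c &= 3 m_1 m_3 m_4 + 3 m_2 m_3^2 m_4 + 3 m_5 m_7 m_8 + 3 m_6 m_7^2 m_8 \\ d &= 3 m_1 m_3^2 m_4 + m_2 m_3^3 m_4 + 3 m_5 m_7^2 m_8 + m_6 m_7^3 m_8 \\ e &= m_1 m_3^3 m_4 + m_5 m_7^3 m_8 \end{aligned} \qquad (40)$$

with:

$$\begin{aligned} m_1 &= [M_1]_{23} [C_1]_{23} - [M_1]_{33} [C_1]_{22} \\ m_2 &= [M_1]_{22} [C_1]_{23} - [M_1]_{23} [C_1]_{22} \\ m_3 &= \frac{[C_2]_{23}}{[C_2]_{22}} \\ m_4 &= \frac{[C_2]_{22}}{[C_1]_{22}} \\ m_5 &= [M_2]_{23} [C_2]_{23} - [M_2]_{33} [C_2]_{22} \\ m_6 &= [M_2]_{22} [C_2]_{23} - [M_2]_{23} [C_2]_{22} \\ m_7 &= \frac{[C_1]_{23}}{[C_1]_{22}} \\ m_8 &= \frac{[C_1]_{22}}{[C_2]_{22}} \end{aligned} \qquad (41)$$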
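The four roots of the equation are given by:

$$\begin{aligned} y_1 &= -\frac{b}{4a} - Q \pm \frac{1}{2} \sqrt{-4Q^2 - 2p + \frac{S}{Q}} \\ y_1 &= -\frac{b}{4a} + Q \pm \frac{1}{2} \sqrt{-4Q^2 - 2p - \frac{S}{Q}} \end{aligned} \qquad (42)$$

where the sign of the $S/Q$ term is opposite to the sign chosen for $Q$, with:

$$\begin{aligned} Q &= \frac{1}{2} \sqrt{-\frac{2}{3} p + \frac{1}{3a} \left( \Delta_0 + \frac{q}{\Delta_0} \right)} \\ S &= \frac{8a^2 d - 4abc + b^3}{8a^3} \\ \Delta_0 &= \left( \frac{s + \sqrt{s^2 - 4q^3}}{2} \right)^{\frac{1}{3}} \\ p &= \frac{8ac - 3b^2}{8a^2} \\ q &= 12ae - 3bd + c^2 \\ s &= 27ad^2 - 72ace + 27b^2 e - 9bcd + 2c^3 \end{aligned} \qquad (43)$$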
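For reference, the closed-form expression can be evaluated with complex arithmetic as in the sketch below, which keeps the symbol names $p$, $q$, $s$, $S$, $Q$ and $\Delta_0$ used above. It relies only on the standard `cmath` module and assumes $a \neq 0$, $\Delta_0 \neq 0$ and $Q \neq 0$; these degenerate cases are not handled and would need a separate branch. In the standard form of the general quartic solution, the roots built with $-Q$ take $+S/Q$ under the square root, while those built with $+Q$ take $-S/Q$, and the code follows that pairing.

```python
import cmath

def quartic_roots(a, b, c, d, e):
    """Roots of a*y^4 + b*y^3 + c*y^2 + d*y + e = 0 using the closed form above."""
    p = (8 * a * c - 3 * b**2) / (8 * a**2)
    S = (8 * a**2 * d - 4 * a * b * c + b**3) / (8 * a**3)
    q = 12 * a * e - 3 * b * d + c**2
    s = 27 * a * d**2 - 72 * a * c * e + 27 * b**2 * e - 9 * b * c * d + 2 * c**3
    # Principal complex cube root; assumed non-zero in this sketch
    delta0 = ((s + cmath.sqrt(s**2 - 4 * q**3)) / 2) ** (1 / 3)
    Q = 0.5 * cmath.sqrt(-2 * p / 3 + (delta0 + q / delta0) / (3 * a))
    roots = []
    for sign in (-1, 1):
        # The S/Q term has the opposite sign to the Q term
        radicand = -4 * Q**2 - 2 * p - sign * S / Q
        for pm in (-1, 1):
            roots.append(-b / (4 * a) + sign * Q + pm * 0.5 * cmath.sqrt(radicand))
    return roots
```

The returned values are complex numbers; for the rectification problem only those with a negligible imaginary part are acceptable candidates for $y_1$.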
Remark: for the case $A_1 = A_2$, $P_1 = P_2$, $P_{c1} = P_{c2}$ and $R_1 = R_2$, the solution is given by:

$$y_1 = -\frac{m_1}{m_2} \qquad (44)$$
References

1. Abraham, S., Förstner, W.: Fish-eye-stereo calibration and epipolar rectification. ISPRS J. Photogramm. Remote. Sens. 59(5), 278–288 (2005)
2. Besl, P.J.: Active optical range imaging sensors. In: Sanz, J.L.C. (ed.) Advances in Machine Vision, pp. 1–63. Springer, New York (1989). https://doi.org/10.1007/978-1-4612-4532-2_1
3. Chen, Z., Wu, C., Tsui, H.T.: A new image rectification algorithm. Pattern Recogn. Lett. 24(1–3), 251–260 (2003)
4. Deriche, R., Zhang, Z., Luong, Q.-T., Faugeras, O.: Robust recovery of the epipolar geometry for an uncalibrated stereo rig. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 800, pp. 567–576. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-57956-7_64
5. Fusiello, A., Trucco, E., Verri, A.: A compact algorithm for rectification of stereo pairs. Mach. Vis. Appl. 12(1), 16–22 (2000)
6. Gallier, J.: Basics of projective geometry. In: Gallier, J. (ed.) Geometric Methods and Applications, pp. 103–175. Springer, New York (2011). https://doi.org/10.1007/978-1-4419-9961-0_5
7. Gluckman, J., Nayar, S.K.: Rectifying transformations that minimize resampling effects. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001 (2001)
8. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2003)
9. Hinzmann, T., Schönberger, J.L., Pollefeys, M., Siegwart, R.: Mapping on the fly: real-time 3d dense reconstruction, digital surface map and incremental orthomosaic generation for unmanned aerial vehicles. In: Hutter, M., Siegwart, R. (eds.) Field and Service Robotics. SPAR, vol. 5, pp. 383–396. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-67361-5_25
10. Hornberg, A.: Handbook of Machine and Computer Vision: The Guide for Developers and Users. Wiley, Weinheim (2017)
11. Kumar, S., Micheloni, C., Piciarelli, C., Foresti, G.L.: Stereo rectification of uncalibrated and heterogeneous images. Pattern Recogn. Lett. 31(11), 1445–1452 (2010)
12. Loop, C., Zhang, Z.: Computing rectifying homographies for stereo vision. In: 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1 (1999)
13. Lu, H., Plataniotis, K., Venetsanopoulos, A.N.: Multilinear Subspace Learning: Dimensionality Reduction of Multidimensional Data. Chapman & Hall/CRC (January 2013)
14. Luong, Q.-T., Faugeras, O.D.: The fundamental matrix: theory, algorithms, and stability analysis. Int. J. Comput. Vis. 17(1), 43–75 (1996)
15. Monasse, P., Morel, J.-M., Tang, Z.: Three-step image rectification. In: BMVC 2010-British Machine Vision Conference (2010)
16. Xiao, R., Sun, W., Pang, J., Yan, Q., Ren, J.: DSR: direct self-rectification for uncalibrated dual-lens cameras. In: International Conference on 3D Vision (2018)
17. Liansheng, S., Jiulong, Z., Duwu, C.: Image rectification using affine epipolar geometric constraint (2009)
18. Sun, C.: Closed-form stereo image rectification. Association for Computing Machinery (2012)
19. Turrini, C.: Geometria per la ricostruzione tridimensionale (2017)
20. Xia, R., et al.: An accurate and robust method for the measurement of circular holes based on binocular vision. Measure. Sci. Technol. 31(2), 025006 (2019)
21. Yin, X., Wang, X., Yu, J., Zhang, M., Fua, P., Tao, D.: FishEyeRecNet: a multicontext collaborative deep network for fisheye image rectification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11214, pp. 475–490. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01249-6_29
22. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
23. Xue, Z., Xue, N., Xia, G.-S., Shen, W.: Learning to calibrate straight lines for fisheye image rectification (April 2019)
Statistical Analysis of Electroencephalographic Signals in the Stimulation of Energy Data Visualizations

O. F. Kucukler1(B), A. Amira1,2, and H. Malekmohamadi1

1 Institute of Artificial Intelligence, De Montfort University, Leicester, UK
[email protected]
2 Department of Computer Science, University of Sharjah, Sharjah, UAE
Abstract. Increasingly luxurious living standards coupled with technological developments have made energy efficiency in homes much more important. By protecting the environment and preventing the depletion of energy resources, conscious energy use has an important role in preserving a livable world for future generations. The brain–computer interface (BCI) has been widely used to improve the quality of life of individuals. There have been numerous research projects on predicting the behavior of energy consumers. However, detecting emotional responses and incorporating personal perceptions of individuals is still a challenge to understanding energy users’ behavior. This paper investigates the best selection method for energy data visualization using BCI systems for improving energy users’ experience. An experimental study has been conducted to acquire electroencephalography (EEG) signals of energy users against the stimuli of energy data visualizations to detect emotions and understand the users’ perceptions. A self-assessment manikin (SAM) is used to rate the arousal and valence scales required for emotion classification. Sample entropy (SampEn) and approximate entropy (ApEn) are utilized to analyze EEG data. One-way ANOVA and Tukey’s honestly significant difference test are applied to the entropy values of EEG signals, and the statistical analysis shows some promising results.

Keywords: Brain–computer interfaces · Data visualization · Electroencephalography · Entropy · Energy · Emotion recognition
1 Introduction

Domestic energy consumption data has become accessible to users with the developments in technology in the past few decades [1–3]. By promoting smart meters and smart home management systems, it is intended to increase awareness of how much energy is used in households by helping people estimate their consumption. In turn, the acquired experience can potentially assist people to decrease their energy consumption. The widespread adoption of such a reduction in energy consumption can have a significant impact on reducing climate change problems. In this regard, it is very important to improve energy users’ behavior. As their knowledge about energy consumption increases, energy users will be able to save energy.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 504–519, 2022. https://doi.org/10.1007/978-3-031-10464-0_34
Statistical Analysis of Electroencephalographic Signals
How exactly can smart meters and home energy management systems help homeowners better comprehend the ways they consume energy? How much information does conventional energy billing provide? If the consumption of separate appliances is not known, how can people be expected to stop overusing them? The purpose behind these questions is to emphasize the importance of receiving more detailed feedback about users’ energy consumption. There have been many advances in technology for informing users about their daily energy usage. It is now possible for users to simply access data on how much energy is being utilized in their homes. Data visualization is a tool used to facilitate communication between people and data. Improving data visualizations can therefore lead to a significant improvement in users’ experience. Moreover, it has recently become more common for the general public to interact with novel data visualizations. Popular websites, newspapers and journals are actively utilizing visualizations to deliver news and assessments with data. Furthermore, line charts, bar charts, pie charts etc. are basic visualizations that were shown to people in the past [4]. An important consideration is how users’ experience can be improved. To design effective visualizations, it is crucial to understand human perception [5], and evaluating how well users can comprehend visualizations is a crucial factor when communicating relevant data [6]. The literature indicates that there are many factors that influence how end users behave when it comes to energy savings. Global warming consciousness, environmental behavior, social interaction, age, and income play a significant role in energy-saving behavior [38]. In [38], it is emphasized that education is considered one of the key actions that encourage energy-saving behavior by raising awareness and changing behavior.
Different data visualizations of energy consumption [21], a model of visualization sensemaking [4], and a mobile application acting as a recommendation system [37] are examples of techniques and models for changing energy-saving habits. In spite of the effectiveness of the above-mentioned models, comparative research into users’ emotional responses and perceptions of different data visualizations is relatively scarce. In [20], an experimental study is performed to examine participants’ understanding of data visualizations, and [21] proposes a game to reduce energy consumption competitively using two different visualizations. These two approaches, despite offering useful recommendations, fail to understand and modify energy users’ behavior. Therefore, it is critical to combine various data visualizations, emotions, and perceptions of participants in an efficient and smart system. In this context, a significant term arises, which is the brain–computer interface (BCI) system. As the impact of machines on our lives continues to increase in recent years, BCI is well suited to analyzing this interaction [7]. As a conventional approach, BCI has been mainly used for control applications: cursors, paralyzed bodies, robotic arms, phone dialling etc. Among other things, a BCI can detect emotions in a person when he or she is exposed to a visual, audio-visual, or audio stimulus. An average BCI system typically starts with the acquisition of a signal emanating from the cerebrum, measured either via an implanted subdural electrode or with the non-invasive placement of electrodes on the scalp [8]. The electroencephalography (EEG) technique records the electrical activity generated by the brain through electrodes placed on the scalp.
O. F. Kucukler et al.
The purpose of this paper is to evaluate the BCI system responses to six different data visualizations of domestic energy consumption to improve users’ experience. The data visualization techniques are (1) an area chart, (2) a pie chart, (3) a bubble chart, (4) a scatter plot, (5) a density heatmap, and (6) a 2D histogram contour. A statistical analysis method combining the analysis of SampEn and ApEn values of EEG signals is proposed to monitor changes in the participants’ emotional responses. The main contributions of this study can be summarized as follows: (1) development of an automated system to select the best data visualization for individuals; (2) the analysis of SampEn and ApEn features of EEG signals to determine whether participants’ emotional reactions can be detected; (3) achievement of a significance level below 0.05 in the ANOVA test for all participants; and (4) presentation of a new approach for the analysis of EEG in the stimulation of energy data visualizations. The layout of the paper is as follows. In Sect. 2, the literature on emotion recognition and data visualization is presented. Section 3 delves into the proposed system for EEG data acquisition. Data analysis is presented in Sect. 4. The discussion and conclusion of this study follow in Sects. 5 and 6, respectively.
2 Related Work

In light of the importance and interest in the field of emotion recognition, there have been various proposed approaches in the domain of emotion recognition from EEG and physiological signals or other modalities. In [9], a review of computational techniques for emotion recognition from EEG signals is presented for discussing pre-processing, feature extraction and classification methods. Furthermore, in [10] the basics of emotion recognition methods are presented with definitions and steps. In [11], in a more recent review, in addition to discussing the recent developments in the emotion recognition area, a virtual reality-based emotion stimulus was proposed for future research opportunities. The DREAMER database was created by [12] with EEG and electrocardiogram (ECG) signals collected from a consumer-grade EEG device during emotional elicitation by audio-visual stimuli. Their experimental evaluation showed that the use of EEG and ECG based features and their fusion achieved higher classification accuracy than random voting of modalities. DEAP database for analysis of emotions was proposed by [13]. EEG and peripheral physiological signals were acquired from 32 participants as they watched 40 video clips with different emotional stimuli. In [14], a multi-channel EEG-based emotion recognition approach was utilized and a 2D frame was proposed instead of using traditional feature extraction methods for classification with a deep neural network (DNN) classifier. The classification accuracy for the DEAP database was higher than the DREAMER database. An attention-based convolutional recurrent neural network (ACRNN) approach was proposed to utilize more particular features for improving classification accuracy [15]. A channel-wise attention method that extracts attention information from channels and a self-attention method that extracts attention information from each sample were fused to the ACRNN approach.
Their classification results showed that the performance of EEG classification in DREAMER and DEAP databases improved, while higher accuracy was gained from the DREAMER dataset.
A multimodal database MAHNOB-HCI was created by [16]. Aside from recordings of face videos and audio signals, eye gaze data and physiological signals of the central nervous system were also acquired. In [17], a new classifier the broad learning system (BLS) was presented to improve the performance of emotion classification and a new approach was proposed as subject-independent emotion classification for four different emotions. Gray-scale image (GSI) features were extracted using continuous wavelet transform (CWT). Their approach showed higher accuracy rates and fast computation time with 0.7 s for the DEAP database and 0.6 s for the MAHNOB-HCI database. On the other hand, a locally robust feature selection (LRFS) was presented to elect features for individual-independent emotions [18]. It is obtained by comparing the probability densities of extracted features from each subject. As a result, more similar features were utilised for emotion classification. In [19], a shared-subspace feature elimination (SSFE) method was presented to address an inter-individual emotion recognition for multichannel EEG data. By analysing similarities of EEG signals for multiple subjects, features that convey similar characteristics were selected for emotion classification. The most important part of this study is to select the most appropriate features that can describe more accurately nonlinear EEG signals. In [30], approximate entropy (ApEn) is defined as an advanced technique for statistical analysis. It is proposed that ApEn can analyze complicated systems that include more than 1000 data points. In contrast, sample entropy (SampEn) is introduced by [31] to improve the relative inconsistency results of ApEn. It is suggested that SampEn is more consistent and independent of the length of data. In [32], the EEG signals are analyzed to detect driver fatigue using sample entropy. 
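To make the entropy feature concrete, the following is a minimal sketch of sample entropy in its commonly used formulation, SampEn(m, r) = -ln(A/B), where B counts pairs of length-m templates that match within a tolerance and A counts the pairs that still match when extended to length m + 1. The embedding dimension m = 2, the tolerance of 0.2 times the standard deviation, and the Chebyshev distance are conventional default choices assumed here, not parameters reported in this paper; the function name is illustrative.

```python
import math
import random

def sample_entropy(signal, m=2, r=0.2):
    """SampEn(m, r) = -ln(A / B) with tolerance r * std and Chebyshev distance."""
    n = len(signal)
    mean = sum(signal) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in signal) / n)
    tol = r * std

    def count_pairs(length):
        # The same n - m template start points are used for both lengths;
        # self-matches are excluded by starting j at i + 1.
        count = 0
        for i in range(n - m):
            for j in range(i + 1, n - m):
                if max(abs(signal[i + k] - signal[j + k]) for k in range(length)) <= tol:
                    count += 1
        return count

    b = count_pairs(m)        # template pairs matching over m points
    a = count_pairs(m + 1)    # pairs still matching over m + 1 points
    if a == 0 or b == 0:
        return float("inf")   # undefined for too short or too irregular data
    return -math.log(a / b)

# Illustrative comparison: a regular signal versus white noise
random.seed(0)
sine = [math.sin(2 * math.pi * i / 20) for i in range(200)]
noise = [random.gauss(0.0, 1.0) for _ in range(200)]
print(sample_entropy(sine), sample_entropy(noise))
```

A regular signal such as the sine wave yields a lower value than the noise sequence, which is the property that makes entropy measures candidates for distinguishing EEG responses across stimuli.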
An entropy-based method is comparatively better suited to EEG signals than other feature extraction methods [32]. In [33], entropy features are effectively employed to recognise emotions from EEG signals. Emotions can be stimulated through a variety of methods, including audio, visual, and audio-visual stimuli. The information must be presented clearly to evoke emotions at a high level. In [20], it was argued that optimizing people’s behaviour is important for reducing energy consumption. Therefore, people need to know which devices consume the most energy in their daily lives. A study was conducted to see how households respond to visualizations of energy data in real life. Participants were shown three different types of energy data visualizations: aggregated line graphs, disaggregated line graphs, and area-based visualizations. The results of the experiment showed that the area-based visualization gives participants a better understanding. According to [21], despite consumers’ difficulties in engaging with energy data, it is crucial to develop relationships between users and energy data. Two different visualization techniques were used to construct this relationship, namely an engineering-type visualization (a bar chart) and an ambient-type visualization (a cartoon tree). Moreover, the experiment was performed with the assistance of an energy-saving game. Over a three-day experiment, the ambient-type visualization of a cartoon tree reduced energy consumption by 82 Wh per day. The study in [4] aimed to observe what novices feel when they see data visualizations for the first time. An experiment was therefore conducted with 13 participants, and audio and video recordings, as well as an interview,
O. F. Kucukler et al.
were made during the experiment. The data were analysed with the NOVIS model presented by the authors. The model comprises five cognitive activities: encountering visualization, constructing a frame, exploring visualization, questioning the frame, and floundering on visualization. Three graphic types were used in the experiment: the parallel-coordinates plot, the treemap and the chord diagram. During the experiment, verbal reactions were recorded for analysis. Through this approach, it is possible to understand novice users’ cognitive activities when they face visualizations. In [36], a mobile application called Exploiting Micro-Moments and Mobile Recommendation Systems (EM3) is proposed to address the overuse of energy by interacting with a real-time energy usage platform. In the application, energy data visualizations and micro-moments are presented to energy users to improve their consumption behaviour. In [22], a study collects participants’ responses to visualizations of energy consumption data. To determine which data visualization is best suited to consumers, participants are surveyed about novel and conventional data visualizations. The study results indicate that neither group of data visualizations is superior to the other, and the next step is to collect participants’ feedback on the data visualizations.
3 Proposed System
A group of five healthy individuals participated in this user study. None of the participants reported any health problems. EEG is recorded using an EMOTIV EPOC Flex system with 32 sintered Ag/AgCl electrodes placed on a cap according to the international 10–20 system, at a sampling rate of 128 Hz. The EMOTIV headset allows the CMS and DRL reference electrodes to be placed on the ears or on the cap together with the 32 electrodes. This type of cap is commonly used in the field of neuroscience. Figure 1 shows the EMOTIV EPOC Flex Gel Kit headset.
Fig. 1. EMOTIV EPOC Flex Gel Kit headset [35]
Participants are invited to join the study after preparations have been made. The experiments are conducted in an enclosed laboratory with controlled lighting to limit environmental influences. Before the experiment begins, the concept and the details of the experiment are explained to the participant. Once it is clear that the participant fully understands them, the experiment begins. Participants wear the cap, and the electrodes are connected with conductive gel. The skin at the electrode sites is cleaned with isopropyl alcohol to reduce impedance and ensure the best quality EEG signal. The EMOTIV software helps to verify that the electrodes are connected properly by showing good connections in green. Once the signal quality and the connection quality are at 100%, the device is ready for the experiment. Each trial lasts two minutes and the experiment lasts 12 min in total. After the experiment, participants are advised to wash their hair to remove the gel. Following the instructions from EMOTIV, the electrodes and caps are cleaned with soap and water before use in the next experiment. Figure 2 shows a participant before the experiment. Participants use a computer to interact with the stimuli.
Fig. 2. A participant before the experiment with the headset.
In the experiment, participants interact with the energy data visualizations. Using SAMs [25], they rate each graph on a scale from 1 (low) to 9 (high), separately for valence and arousal, based on the emotions it evokes, and they answer 3 questions about it. The ratings follow the valence-arousal scale presented by [24], which is widely used in the literature. Valence corresponds to pleasantness and ranges from negative to positive, while arousal ranges from calm to excited. Participants are instructed to minimize movement, especially eye blinks, jaw movements, and other facial movements. These movements cause noise and artefacts in EEG signals, which complicates the analysis of EEG data. Figure 3 shows a participant’s first interaction with a data visualization. Participants’ EEG data are saved in the EMOTIV software data storage, and their ratings are recorded in an Excel document for further analysis.
Fig. 3. A participant examining the graph (an area chart) in the experiment
Stimuli are created using the PsychoPy [34] application, which is widely used in the field of neuroscience for presenting stimuli in EEG experiments. The stimulus application is accessible through a web link associated with the PsychoPy experiment. Participants join the experiment via that link, which is provided before the experiment. In [23], psychologists recommend using videos of 1 to 10 min as emotional excerpts; thus, 40 s per visualization for each participant is considered acceptable in our study. Each trial consists of a 40-s stimulus in which an energy data visualization is shown, followed by a 20-s neutral image (a blank page) to neutralize participants’ emotions. After that, participants complete a self-assessment rating and a short questionnaire. At the end of each trial, there is an eyes-closed resting period of 20 s. A range of energy data visualizations is utilized in this study. Traditional charts are basic charts (e.g. line, bar) that are widely used in different fields to illustrate information. In [22], the authors used a publicly available energy dataset [26]. These data are real-time measurements of household energy over a given period. We used this dataset to create our energy data visualizations. One day of power consumption data is taken from the energy dataset, and 6 different visualizations are drawn for a 12-h period of that day. Figure 4 shows a bubble chart, a density heatmap, a scatter plot, a 2D histogram contour, an area chart, and a pie chart that are created from the energy dataset to denote daily energy consumption. All graphs are based on the same 12 h of data.
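The trial structure described above can be laid out as a simple timeline. The following is a minimal sketch in plain Python, not PsychoPy code: the phase names are hypothetical, and because the text does not state how long the rating and questionnaire take, the sketch assumes they fill the remainder of each two-minute trial.

```python
# Hypothetical trial timeline; phase names are illustrative, not PsychoPy API calls.
TRIAL_SECONDS = 120  # each trial lasts two minutes according to the text

PHASES = [
    ("stimulus", 40),          # energy data visualization on screen
    ("neutral", 20),           # blank page
    ("rating", None),          # SAM rating + questionnaire; duration not stated in the text
    ("rest_eyes_closed", 20),  # eyes-closed resting period
]


def build_timeline(n_trials=6):
    """Return (trial, phase, start_s, end_s) tuples for the whole session."""
    fixed = sum(d for _, d in PHASES if d is not None)
    rating = TRIAL_SECONDS - fixed  # assumption: rating fills the rest of the trial
    timeline, t = [], 0
    for trial in range(1, n_trials + 1):
        for name, dur in PHASES:
            dur = rating if dur is None else dur
            timeline.append((trial, name, t, t + dur))
            t += dur
    return timeline


timeline = build_timeline()
total_minutes = timeline[-1][3] / 60
# Time windows (in seconds) from which the 40-s stimulus EEG segments would be cut.
stimulus_windows = [(s, e) for _, name, s, e in timeline if name == "stimulus"]
print(total_minutes, stimulus_windows)
```

With six trials the session adds up to the 12 min reported for the experiment, and the stimulus windows give the offsets at which stimulus-locked EEG segments would start under these assumptions.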
Fig. 4. Six different energy data visualizations used in the experiment, (a) a bubble chart, (b) a density heatmap, (c) a scatter plot, (d) a 2D histogram contour, (e) an area chart, (f) a pie chart.
4 Data Analysis
EEG data collected during the experiment are already preprocessed at the source and sampled at 128 Hz. The EEG recordings are filtered with a second-order Butterworth filter with 3 dB points at 0.5 Hz and 128 Hz. Recordings of 32 channels were obtained from each of the 5 participants, totalling 12 min per participant. According to [27], left frontal EEG activity is associated with feelings such as anger and happiness, while right frontal activity is associated with feelings such as sadness, anxiety, and fear. Thus, the frontal channels Fp1, Fp2, F3 and F4 are used to analyze reactions relating to the approach. A total of six different data visualizations were used as stimuli over the 12 min of EEG signals. Each stimulus corresponds to a 40-s EEG recording with 5054 samples on average. The EEG data corresponding to a stimulus were split into 1-s epochs. Thus,
40 s of EEG data from 4 channels yields 160 one-second epochs, and 960 epochs are generated per participant across the 6 data visualizations. We computed SampEn and ApEn for each 1-s epoch. To analyze the results, we collected the results of each stimulus separately for each subject. We then calculated the mean and standard deviation of the entropies for each of the six stimuli. The results of two subjects are shown in Table 1.
Table 1. Mean and standard deviation of sample entropy and approximate entropy values for 960 epochs in two different participants
              Subject 1                           Subject 4
              SampEn            ApEn              SampEn            ApEn
Stimuli       Mean   Std. dev   Mean   Std. dev   Mean   Std. dev   Mean   Std. dev
Data vis. 1   1.29   0.30       0.76   0.06       1.22   0.36       0.74   0.08
Data vis. 2   1.16   0.32       0.75   0.06       1.05   0.52       0.65   0.19
Data vis. 3   1.06   0.29       0.73   0.07       1.15   0.46       0.70   0.16
Data vis. 4   0.96   0.32       0.69   0.09       1.15   0.42       0.71   0.13
Data vis. 5   0.95   0.33       0.68   0.13       1.32   0.40       0.73   0.12
Data vis. 6   0.95   0.29       0.69   0.09       1.36   0.50       0.71   0.13
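The epoching and entropy computation described above can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions, not the authors' implementation: the embedding dimension m = 2 and the tolerance r = 0.2 times the standard deviation are common defaults that the text does not specify, the helper names are hypothetical, and the signals are synthetic stand-ins rather than EEG.

```python
import math
import random

FS = 128  # sampling rate in Hz, as stated in the text


def split_epochs(signal, fs=FS, epoch_s=1):
    """Cut a 1-D signal into non-overlapping epochs, dropping any incomplete tail."""
    step = fs * epoch_s
    return [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]


def _tolerance(x, factor=0.2):
    """Assumed tolerance r = 0.2 * standard deviation (common default, not from the paper)."""
    mean = sum(x) / len(x)
    return factor * math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))


def _dist(a, b):
    """Chebyshev distance between two templates."""
    return max(abs(p - q) for p, q in zip(a, b))


def sample_entropy(x, m=2, r=None):
    """SampEn = -ln(A / B), counting matching template pairs without self-matches."""
    r = _tolerance(x) if r is None else r
    n = len(x)

    def matches(length):
        t = [x[i:i + length] for i in range(n - m)]  # same template count for m and m + 1
        return sum(1 for i in range(len(t)) for j in range(i + 1, len(t))
                   if _dist(t[i], t[j]) <= r)

    b, a = matches(m), matches(m + 1)
    return float("inf") if a == 0 or b == 0 else -math.log(a / b)


def approximate_entropy(x, m=2, r=None):
    """ApEn = phi(m) - phi(m + 1), where self-matches are included."""
    r = _tolerance(x) if r is None else r
    n = len(x)

    def phi(length):
        t = [x[i:i + length] for i in range(n - length + 1)]
        return sum(math.log(sum(1 for u in t if _dist(v, u) <= r) / len(t)) for v in t) / len(t)

    return phi(m) - phi(m + 1)


# Synthetic stand-ins for one channel of one 40-s stimulus (not real EEG).
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(40 * FS)]
sine = [math.sin(2 * math.pi * i / 16) for i in range(40 * FS)]

epochs_per_channel = len(split_epochs(noise))     # one-second epochs in 40 s
epochs_per_stimulus = epochs_per_channel * 4      # four frontal channels
epochs_per_participant = epochs_per_stimulus * 6  # six data visualizations

noise_epoch, sine_epoch = split_epochs(noise)[0], split_epochs(sine)[0]
print(sample_entropy(sine_epoch), sample_entropy(noise_epoch))
print(approximate_entropy(sine_epoch), approximate_entropy(noise_epoch))
```

The regular sine epoch should score lower than the noise epoch on both measures, which is the behaviour that makes entropy usable as a complexity feature. The epoch counts follow the arithmetic in the text for a full 40-s segment at 128 Hz.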
In Table 1, it can be seen that the SampEn means differ slightly among stimuli and have high standard deviations. The ApEn means are remarkably similar and show low standard deviations. On the other hand, SampEn has a higher mean than ApEn. We applied a one-way ANOVA test to the signal entropy values across the six stimuli to determine whether there are statistically significant differences between the means. ANOVA uses the F test to calculate its results. In [28], the F statistic is defined as the ratio of the between-states variance to the within-states variance:
$$F = \frac{\text{Between states variance}}{\text{Within states variance}} \qquad (1)$$
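The ratio in Eq. (1) can be computed directly from grouped values. The sketch below is a plain-Python illustration with a hypothetical function name and toy numbers, not the software used in the study. It also shows why six stimuli with 160 epochs each give 5 and 954 degrees of freedom.

```python
def one_way_anova(groups):
    """Return mean squares, degrees of freedom and F for a list of samples (one per state)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-states sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-states sum of squares: spread of each value around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        "df_between": df_between,
        "df_within": df_within,
        "F": ms_between / ms_within,
    }


# Toy data: three states with three values each.
toy = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Placeholder values shaped like the study design: 6 stimuli x 160 epochs.
design = one_way_anova([[float(i + j % 3) for j in range(160)] for i in range(6)])
print(toy["F"], design["df_between"], design["df_within"])
```

Turning F into a p-value additionally requires the F distribution, which is left to a statistics package.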
The statistical analysis shows that the differences in both the SampEn and ApEn means are statistically significant, with reported p-values of 0.000 for Subject 1 and Subject 4. Subject 2, Subject 3 and Subject 5 showed significance at 0.02, 0.01 and 0.02, respectively. Table 2 lists the calculated degrees of freedom, F scores and p-values for the six stimuli in Subject 1 and Subject 4. Comparing the F scores, SampEn yields higher values than ApEn. The ANOVA shows that the mean entropy values are significantly different, but it does not indicate which stimuli differ. The source of the significant differences can be identified with Tukey’s test for honestly significant differences (HSD) [29]. Tukey’s HSD method examines all pairwise differences between stimuli while controlling for type I errors. To detect these differences, we applied Tukey’s HSD method to the data visualizations. Tables 3 and 4 show Tukey’s HSD test results for SampEn and ApEn, respectively.
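The pairwise comparison behind Tukey’s HSD can be illustrated with the studentized range statistic. The sketch below assumes equal group sizes and uses a hypothetical function name; it computes only the statistic q for each pair. Converting q to a p-value needs the studentized range distribution, which a statistics package would supply (for instance SciPy's `scipy.stats.tukey_hsd`).

```python
import math
from itertools import combinations


def tukey_q_statistics(groups):
    """Pairwise studentized range statistics q for equal-size groups.

    q_ij = |mean_i - mean_j| / sqrt(MS_within / n), with n values per group.
    """
    k, n = len(groups), len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    means = [sum(g) / n for g in groups]
    df_within = k * (n - 1)
    ms_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g) / df_within
    se = math.sqrt(ms_within / n)
    # Keys are 1-based stimulus indices, e.g. (1, 2) for Data vis. 1 vs Data vis. 2.
    return {(i + 1, j + 1): abs(means[i] - means[j]) / se
            for i, j in combinations(range(k), 2)}


# Toy data: three states with three values each.
toy_q = tukey_q_statistics([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Placeholder values shaped like the study design: 6 stimuli x 160 epochs.
design_q = tukey_q_statistics([[float(i + j % 3) for j in range(160)] for i in range(6)])
print(toy_q, len(design_q))
```

For six stimuli there are 15 distinct pairs, which is why the pairwise tables repeat each value in mirrored rows.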
Table 2. ANOVA analysis of SampEn and ApEn for participants’ EEG
Participant   Entropy type   Variance type    Mean square   df    F        p-value
Subject 1     SampEn         Between states   3.121         5     32.492   0.000
                             Within states    0.096         954
              ApEn           Between states   0.180         5     23.422   0.000
                             Within states    0.008         954
Subject 4     SampEn         Between states   2.174         5     10.930   0.000
                             Within states    0.199         954
              ApEn           Between states   0.137         5     7.056    0.000
                             Within states    0.019         954
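As a quick consistency check on Table 2, the ratio in Eq. (1) can be applied to the rounded mean squares printed in the table. This is only a sketch of the arithmetic: the ratios approximately reproduce the reported F scores, and the remaining gaps are of the size one would expect from rounding the within-states mean squares to three decimals.

$$\frac{3.121}{0.096} \approx 32.5, \qquad \frac{0.180}{0.008} \approx 22.5, \qquad \frac{2.174}{0.199} \approx 10.9, \qquad \frac{0.137}{0.019} \approx 7.2$$

The reported values are 32.492, 23.422, 10.930 and 7.056, respectively.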
Table 3. Tukey’s HSD test for SampEn in five different participants
(I) State     (J) State     Subject 1   Subject 2   Subject 3   Subject 4   Subject 5
                            p-value     p-value     p-value     p-value     p-value
Data vis. 1   Data vis. 2   0.004       0.785       0.000       0.008       0.015
              Data vis. 3   0.000       0.007       0.494       0.769       0.201
              Data vis. 4   0.000       0.603       0.914       0.759       0.000
              Data vis. 5   0.000       0.050       0.166       0.294       0.045
              Data vis. 6   0.000       0.155       0.069       0.062       0.236
Data vis. 2   Data vis. 1   0.004       0.785       0.000       0.008       0.015
              Data vis. 3   0.052       0.248       0.059       0.274       0.927
              Data vis. 4   0.000       1.000       0.006       0.283       0.879
              Data vis. 5   0.000       0.630       0.249       0.000       0.999
              Data vis. 6   0.000       0.878       0.458       0.000       0.899
For most subjects, some pairwise p-values lie between 0.000 and 0.01, which indicates significance at the 0.01 level. For every subject, at least one pair of data visualizations differs. Judging by the standard errors and p-values, Tukey’s HSD test produces better results with SampEn than with ApEn. Figure 5(a) shows the measured EEG data of subject 2 (right) for an area chart (left), and Fig. 5(b) shows the measured EEG data of subject 2 (right) for a 2D histogram contour (left). SAM ratings make it possible to compare subjective responses with the analysis results. For example, Tukey’s HSD test finds no significant difference between the first and second data visualizations for Subject 2, while the SAM ratings suggest that Subject 2 prefers the first data visualization to the second.
Fig. 5. Subject 2’s EEG recordings while interacting with (a) an area chart and (b) a 2D histogram contour
Table 4. Tukey’s HSD test for ApEn in five different participants
(I) State     (J) State     Subject 1   Subject 2   Subject 3   Subject 4   Subject 5
                            p-value     p-value     p-value     p-value     p-value
Data vis. 1   Data vis. 2   0.938       0.792       0.008       0.000       0.770
              Data vis. 3   0.008       0.017       0.009       0.240       0.802
              Data vis. 4   0.000       0.735       0.273       0.411       0.004
              Data vis. 5   0.000       0.052       0.029       0.995       0.753
              Data vis. 6   0.000       0.196       0.000       0.471       0.984
Data vis. 2   Data vis. 1   0.938       0.792       0.008       0.000       0.770
              Data vis. 3   0.123       0.387       1.000       0.019       1.000
              Data vis. 4   0.000       1.000       0.783       0.007       0.203
              Data vis. 5   0.000       0.630       0.999       0.000       1.000
              Data vis. 6   0.000       0.914       0.979       0.005       0.988
Participants answer the self-assessment ratings after interacting with the data visualizations. A list of responses to the data visualizations is shown in Table 5. Participants’ responses to data visualizations differ depending on their emotions and understanding of the information.
Table 5. SAM responses of five participants for six data visualizations
              Subject 1         Subject 2         Subject 3         Subject 4         Subject 5
Stimuli       Valence Arousal   Valence Arousal   Valence Arousal   Valence Arousal   Valence Arousal
Data vis. 1   1.39    2.90      7.17    6.16      8.17    7.59      4.13    5.04      5.01    4.02
Data vis. 2   2.00    2.15      3.01    1.99      1.65    6.33      6.02    6.01      5.99    5.99
Data vis. 3   2.97    2.98      6.05    5.19      7.28    7.24      2.97    1.99      2.94    2.99
Data vis. 4   3.99    3.41      8.49    8.25      6.72    4.62      3.95    4.04      5.01    5.97
Data vis. 5   1.23    1.27      8.41    8.51      3.85    3.76      3.01    3         5.99    6.99
Data vis. 6   4.01    2.98      7.14    7.19      7.44    7.24      4.05    4.99      3       3
Based on the self-ratings provided in Table 5, Subject 1 rates all data visualizations very similarly. On the other hand, the ANOVA results for SampEn and ApEn indicate that the response of Subject 1 to the six stimuli differs. Furthermore, Subject 4 rated the second data visualization as more attractive and emotional, and the entropy analysis supports this with p-values below 0.05 for several pairwise comparisons involving the second stimulus (Tables 3 and 4). Figure 6 shows the mean sample entropies of the six data visualizations for subject 2. While the first stimulus has a higher entropy mean, the third stimulus has the lowest. The SAM ratings indicate that subject 2 finds the fourth and fifth data visualizations more appealing. In conclusion, the SAM ratings and the entropy analysis agree in some cases and diverge in others. It is imperative to expand the dataset and apply AI techniques to the data to generate more reliable results. The analysis results demonstrate that the proposed approach can be utilized to select the best energy data visualization for individuals. Moreover, the results of this study demonstrate that EEG data analysis can be used to analyze the reactions of energy users to different data visualizations, and the ANOVA showed that this method effectively detects differences. These findings establish a novel and promising method to improve the energy users’ experience.
[Figure: mean of entropy values (vertical axis, 0.85 to 1.2) for Stimuli 1 to 6 (horizontal axis)]
Fig. 6. The mean SampEn values of subject 2 for six data visualizations
5 Discussion
ApEn can categorize datasets with at least 1000 values. SampEn is an improved version of ApEn. Unlike ApEn, SampEn is largely independent of the dataset length and remains consistent. In this study, there are fewer than 1000 data points. The SampEn results show better performance than the ApEn results for all subjects. The SampEn results for subject 1 show that the six data visualizations are separated from each other. It is deduced from these results that individual reactions to data visualizations can be extracted from EEG data. The comparison with participants’ self-assessment ratings shows that participants perceive the data visualizations differently. Some participants are less interested in all data visualizations than others. This is supported by their self-ratings and their background information. The same energy data is used in each stimulus so that only individual perspectives on the different graphs are evaluated. The way in which subjects view graphs affects their understanding of the information. Although all graphs contain the same data, the answers vary due to the emotional differences and backgrounds of the subjects. There is a correlation between self-assessment ratings and the statistical analysis of EEGs. One conclusion from the obtained results is that the discrimination between graphs is not perfect. The data visualizations need to be improved for more precise results. One way to do that may be to add tips to the data visualizations to capture individuals’ attention and aid understanding. Furthermore, motivation plays a significant role in the results of the experiment. Participants may rate their emotions low or high on some graphs due to low motivation and concentration during the experiment. In addition, all participants reported feeling sleepy after the experiment, and all of them tried not to fall asleep during it. This is another important factor that affects the results.
To explore the separation of data visualizations from EEG signals using descriptive and inferential statistical analysis, we compared the performance of SampEn and ApEn. Despite the excellent results in separating participants’ reactions, this is not enough to conclude that SampEn and ApEn are perfect for discriminating differences. It is, however, good evidence that these features merit further investigation. Further research is needed to improve the utility of the study and the EEG classification. The next step is to increase the number of participants, data visualizations and features, and to perform EEG classification with different classifiers.
6 Conclusion
Emotional responses of individuals to stimuli can be detected by analysing EEG signals. In this study, we therefore developed an energy data visualization selection system based on EEG signals, using the SampEn and ApEn features for statistical analysis of 4 channels and 6 data visualizations. Experimental results show that the ANOVA test finds significant differences at a level below 0.05 using the SampEn and ApEn features. Tukey’s HSD test is applied to show where these differences come from, and it shows that, for every subject, some data visualizations differ. It is hence concluded that entropy features can effectively distinguish emotional reactions to energy data visualizations from EEG signals. Future work includes increasing the number of visualizations and participants, adding more entropy features along with other common EEG data analysis features, and implementing an algorithm to consolidate the method with the expanded dataset. The proposed approach may provide a new and promising method for selecting energy data visualizations for individuals.
Acknowledgment. We would like to thank The Republic of Turkey, Ministry of National Education for supporting this work through their scholarship scheme.
References
1. Foster, D., Lawson, S., Blythe, M., Cairns, P.: Wattsup? motivating reductions in domestic energy consumption using social networks. In: Proceedings of the 6th Nordic Conference on Human-Computer Interaction: Extending Boundaries, pp. 178–187. New York, NY, USA (2010)
2. Froehlich, J., Findlater, L., Landay, J.: The design of eco-feedback technology. In: Proceedings of the 28th International Conference on Human Factors in Computing Systems, p. 1999, CHI ’10. Atlanta, Georgia, USA (2010)
3. Strengers, Y.A.A.: Designing eco-feedback systems for everyday life. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2135–2144. Association for Computing Machinery, New York, NY, USA (2011)
4. Lee, S., Kim, S.-H., Hung, Y.-H., Lam, H., Kang, Y.-A., Yi, J.S.: How do people make sense of unfamiliar visualizations?: A grounded model of novice’s information visualization sensemaking. IEEE Trans. Visual Comput. Graphics 22(1), 499–508 (2016)
5. Heer, J., Bostock, M.: Crowdsourcing graphical perception: using mechanical turk to assess visualization design. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 203–212. Association for Computing Machinery, New York, NY, USA (2010)
6. Boy, J., Rensink, R.A., Bertini, E., Fekete, J.-D.: A Principled way of assessing visualization literacy. IEEE Trans. Visual Comput. Graphics 20(12), 1963–1972 (2014)
7. Seal, A., et al.: An EEG database and its initial benchmark emotion classification performance. Comput. Math. Methods Med. 2020 (2020)
8. Petroff, O.A., Spencer, D.D., Goncharova, I.I., Zaveri, H.P.: A comparison of the power spectral density of scalp EEG and subjacent electrocorticograms. Clin. Neurophysiol. 127(2), 1108–1112 (2016)
9. Kim, M.-K., Kim, M., Oh, E., Kim, S.-P.: A review on the computational methods for emotional state estimation from the human EEG. Comput. Math. Methods Med. 2013 (2013)
10. Zangeneh Soroush, M., Maghooli, K., Setarehdan, S.K., Motie Nasrabadi, A.: A review on EEG signals based emotion recognition. Int. Clin. Neurosci. J. 4(4), 118–129 (2017)
11. Suhaimi, N.S., Mountstephens, J., Teo, J.: EEG-based emotion recognition: a state-of-the-art review of current trends and opportunities. Comput. Intell. Neurosci. 2020 (2020)
12. Katsigiannis, S., Ramzan, N.: DREAMER: a database for emotion recognition through EEG and ECG signals from wireless low-cost off-the-shelf devices. IEEE J. Biomed. Health Inform. 22(1), 98–107 (2018)
13. Koelstra, S., et al.: DEAP: a database for emotion analysis; using physiological signals. IEEE Trans. Affect. Comput. 3(1), 18–31 (2012)
14. Cheng, J., et al.: Emotion recognition from multi-channel EEG via deep forest. IEEE J. Biomed. Health Inform. 25(2), 453–464 (2021)
15. Tao, W., et al.: EEG-based emotion recognition via channel-wise attention and self attention. IEEE Trans. Affect. Comput. 1 (2020)
16. Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M.: A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3(1), 42–55 (2012)
17. Issa, S., Peng, Q., You, X.: Emotion classification using EEG brain signals and the broad learning system. IEEE Trans. Syst. Man Cybern. Part A Syst. 51, 7382–7391 (2020)
18. Yin, Z., Liu, L., Chen, J., Zhao, B., Wang, Y.: Locally robust EEG feature selection for individual-independent emotion recognition. Expert Syst. Appl. 162, 113768 (2020)
19. Zhang, W.: Selecting transferrable neurophysiological features for inter-individual emotion recognition via a shared-subspace feature elimination approach. Comput. Biol. Med. 123, 103875 (2020)
20. Herrmann, M.R., Brumby, D.P., Cheng, L., Gilbert, X.M.P., Oreszczyn, T.: An empirical investigation of domestic energy data visualizations. Int. J. Hum. Comput. Stud. 152, 102660 (2021)
21. Spangher, L., Tawade, A., Devonport, A., Spanos, C.: Engineering vs. ambient type visualizations: quantifying effects of different data visualizations on energy consumption. In: Proceedings of the 1st ACM International Workshop on Urban Building Energy Sensing, Controls, Big Data Analysis, and Visualization, pp. 14–22. New York, NY, USA (2019)
22. Al-Kababji, A., et al.: Energy data visualizations on smartphones for triggering behavioral change: novel vs. conventional. In: 2020 2nd Global Power, Energy and Communication Conference (GPECOM), pp. 312–317 (2020)
23. Soleymani, M., Asghari-Esfeden, S., Fu, Y., Pantic, M.: Analysis of EEG signals and facial expressions for continuous emotion detection. IEEE Trans. Affect. Comput. 7(1), 17–28 (2016)
24. Russell, J.A., Mehrabian, A.: Evidence for a three-factor theory of emotions. J. Res. Pers. 11(3), 273–294 (1977)
25. Morris, J.D.: Observations: SAM: the self-assessment manikin: an efficient cross-cultural measurement of emotional response. J. Advert. Res. 35(6), 63–68 (1995)
26. UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. Accessed 18 Nov. 2021
27. Dennis, T.A., Solomon, B.: Frontal EEG and emotion regulation: electrocortical activity in response to emotional film clips is associated with reduced mood induction and attention interference effects. Biol. Psychol. 85(3), 456–464 (2010)
28. F Test - an overview | ScienceDirect Topics: https://www.sciencedirect.com/topics/mathematics/f-test. Accessed 18 Nov. 2021
29. Abdi, H., Williams, L.J.: Tukey’s honestly significant difference (HSD) test. Encycl. Res. Des. 3(1), 1–5 (2010)
30. Pincus, S.M.: Approximate entropy as a measure of system complexity. PNAS 88(6), 2297–2301 (1991)
31. Richman, J.S., Moorman, J.R.: Physiological time-series analysis using approximate entropy and sample entropy. Am. J. Physiol. Heart Circ. Physiol. 278(6), H2039–H2049 (2000)
32. Zhang, T., Chen, J., He, E., Wang, H.: Sample-entropy-based method for real driving fatigue detection with multichannel electroencephalogram. Appl. Sci. 11(21), Art. no. 21 (2021)
33. Patel, P., Raghunandan, R., Annavarapu, R.N.: EEG-based human emotion recognition using entropy as a feature extraction measure. Brain Inf. 8(20), 1–13 (2021)
34. Peirce, J.W., Gray, J.R., Simpson, S., MacAskill, M.R., Höchenberger, R., Sogo, H., Kastman, E., Lindeløv, J.: PsychoPy2: experiments in behavior made easy. Behav. Res. Method. 51, 195–203 (2019)
35. EMOTIV EPOC Flex Gel Kit: https://emotiv.gitbook.io/epoc-flex-user-manual/. Accessed 18 Nov. 2021
36. Alsalemi, A., Bensaali, F., Amira, A., Fetais, N., Sardianos, C., Varlamis, I.: Smart energy usage and visualization based on micro-moments. In: Bi, Y., Bhatia, R., Kapoor, S. (eds.) IntelliSys 2019. AISC, vol. 1038, pp. 557–566. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-29513-4_41
37. Alsalemi, A., et al.: Achieving domestic energy efficiency using micro-moments and intelligent recommendations. IEEE Access 8, 15047–15055 (2020)
38. Hori, S., Kondo, K., Nogata, D., Ben, H.: The determinants of household energy-saving behavior: Survey and comparison in five major Asian cities. Energy Policy 52, 354–362 (2013)
GCANet: A Cross-Modal Pedestrian Detection Method Based on Gaussian Cross Attention Network
Peiran Peng1, Feng Mu1, Peilin Yan1, Liqiang Song2, Hui Li2, Yu Chen3, Jianan Li1, and Tingfa Xu1(B)
1 Beijing Institute of Technology, Beijing, China
ciom [email protected]
2 National Astronomical Observatories, CAS, Beijing, China
3 Science and Technology on Near-Surface Detection Laboratory, Wuxi 214035, China
Abstract. Pedestrian detection is a critical but challenging research field widely applicable in self-driving, surveillance and robotics. The performance of pedestrian detection is not ideal under limited imaging conditions, especially at night or under occlusion. To overcome these obstacles, we propose a cross-modal pedestrian detection network based on Gaussian Cross Attention (GCANet), which improves detection performance by making full use of multi-modal features. Through the bidirectional coupling of local features of different modalities, feature interaction and fusion between the modalities are realized, and the salient features across modalities are effectively emphasized, thus improving the detection accuracy. Experimental results demonstrate that GCANet achieves the highest accuracy among state-of-the-art methods on the KAIST multi-modal pedestrian dataset.
Keywords: Pedestrian detection · Multi-modal fusion · Gaussian Cross Attention
1 Introduction
Pedestrian detection is a critical research field widely applicable in self-driving, surveillance and robotics. In recent years, advances in detection algorithms have improved safety guarantees. Pedestrian detection has achieved excellent results on visible images and videos [1,6]. However, visible images still pose challenges such as low resolution, occlusion, poor contrast and bad lighting conditions, which limit the accuracy of pedestrian detection. The accuracy can therefore be effectively improved by using infrared or other thermal images in low-illumination conditions such as night time [16].
Thermal images eliminate the limitations of visible images in poor light, bad weather and other conditions. Thermal cameras detect objects based on infrared radiation, and pedestrians can be identified with ease because of the significant thermal difference between people and the surrounding environment. However, during the daytime, it is difficult to distinguish humans from distracting objects, because the temperature of the surroundings is similar to that of the pedestrian in the thermal image. It can therefore be concluded that thermal images are more suitable at nighttime, while visible images are more suitable at daytime for pedestrian detection.
In recent years, various multi-modal pedestrian detection methods [11,12,14] have been proposed, proving that visible images at nighttime and thermal images at daytime are also helpful for improving the accuracy of pedestrian detection. Accurate pedestrian detection models [7,17] are built by fusing multi-modal features under different lighting conditions (day and night) with convolutional neural networks (CNNs). Although existing CNN-based fusion strategies enhance the expression of pedestrian features in images by learning local features, they lack the ability to extract long-range dependencies from images. This causes the loss of some essential global context that might be useful for pedestrian detection. IFT [21] trains an auto-encoder to extract multi-scale features from color-thermal images and proposes a Spatio-Transformer fusion strategy to obtain salient features, which improves the generalization ability. We therefore argue that integrating local features with long-range dependencies can add global contextual information, which in turn helps to enhance the fusion of salient features from multi-modal images and further improves detection accuracy. With this motivation, we propose a cross-modal pedestrian detection network based on Gaussian Cross Attention (GCANet) to improve detection accuracy by enhancing the object expression ability of the fused features.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 520–530, 2022. https://doi.org/10.1007/978-3-031-10464-0_35
Fig. 1. The motivation for this model. A cross-modal pedestrian detection network based on a Gaussian Transformer Encoder (GCANet) is proposed to enhance the fusion of salient features from multiple modalities, which further improves the performance of pedestrian detection. The yellow boxes are the ground truth and the red boxes are the detection results of GCANet.
The framework of GCANet is shown in Fig. 1 and consists of three successive networks. First, pairs of visible and infrared images are fed into a feature extraction network to obtain the multi-modal features separately. Then, salient fusion features are obtained by a Gaussian Transformer Encoder that combines the transferred features. Finally, the features containing local and global information are fed into the detection head to detect pedestrians. Through the bidirectional coupling of the local features of the different modalities, feature interaction and fusion between the modalities are realized, and the salient features across modalities are effectively emphasized. To validate its effectiveness, the proposed network is compared with state-of-the-art methods on the reasonable KAIST pedestrian dataset [13]. Experimental results show that the proposed method effectively improves the accuracy of cross-modal pedestrian detection: the miss rate (MR) and mean Average Precision (mAP) of our method improve by at least 1.1 and 2.5 points, respectively, over the baseline and the other methods. To sum up, the contributions of this work are as follows:
– We propose a novel cross-modal pedestrian detection method that utilizes fused salient features from multi-modal images to overcome the limitations imposed by complex backgrounds, low resolution and low contrast.
– The proposed method utilizes a novel Gaussian Cross Attention (GCA) fusion strategy, in which a Gaussian matrix operation and a cross-attention structure are employed to emphasize the local features while utilizing global features to obtain better fusion features.
– The proposed method is evaluated on the reasonable KAIST multi-modal pedestrian dataset [13], where we reach competitive results compared with existing pedestrian detection methods based on multi-modal feature fusion.
2 Related Work
2.1 Multispectral Pedestrian Detection
Since the release of the KAIST Multispectral Pedestrian Benchmark [10], there has been increasing interest in pedestrian detection using aligned color and thermal images. The initial baseline, ACF+T+THOG, extended aggregated channel features (ACF) [5] with thermal channels. In [22], ACF+T+THOG was used to generate region proposals, which were then re-scored by a convolutional network. A separate RPN [3] was used to generate color and thermal proposals, which were then evaluated by support vector regression (SVR). Fusion RPN+BF [12] was proposed to detect multispectral pedestrians by building on RPN+BF. MSDS-RCNN [13] proposed a two-stage multispectral fusion pedestrian detection network based on CNN and found that the fusion of multi-modal features can effectively improve detection accuracy. In this work, we introduce a cross-modal fusion pedestrian detection method based on a single-stage network and obtain more competitive results.
2.2 Fusion with Attention
Attention [20] has been widely used in many different forms to enhance feature representations. SENet [9] uses channel attention, CBAM [24] additionally incorporates spatial attention, and the non-local network [23] combines CNNs with self-attention. Inspired by these methods, we treat the features extracted from the multi-modal source images as different transformed representations for salient feature enhancement, and we propose a cross-modal feature fusion module based on cross-attention.
3 Method
3.1 Overview of GCANet
An overview of the proposed GCANet is shown in Fig. 2. The network architecture is primarily based on three consecutive modules, responsible for feature extraction, multi-modal feature fusion and detection, respectively. Take infrared-visible image pairs as input, for example. The Backbone is the first module; it contains a deep convolutional network that extracts visible and infrared feature maps separately and is trained in a parameter-sharing manner. The Gaussian Transformer Encoder (GTE), an attention-based module, is the second; it takes pairs of multi-modal feature maps as input and outputs several high-quality fused feature maps, learning Gaussian parameters to enhance the features while fusing the multi-modal features transferred from the Backbone. The enhanced feature maps are passed to the final Detection Head to obtain the detection results: object categories and the corresponding bounding boxes with associated scores. In the subsequent sections, we describe the details of the proposed network.
3.2 Backbone
Fig. 2. Network structure of the proposed GCANet. The network consists of three successive modules. First, pairs of visible and infrared images are fed into a feature extraction backbone to obtain color-thermal features (C features and T features) separately. Then, salient fusion features are obtained by a Gaussian Transformer Encoder (GTE) that combines the transferred features. Finally, the features containing local and global information are fed into the detection head to locate pedestrians.
The aim of the backbone is to extract features from visible and infrared images separately in a parameter-sharing manner. Input image pairs are denoted as $R_1, \ldots, R_k$ and $T_1, \ldots, T_k$ for visible and infrared (color-thermal) images, respectively. We assume that each pair was pre-registered using the method of [25]. Starting from the initial pair of images, $C_0 \in \mathbb{R}^{3 \times H_0 \times W_0}$ (the visible image with three channels) and $T_0 \in \mathbb{R}^{3 \times H_0 \times W_0}$ (the infrared image with three channels), a CNN-based backbone generates a pair of lower-resolution feature maps $C_0 \in \mathbb{R}^{C \times H \times W}$ and $T_0 \in \mathbb{R}^{C \times H \times W}$. We adopt the ResNet series [8] as our backbone, and all models are pretrained on ImageNet [4]. The output of the backbone is the C5 feature map, which has 2048 channels and a downsampling rate of 32 ($C = 2048$, $H = H_0/32$, $W = W_0/32$).
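As a quick check of the shapes stated above, the following minimal sketch computes the size of the C5 feature map for a given input size. The function name is hypothetical, and rounding up for sizes that are not multiples of 32 is an assumption about how padded stride-2 stages behave, not something the text specifies.

```python
def c5_feature_shape(height, width, channels=2048, stride=32):
    """Return (C, H, W) of the C5 feature map for an H0 x W0 input image."""
    # Ceil division: assumed behaviour for inputs that are not multiples of 32.
    out_h = -(-height // stride)
    out_w = -(-width // stride)
    return channels, out_h, out_w


# A 640 x 480 KAIST frame (width x height) gives a 20 x 15 map with 2048 channels.
print(c5_feature_shape(480, 640))
```

Each spatial position of this map therefore summarizes a 32 × 32 pixel region of the input, which is the granularity at which the fusion module operates.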
3.3 Gaussian Transformer Encoder
GTE is a transformer model used to enhance salient object features while fusing deep features. To learn multi-modal feature representations with lightweight computation and memory, we introduce a Gaussian cross-attention (GCA) module that fuses the corresponding deep infrared and visible features.
Fig. 3. The network structure of the proposed GTE. The module consists of gaussian cross attention network and several 1 × 1 convolutional layers to fuse multi-modal features by integrating local features with long-range dependencies.
As shown in Fig. 3, a pair of infrared and visible features $T_0, C_0 \in \mathbb{R}^{C \times H \times W}$ is given, and the two are processed in separate branches. The module first applies a convolutional layer with 1 × 1 filters to $T_0$ in the infrared branch to generate the feature query map $Q \in \mathbb{R}^{C' \times H \times W}$, where $C'$ is the number of channels, which is smaller than $C$ for dimension reduction. The same operation is performed in the visible branch to obtain the feature value map $V \in \mathbb{R}^{C' \times H \times W}$. To reduce the influence of background occlusion in visible images, a convolutional layer with 1 × 1 filters is applied to $C_0$ in the visible branch, followed by a product with a learnable Gaussian matrix, to generate the Gaussian key map $K \in \mathbb{R}^{C' \times H \times W}$. Each point on the spatial dimension of the feature map $R_0$ can be treated as a vector $r_i \in \mathbb{R}^{C'}$ with the coordinate $(x_0, y_0)$, so that $R_0$ can be expressed as $[r_1, \ldots, r_{WH}]$. The Gaussian matrix operation is defined as follows:

$$G(R_0) = [W_{r_1} r_1, \ldots, W_{r_i} r_i, \ldots, W_{r_{WH}} r_{WH}], \tag{1}$$

$$W_{r_i} = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}\right), \tag{2}$$

where $W_{r_i}$ is the Gaussian weight coefficient, $x, y$ are the coordinates of $r_i$, and $\sigma^2$ is a learnable parameter used to adjust the clustering scale. After obtaining the feature maps $Q$, $K$ and $V$, we generate the attention maps $A \in \mathbb{R}^{WH \times H \times W}$ via an affinity operation. Each point on the spatial dimension of the feature map $Q$ can be treated as a vector $Q_i \in \mathbb{R}^{C'}$, and the set $K_i \in \mathbb{R}^{WH \times H \times W}$ is obtained by taking the dot product of $Q_i$ with $K$ in the spatial domain. A softmax layer is then applied to calculate the attention map $A$. The affinity operation is defined as follows:

$$A(Q_i, K_i) = \mathrm{softmax}\left(\frac{Q_i \cdot K_{i,j}^{T}}{\sqrt{WH}}\right), \quad i, j = 1, \ldots, WH, \tag{3}$$

where $K_{i,j}$ denotes the $j$th element of $K_i$. Mathematically, the GCA can be expressed as:

$$\mathrm{GCA}(T_0, C_0) = A(Q, K) \times V. \tag{4}$$

To better optimize the deep network, a residual connection and Layer Normalization (LN) are added at the end of the network. The feature map $H$ after GTE can be expressed as follows:

$$H = C_0 + \mathrm{LN}(\mathrm{GCA}(T_0, C_0)). \tag{5}$$
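The flow of Eqs. (1)–(5) can be illustrated with a small dependency-free sketch. It is not the authors' implementation and rests on several assumptions: the 1 × 1 convolutions are omitted (so the queries are the thermal features and the values are the color features, with equal channel counts), the Gaussian is centred on the middle of the feature map, $\sigma$ is a fixed number rather than a learned parameter, and LN has no learnable scale or shift. All function names are hypothetical.

```python
import math


def gaussian_weights(height, width, sigma, center):
    """Eq. (2): one Gaussian weight per spatial position, in row-major order."""
    x0, y0 = center
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    return [
        norm * math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2.0 * sigma ** 2))
        for y in range(height)
        for x in range(width)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def layer_norm(vec, eps=1e-5):
    """Layer normalization without learnable parameters (an assumption)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def gca(thermal, color, height, width, sigma=1.0):
    """Fuse two lists of H*W feature vectors (row-major) following Eqs. (1)-(5)."""
    n = height * width
    center = ((width - 1) / 2.0, (height - 1) / 2.0)  # assumed reference point
    weights = gaussian_weights(height, width, sigma, center)
    queries = thermal  # stands in for the 1x1 convolution on T0
    values = color     # stands in for the 1x1 convolution on C0
    keys = [[w * c for c in vec] for w, vec in zip(weights, color)]  # Eq. (1)
    fused = []
    for q, c in zip(queries, color):
        attn = softmax([dot(q, k) / math.sqrt(n) for k in keys])  # Eq. (3)
        mixed = [sum(a * v[d] for a, v in zip(attn, values))
                 for d in range(len(c))]                           # Eq. (4)
        fused.append([ci + li for ci, li in zip(c, layer_norm(mixed))])  # Eq. (5)
    return fused


# Toy 2 x 2 feature maps with 3 channels each (made-up numbers).
thermal = [[1.0, 0.0, 0.5], [0.2, 0.1, 0.0], [0.0, 1.0, 0.3], [0.4, 0.4, 0.4]]
color = [[0.5, 0.5, 0.0], [1.0, 0.2, 0.1], [0.3, 0.3, 0.9], [0.0, 0.7, 0.2]]
fused = gca(thermal, color, height=2, width=2)
print(len(fused), len(fused[0]))
```

Because every query attends to all $WH$ key positions, each fused vector mixes information from the whole map, which is the long-range dependency the module is meant to add on top of the local CNN features.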
3.4 Detection Head
The detection head consists of two parallel branches, a classification branch and a regression branch, with a different number of convolution blocks in each. There are two blocks in the classification branch and four in the regression branch; each block contains a convolution layer, a batch normalization layer and a ReLU layer. In addition, we follow AutoAssign [26] and add an implicit prediction of object probability to the regression branch; the final classification result is multiplied by this implicit prediction.
3.5 Loss Function
The proposed method is trained to capture both local and long-range dependencies, generating sharper content and preserving most of the visual information from the multi-modal features. To detect pedestrians more accurately, we train GCANet by minimizing the loss function $L$:

$$L = L_{cls} + \lambda L_{reg}, \tag{6}$$

which is a weighted combination of the classification loss $L_{cls}$ and the bounding box regression loss $L_{reg}$ with weight $\lambda$. $L_{cls}$ is computed with the focal loss [15] and $L_{reg}$ with the generalized IoU loss [19].
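A minimal sketch of Eq. (6) for a single prediction is given below. The focal loss parameters `alpha=0.25` and `gamma=2.0` are the defaults reported in [15]; the value of $\lambda$ is not given in the text, so `lam=1.0` is only a placeholder, and the function names are hypothetical. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with positive area.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted foreground probability p."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def giou(box_a, box_b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def detection_loss(p, target, pred_box, gt_box, lam=1.0):
    """Eq. (6): L = L_cls + lambda * L_reg, with L_reg = 1 - GIoU."""
    return focal_loss(p, target) + lam * (1.0 - giou(pred_box, gt_box))


print(detection_loss(0.8, 1, (10, 10, 50, 90), (12, 8, 52, 92)))
```

Unlike plain IoU, the GIoU term stays informative for non-overlapping boxes: it becomes negative and shrinks further as the boxes move apart, so the regression loss still provides a gradient.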
4 Experiments
To evaluate the performance of the proposed network, we conducted comparative experiments on the KAIST dataset [10] against four mainstream pedestrian detection methods. Experimental results show that our network achieves the best performance on the KAIST dataset. In the following subsections, we first introduce the dataset and evaluation metrics, then describe the parameter settings of the network, and finally report a series of experimental results on the KAIST dataset.
4.1 Dataset and Evaluation Metrics
KAIST is a multi-modal pedestrian dataset that includes a total of 95,328 images of size 640 × 480, each available in a visible and an infrared version, with a total of 103,128 dense annotations. The dataset captures a variety of conventional traffic scenes, including campus, street and rural areas, during the day and at night. As the original dataset is composed of continuous frames and some annotations are incorrect, we used the reasonable data cleaned by the method of [13], which samples images every 2 frames from the training videos, excludes heavily occluded, truncated and small (less than 50 pixels) pedestrian instances, and re-labels part of the annotations. This yields 6.6k/1k/2.2k images for training, validation and testing. For evaluation, we strictly follow the reasonable setting provided by the KAIST benchmark and adopt the miss rate (MR) and the standard COCO metric mean Average Precision (mAP) to evaluate the effectiveness of our network.
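The text does not spell out how the miss rate is aggregated. A commonly used definition in pedestrian detection benchmarks (see [6]) is the log-average miss rate, written MR−2: the miss rate is sampled at nine false-positives-per-image (FPPI) values evenly spaced in log-space over [10^-2, 10^0] and the geometric mean is taken. The sketch below assumes that definition, assumes the curve is given as ascending FPPI values with matching miss rates, and counts a reference point the curve never reaches as a complete miss; the function name is hypothetical.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points."""
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]
    sampled = []
    for ref in refs:
        # Miss rate at the largest FPPI that does not exceed the reference.
        reached = [mr for f, mr in zip(fppi, miss_rate) if f <= ref]
        # Assumed convention: an unreached reference counts as miss rate 1.0.
        sampled.append(reached[-1] if reached else 1.0)
    return math.exp(sum(math.log(max(mr, 1e-10)) for mr in sampled) / num_points)


# Made-up curve: the miss rate falls as more false positives are tolerated.
fppi = [0.005, 0.02, 0.1, 0.5, 1.0]
miss = [0.40, 0.25, 0.15, 0.10, 0.08]
print(round(100 * log_average_miss_rate(fppi, miss), 2))
```

Lower values are better for this metric, whereas higher values are better for mAP.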
4.2 Implementation Details
The proposed network is implemented in the PyTorch framework. The model is trained on one GPU with 4 images per mini-batch. We train our network using SGD with a momentum of 0.9 and a weight decay of 0.0001. All models are trained with an initial base learning rate of 0.015, which is divided by 10 after 4 epochs; training terminates after 6 epochs. The backbone is initialized with a ResNet50 model pretrained on the ImageNet dataset [4] and, following DETR [2], uses a smaller learning rate, set to 1/3 of the base learning rate. To stabilize training at the beginning, we extend the number of warmup iterations from 500 to 1500. For other hyperparameters, we follow the settings of RetinaNet [15]. We consider the ‘person’, ‘person?’ and ‘people’ categories in the KAIST dataset as foreground, and the remaining classes as background.
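The schedule described above can be written as a small function of the epoch and the global iteration. This is a sketch under assumptions: the text does not state the shape of the warmup (a linear ramp is assumed here) or whether the step decay also applies to the backbone (assumed yes). The function name and argument names are hypothetical.

```python
def learning_rate(epoch, iteration, base_lr=0.015, warmup_iters=1500,
                  decay_epoch=4, backbone=False):
    """Learning rate for a 0-based epoch index and 0-based global iteration."""
    # The backbone uses 1/3 of the base learning rate.
    lr = base_lr / 3.0 if backbone else base_lr
    # Step decay: divide by 10 once 4 epochs have been completed.
    if epoch >= decay_epoch:
        lr /= 10.0
    # Linear warmup over the first 1500 iterations (assumed shape).
    if iteration < warmup_iters:
        lr *= (iteration + 1) / warmup_iters
    return lr


TOTAL_EPOCHS = 6  # training terminates after 6 epochs
for epoch in range(TOTAL_EPOCHS):
    # 1650 iterations per epoch matches 6.6k training images at batch size 4.
    step = epoch * 1650
    print(epoch, learning_rate(epoch, step), learning_rate(epoch, step, backbone=True))
```

With 6.6k training images and 4 images per mini-batch, the 1500 warmup iterations cover most of the first epoch.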
4.3 Comparisons with State-of-the-Art
We evaluate the proposed network on the test set of KAIST and compare it with ACF [25], Fusion RPN+BF [12], YOLOv3 [18] and MSDS-RCNN [13]. From Table 1, we can observe that the proposed method outperforms the existing methods in multi-modal fusion pedestrian detection. Compared with the other methods, our method captures both local and long-range dependencies, generating sharper content and preserving most of the visual information. Detection results of YOLOv3 [18], MSDS-RCNN [13] and our method are compared in Fig. 4. The blue boxes represent the ground truth (GT), the green boxes the true positives (TP) and the red boxes the false positives (FP). The rows of images are, in order, daytime visible images, daytime thermal images, night visible images and night thermal images. The results illustrate that our method can effectively improve the pedestrian detection accuracy on daytime thermal images and night visible images.
Fig. 4. Pedestrian detection result examples comparing the proposed method with YOLOv3 [18] and MSDS-RCNN [13] in daytime and nighttime. Panels: (a) GT, (b) YOLOv3, (c) MSDS-RCNN, (d) Ours. The blue box represents ground truth, the green box represents true detection, and the red box represents false detection.
Table 1. Comparison experimental results on reasonable KAIST pedestrian dataset.
Method | MR−2 (IoU = 0.5): All | Day | Night | MR−2 (IoU = 0.75): All | Day | Night
ACF [25] | 47.30 | 42.57 | 56.2 | 88.8 | 87.7 | 91.2
Fusion RPN+BF [12] | 25.80 | 24.88 | 26.6 | 73.00 | 68.1 | 81.4
MSDS-RCNN [13] | 11.30 | 10.53 | 12.90 | 70.6 | 67.4 | 79.3
YOLOv3 [18] | 25.50 | 18.20 | 27.90 | 75.50 | 69.40 | 78.40
GCANet | 10.20 | 9.70 | 11.10 | 69.20 | 66.70 | 72.10
4.4 Ablation Study
An ablation experiment is conducted on GCANet. We first test the effectiveness of the functional modules by incorporating them incrementally into the baseline model [18]; the results are presented in Table 2. The improvement in the metrics after adding Gaussian Cross-Attention is evidence that the multi-modal fusion obtained by integrating long-range dependencies and local features can effectively improve detection accuracy.
Table 2. Effectiveness of functional modules.
Method | Detection head | GCA | MR−2 (IoU = 0.5) | mAP (IoU = 0.5)
Baseline [18] | | | 25.5 | 36.7
 | ✓ | | 22.7 | 37.1
 | ✓ | ✓ | 10.2 | 39.2
We then tested how the choice of source images (visible images, thermal images and fused images, respectively) affects the detection results of the proposed method; the results are given in Table 3.
Table 3. Effectiveness of the score fusion scheme.
Method | Source image: Color | Thermal | MR−2 (IoU = 0.5): All | Day | Night
GCANet | ✓ | | 24.1 | 17.3 | 34.9
GCANet | | ✓ | 20.4 | 24.8 | 15.3
GCANet | ✓ | ✓ | 10.2 | 9.7 | 11.1
5 Conclusion
In this paper, we propose a novel and effective deep learning architecture based on CNN and Transformer for the multi-modal feature fusion problem. The proposed network consists of three parts: a backbone network, a Gaussian Transformer encoder and a detection head. First, the source images are fed into the backbone network, which extracts the multi-modal feature maps separately. The Gaussian Transformer encoder then generates enhanced feature maps from the multi-modal features; these contain the significant information of the source images. Finally, the detection head locates and classifies objects according to the enhanced feature maps. Experimental results on the reasonable KAIST dataset show that our network architecture achieves competitive results when applied to visible-infrared fusion for pedestrian detection.
Acknowledgments. This research is supported under Grant 202020429036.
References
1. Cai, Z., Saberian, M., Vasconcelos, N.: Learning complexity-aware cascades for deep pedestrian detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3361–3369 (2015)
2. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 213–229. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_13
3. Choi, H., Kim, S., Park, K., Sohn, K.: Multi-spectral pedestrian detection based on accumulated object proposal with fully convolutional networks. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 621–626. IEEE (2016)
4. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
5. Dollár, P., Appel, R., Belongie, S., Perona, P.: Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 36(8), 1532–1545 (2014)
6. Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: an evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 34(4), 743–761 (2012)
7. Guan, D., Cao, Y., Yang, J., Cao, Y., Yang, M.Y.: Fusion of multispectral data through illumination-aware deep neural networks for pedestrian detection. Inf. Fus. 50, 148–157 (2019). https://doi.org/10.1016/j.inffus.2018.11.017
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
9. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
10. Hwang, S., Park, J., Kim, N., Choi, Y., Kweon, I.S.: Multispectral pedestrian detection: benchmark dataset and baseline. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1037–1045 (2015)
11. Kim, M., Joung, S., Park, K., Kim, S., Sohn, K.: Unpaired cross-spectral pedestrian detection via adversarial feature learning. In: 2019 IEEE International Conference on Image Processing (ICIP), pp. 1650–1654 (2019)
12. Konig, D., Adam, M., Jarvers, C., Layher, G., Neumann, H., Teutsch, M.: Fully convolutional region proposal networks for multispectral person detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 49–56 (2017)
13. Li, C., Song, D., Tong, R., Tang, M.: Multispectral pedestrian detection via simultaneous detection and segmentation. arXiv preprint arXiv:1808.04818 (2018)
14. Li, C., Song, D., Tong, R., Tang, M.: Illumination-aware faster R-CNN for robust multispectral pedestrian detection. Pattern Recogn. 85, 161–171 (2019)
15. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
16. Liu, J., Zhang, S., Wang, S., Metaxas, D.N.: Multispectral deep neural networks for pedestrian detection. arXiv preprint arXiv:1611.02644 (2016)
17. Park, K., Kim, S., Sohn, K.: Unified multi-spectral pedestrian detection based on probabilistic fusion networks. Pattern Recogn. 80, 143–155 (2018)
18. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
19. Rezatofighi, H., Tsoi, N., Gwak, J.Y., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666 (2019)
20. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
21. Vs, V., Valanarasu, J.M.J., Oza, P., Patel, V.M.: Image fusion transformer. arXiv preprint arXiv:2107.09011 (2021)
22. Wagner, J., Fischer, V., Herman, M., Behnke, S.: Multispectral pedestrian detection using deep fusion convolutional neural networks. In: ESANN, vol. 587, pp. 509–514 (2016)
23. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)
24. Woo, S., Park, J., Lee, J.-Y., Kweon, I.S.: CBAM: convolutional block attention module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_1
25. Yang, Z., Dan, T., Yang, Y.: Multi-temporal remote sensing image registration using deep convolutional features. IEEE Access 6, 38544–38555 (2018)
26. Zhu, B., et al.: AutoAssign: differentiable label assignment for dense object detection. arXiv preprint arXiv:2007.03496 (2020)
Automatic Classification of Felsic, Mafic, and Ultramafic Rocks in Satellite Images from Palmira and La Victoria, Colombia
Saulo Bosquez1, Germán H. Alférez4(B), Ana María Martínez Ardila2, and Benjamin L. Clausen3
1 Facultad de Ingeniería y Tecnología, Universidad de Montemorelos, Av. Libertad 1300 Poniente, Barrio Matamoros, 67530 Montemorelos, N.L., Mexico [email protected]
2 Department of Earth and Biological Sciences, Loma Linda University, Griggs Hall, 11065 Campus Street, Loma Linda, CA 92350, USA [email protected]
3 Geoscience Research Institute, 11060 Campus Street, Loma Linda, CA 92350, USA [email protected]
4 School of Computing, Southern Adventist University, PO Box 370, Collegedale, TN 37315-0370, USA [email protected]
Abstract. Manually inspecting and analyzing satellite images can lead to numerous errors and is quite time consuming. Our geological contribution is to offer a means for the automatic classification of areas with felsic, mafic, and ultramafic rocks via machine learning using satellite images from Palmira and La Victoria, Colombia. Specifically, this study focuses on two types of satellite images taken from the Earth Observation System (EOS), namely natural color (bands B04 B03 B02) and infrared color vegetation (B08 B04 B03). The following machine learning algorithms were used in this study: Random Forest, K-Nearest Neighbors, Support Vector Machines, Logistic Regression, and Multilayer Perceptron. The model generated with K-Nearest Neighbors performed best for classifying natural color images with an accuracy of 91%, a precision of 87%, and a recall of 88%. Random Forest was the best model for classifying infrared images with an overall accuracy of 83%, a precision of 31%, and a recall of 31%.
Keywords: Machine learning · Geology · Rock classification · Random Forest · K-Nearest Neighbors · Support Vector Machines · Logistic Regression · Multilayer Perceptron · GDAL · OGR · Satellite images · Infrared images
1 Introduction
The study of ophiolites in the Palmira and La Victoria regions of Colombia is indispensable for understanding the formation and accretion of the land that makes up western Colombia and the orogenesis of the western margin of South America during the Cretaceous and Cenozoic. According to [20], ophiolitic type and occurrence in the Colombian Andes have been defined via previous studies. Ophiolitic bodies that have been studied and characterized are: the ultramafic body of Los Azules, El Complejo Ofiolítico de Pácora, the mafic-ultramafic complex of Bolívar-Valle, and the Ginebra Ophiolitic Complex (GOC). For the Amaíme Complex, the Buga Batholith, and the GOC, petrographic and geochemical studies have been done on the banded gabbros in the ophiolitic sequence, and they have been interpreted in conjunction with previously published geochemical analyses of other parts of the ophiolitic sequence [20]. The ultramafic rocks in the study area are represented mainly by pyroxenite and peridotite bodies. The mafic rocks consist mainly of basalt, gabbro, and diorite bodies. Gabbro exists in isotropic and layered forms. The felsic rocks are represented by quartz diorite and tonalite.
These studies have analyzed and classified rock types from a “micro perspective”, i.e., geochemical analysis. However, the Palmira and La Victoria area has not been analyzed from a “macro perspective”, that is, through remote sensing: the analysis of satellite images of the area. Given that classifying and mapping different types of ophiolitic rocks from satellite images can be resource-intensive and time consuming, the contribution of this study is to analyze satellite images from Palmira and La Victoria, Colombia from a macro perspective, using the following machine learning models: Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Logistic Regression (LR), and Multilayer Perceptron (MLP). This study represents the first attempt at mapping an ophiolite zone via remote sensing and machine learning in the western margin of Colombia, and the approach can potentially be applied to the other ophiolites of the Colombian Andes. The algorithms were trained with two types of satellite imagery, namely natural color and infrared color vegetation images.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 531–547, 2022. https://doi.org/10.1007/978-3-031-10464-0_36
On the one hand, a natural color band combination allows ground features to appear in colors similar to their appearance to the human eye; this band combination provides good sediment information. On the other hand, infrared color vegetation is an extremely popular band combination with numerous uses, including studying vegetation, monitoring drainage, and identifying soil patterns and multiple stages of crop growth [1]. Images used in this research were sourced from the Earth Observation System¹, which is an open-source image site. This site provides different options such as band combination, cloud cover, and time frame. The models were evaluated in terms of accuracy, precision, recall, and F1 score.
This paper is structured as follows. The second section presents the underlying concepts of our approach. The third section presents related work. The fourth section presents the methodology. The fifth section presents the results. The sixth section presents the discussion. The last section presents the conclusions and future work.
2 Theoretical Foundation
Our approach is based on the following concepts (see Fig. 1).
Fig. 1. Underpinnings of our approach
Igneous Rocks
Igneous rocks result from cooling and solidification of magma. These rocks can be volcanic or plutonic, depending on whether they solidify quickly at the surface or slowly in the Earth’s crust [23]:
• In a widely accepted silica-content classification scheme, rocks with more than 65% silica are called felsic [9].
• Mafic rocks have a dark mineral content of 50 to 90% [13].
• Ultramafic rocks have a dark mineral content greater than 90% [13].
¹ https://eos.com/products/landviewer/
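The three thresholds listed above can be expressed as a small rule. This is only an illustrative sketch: the function name and the order in which the two criteria are checked are assumptions, and the study itself classifies rocks from satellite imagery rather than from measured compositions.

```python
def rock_class(silica_pct=None, dark_mineral_pct=None):
    """Label a rock from its silica or dark mineral content, given in percent."""
    if dark_mineral_pct is not None:
        if dark_mineral_pct > 90:
            return "ultramafic"  # dark mineral content greater than 90%
        if dark_mineral_pct >= 50:
            return "mafic"       # dark mineral content of 50 to 90%
    if silica_pct is not None and silica_pct > 65:
        return "felsic"          # more than 65% silica
    return "unclassified"        # outside the three ranges given in the text


print(rock_class(dark_mineral_pct=95))
print(rock_class(dark_mineral_pct=70))
print(rock_class(silica_pct=72))
```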
Machine Learning
Machine learning can be defined as the science (and art) of programming computers so they can learn from data. It is the field that gives computers the ability to learn without being explicitly programmed [14]. In this research work, we focus on supervised learning. In supervised learning, the training set that is given to the algorithm(s) includes the desired solutions (i.e., the classes). The classification algorithms that were used here are as follows:
• KNN is a machine learning technique that can be used for both regression and classification tasks. KNN examines the labels of a chosen number of data points surrounding a target data point in order to make a prediction about the class that the data point falls into [17]. KNN is computed at the moment of prediction, and not when first fitting the data into the model. When a new data point arrives, the KNN algorithm starts by finding the nearest neighbors of this new data point. Once it has the values for the neighbors of the data point, it uses them as a prediction for the new data point [19].
• RF combines the output of multiple decision trees to reach a single result. Due to its usability and flexibility, it has become a commonly used algorithm, as it handles both classification and regression problems [8].
• SVM is a relatively simple supervised learning algorithm that is used for classification and/or regression, although it is preferred for classification. SVM finds a hyper-plane that creates a boundary between the types of data. SVMs are a decent and straightforward example of a supervised classification technique for remote sensor data classification [7]. This technique is suitable for distinguishing patterns and objects in both pixel-based and object-based classification.
• LR models a relationship between predictor variables and a categorical response variable. LR helps estimate the probability of falling into a certain level of the categorical response given a set of predictors [22].
• Multilayer Perceptron (MLP) is the simplest kind of feed-forward network. Within this algorithm, artificial neurons are arranged into a set of layers. Every artificial neuron in one layer is connected to every unit in the next layer. Since this algorithm is built upon many layers, the first layer is called the input layer, the middle layers are called the hidden layers, and the last layer is called the output layer [15].
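To make the KNN description above concrete, the following dependency-free sketch classifies a pixel by majority vote among its nearest labeled neighbors and then computes macro-averaged precision and recall. It is an illustration only: the study reports using scikit-learn, whereas this sketch avoids external libraries, and the pixel values, labels, function names and the choice of macro averaging are all assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points nearest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def macro_precision_recall(y_true, y_pred):
    """Precision and recall averaged over classes with equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Hypothetical (R, G, B) pixel values with made-up labels.
train_x = [(200, 190, 180), (210, 200, 185), (90, 80, 70),
           (95, 85, 75), (40, 60, 45), (45, 65, 50)]
train_y = ["felsic", "felsic", "mafic", "mafic", "ultramafic", "ultramafic"]

queries = [(205, 195, 182), (42, 62, 47), (92, 82, 72)]
truth = ["felsic", "ultramafic", "mafic"]
predicted = [knn_predict(train_x, train_y, q) for q in queries]
print(predicted, macro_precision_recall(truth, predicted))
```

Note that nothing is computed at training time: all distance calculations happen when a query arrives, which matches the description of KNN as being computed at the moment of prediction.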
Satellite Images
EOS offers access to numerous image types such as natural color, infrared color, and false color. It also offers filters such as cloud cover, region, and satellite sensor. The band combinations chosen in this research work were natural color (B04, B03, and B02) and infrared color vegetation (B08, B04, and B03). The natural color combination allowed us to observe ground features as they would appear to the human eye: healthy vegetation appears green, and unhealthy vegetation brown and yellow. The color infrared (vegetation) combination presents the standard false color composite: vegetation appears in shades of red, urban areas appear cyan, and soils vary from dark to light browns. EOS’s website offers a filter that controls the percentage of cloud cover in the images. This option is most beneficial to studies using satellite images, since images with the least visual noise return better results.
Underlying Technologies
Scikit-learn
Scikit-learn is a Python library that incorporates a wide range of state-of-the-art machine learning algorithms. The aim of this library is to provide machine learning to a more general audience using a straightforward high-level language. Highlights of this library are its ease of use, performance, documentation, and Application Programming Interface (API) consistency. Its dependencies are few and it is distributed under the simplified Berkeley Software Distribution (BSD) license, which encourages its use in both academic and commercial settings [18].
Quantum Geographic Information System
QGIS is a free, open-source, cross-platform, and scalable GIS tool. Today, QGIS is geographic information processing software that is popular with many users. This software makes it possible to collect, store, process, analyze, manage, and present all types of spatial and geographic data, comparable to other available, high-priced software [3].
Geospatial Data Abstraction Library
GDAL is a translator library for raster and vector geospatial data formats that is released under an X/MIT style Open Source License by the Open Source Geospatial Foundation. As a library, it presents a single raster abstract data model and a single vector abstract data model to the calling application for all supported formats. It also comes with a variety of useful command line utilities for data translation and processing [12].
OpenGIS Simple Features Reference Implementation
OGR uses drivers to access files in different formats. OGR is a library focusing on vector data. It used to be a separate vector IO library, inspired by OpenGIS Simple Features, that was distinct from GDAL; with the release of GDAL 2.0, the GDAL and OGR components were integrated [12].
3 Related Work

Within the last two decades, satellite images have been beneficial to many studies in the geosciences. In this context, satellite image classification has become the method of choice for these types of studies [6]. This section presents relevant research works in which different machine learning algorithms were applied to satellite images. Table 1 summarizes the work presented in this section.

In [2], the authors present a detailed comparison of various image processing techniques used to analyze satellite images. One issue was that the images were affected by noise and other environmental conditions, so they had to be processed to remove this visual noise before they could be used for analysis. This research work indicates that satellite images are widely used in many real-time applications such as agriculture, land detection, navigation, and geographical information systems. Some of the most popular machine learning based image processing techniques are presented, namely Particle Swarm Optimization (PSO), Genetic Algorithm (GA), and the combination of Cuckoo Search (CS) with PSO. The techniques are compared, and image processing limitations are described. The different metrics for performance evaluation in each of the image processing areas are studied.

In [11], a systematic analysis of satellite image-based land cover classification techniques was conducted. Accurate and effective classification techniques are required to provide meaningful information regarding climate change, biodiversity variation, and so on. The authors indicated that one of the most interesting research areas is satellite image-based land cover classification, since remote sensors provide easily accessible data. Within this area, the authors categorized research works by classification technique, such as Fuzzy Random Forest, SVM, ANN, Bayesian Model, DT, and so on.

In [16], a system for land use mapping by machine learning is proposed. Governments, the private sector, research agencies, and community groups rely on land use mapping data for natural resource assessment, monitoring, and planning. Finding an effective mapping approach is therefore crucial for natural resource condition monitoring and investment, agricultural productivity, sustainability and planning, biodiversity conservation, natural disaster management, and biosecurity. The four machine learning algorithms used to classify satellite images for land use were KNN, SVM, Convolutional Neural Network, and a Capsule Network. In addition, the implemented algorithms
S. Bosquez et al.

Table 1. Comparison table of the articles described

| Author | Area of study | Year | Models used | Results |
|---|---|---|---|---|
| Asokan et al. [2] | Machine learning based image processing techniques for satellite image analysis - a survey | 2019 | Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Cuckoo Search (CS) and PSO | GA and PSO suffer from the limitation that they get trapped in local minima. CS and PSO have shown better results in terms of closeness to the optimal solution when compared to PSO or GA individually |
| Gavade et al. [11] | Systematic analysis of satellite image-based land cover classification techniques: literature review and challenges | 2019 | Clustering, SVM, Learning, Decision Tree, Neural Network, Hierarchical, Fuzzy, Spatial, Bayesian Model, Other Classifier | Accuracy from 50 selected papers: Clustering: 8%, SVM: 14%, Learning: 10%, Decision Tree: 2%, Neural Network: 8%, Hierarchical: 10%, Fuzzy: 20%, Spatial: 6%, Bayesian Model: 10%, Other Classifier: 12% |
| Liao et al. [16] | ML-LUM: A system for land use mapping by machine learning algorithms | 2019 | KNN, SVM, CNN, CapsNet | Accuracy (Satellite Dataset): KNN: 54%, SVM: 75%, CNN: 98.09%, CapsNet: 96% |
| Tangthaikwan et al. [21] | Multiclass support vector machine for classifying spatial data from satellite image | 2017 | MLP4, MLP36, SVM Standard, SVM Max | Accuracy: MLP4: 87.25%, MLP36: 89.32%, SVM Standard: 87.27%, SVM Max: 90.89% |

(continued)
Table 1. (continued)

| Author | Area of study | Year | Models used | Results |
|---|---|---|---|---|
| Garosi et al. [10] | Assessing the performance of GIS-based machine learning models with different accuracy measures for determining susceptibility to gully erosion | 2019 | RF, SVM, NB, GAM | RF AUC: 92.4%, SVM AUC: 90.9%, NB AUC: 87.2%, GAM AUC: 89.9% |
| Bérubé et al. [4] | Predicting rock type and detecting hydrothermal alteration using machine learning and petrophysical properties of the Canadian Malartic ore and host rocks, Pontiac Subprovince, Québec, Canada | 2018 | SVM | Overall Precision: 89%, Overall Recall: 89%. F1 Scores: Meta-sedimentary rocks: 73%, Felsic-intermediate intrusive rocks: 69%, Mafic dykes: 93% |
| Chakouri [5] | Geological and mineralogical mapping in Moroccan central Jebilet using multispectral and hyperspectral satellite data and Machine Learning | 2020 | SVM | Accuracy: Hyperspectral: 93.05%, Multispectral: 89.24% |
were also modified for land use mapping in a Machine Learning Land Use Mapping (ML-LUM) system. This system is able to train models, predict classifications of satellite images, map the land use, display the land use statistical data, and predict production yields.

In [21], a multiclass SVM is used to classify spatial data from satellite images. The image is pre-processed and classified using SVM with the Radial Basis Function (RBF) kernel. A pixel-based classification is performed on the spectral pixel values of a Multi-Spectral Scanner satellite image, with the data drawn from a 3 × 3 square neighborhood. The research process consists of two stages. In the first stage, the RBF kernel is applied. In the second stage, the classification result is compared with other classification methods. SVM achieved a higher accuracy than the other methods.
In [10], the performance of GIS-based machine learning models was assessed with different accuracy measures in order to determine susceptibility to gully erosion in the study area. Digital maps were prepared with the help of extensive field surveys and GPS data. Topographical attributes were derived from digital elevation models (DEM). The land use and normalized difference vegetation index (NDVI) maps were created from satellite imagery. The functional relationships between gully erosion and its controlling factors were calculated using Random Forest, SVM, Naive Bayes, and Generalized Additive Model (GAM). The results showed that the RF model had the highest efficiency and Area Under the Curve (AUC), and the lowest mean absolute error (MAE) and root mean square error (RMSE), compared with SVM, NB, and GAM.

In [4], machine learning was used on rock samples to predict rock type and hydrothermal alteration of the Canadian Malartic ore and host rocks. Various rock types are present in the Malartic District, mainly meta-sedimentary rocks, felsic-intermediate intrusive rocks, and mafic dykes. With SVM, it was found that two petrophysical properties can be used to predict the rock type of a sample with an average precision and recall rate of 89%. The SVM classifier was extended to predict whether meta-sedimentary rocks, felsic-intermediate intrusive rocks, and mafic dykes had undergone hydrothermal alteration, with average F1 scores of 73%, 69%, and 93%, respectively. The machine learning process used in this case study can be applied in advanced exploration stages, and the recovered rock samples can be used to update trained prediction models.

In [5], geological and mineralogical mapping was done for the Moroccan central Jebilet using machine learning and multispectral/hyperspectral satellite data. The Moroccan central Jebilet Massif is one of the main Paleozoic outcrops in Morocco. The massif is characterized by its location in an arid climate, significant mining potential, and the absence of plant cover, which favors the use of spatial remote sensing for geological mapping. The classification with SVM allowed the mapping of the lithological units in the study area. The confusion matrices showed that the SVM classification of hyperspectral data is more accurate than that of multispectral data, with overall accuracies of 93.05% and 89.24%, respectively. The use of hyperspectral and multispectral images has been shown to be a good technique for the characterization of iron deposits and lithological units, which may help in mineral exploration engineering with reduced need for fieldwork and geochemistry.
4 Methodology

This section describes the steps followed in this research work.

4.1 Acquiring Raw and Infrared Satellite Images

To begin, several images of natural and infrared colors were required. The images utilized for this research were sourced from EOS. Using the provided shapefile², six images for the Palmira and La Victoria area in Colombia were searched for and downloaded.

² https://github.com/SBosq/MUFRocks/tree/main/Colombia_Geo.
Using the options provided within EOS, three filters were applied: source, which selects among passive sensors (day, night, or low resolution), active sensors, terrain tiles, EOS storage files, and high-resolution imagery; cloudiness, which sets how much cloud cover is acceptable; and sensor, which selects the satellite the images should come from. One of the most important filters was cloud cover, which allowed a search for images that were mostly or completely free of visual noise. Six different images were selected by hand and are available online³. Their size was approximately 1394 × 1060 pixels (60 m/px), meaning that each image contains a total of 1,477,640 pixels and each individual pixel covers 60 m on the ground. These images were downloaded in May 2021, in TIFF format. GeoTIFF (.tif/.tiff) is one of the most common raster data file types suitable for storage, transfer, display, and printing of raster images. It supports black-and-white, grayscale, pseudo-color, and true-color images. The cropped images actually used were 661 × 1374 pixels, for a total of 908,214 pixels per image. Processing a cropped image took approximately 3.5 s. The full images downloaded from EOS were not used, since processing them would have required too much computing time and resources.

Fig. 2. Satellite images: (a) initial image; (b) clipped image; (c) clipped infrared image; (d) shapefile of the Colombian region; (e) user-generated shapefile.

³ https://github.com/SBosq/MUFRocks/tree/main/InitialImages.

Figure 2(a) shows an example of one of the six initial satellite images that were downloaded. Once the shapefile was fitted onto the satellite image, it became clear that the initial image was larger than the focus area. Figure 2(b) shows the satellite image clipped to the focus area; the shapefile and its contents fit within the clipped area. Figure 2(c) shows the focus area in a different band combination, this time the color infrared composite. Figure 2(d) shows the focus area labeled: the teal data points belong to the mafic class, the purple data points to the ultramafic class, and the yellow data points to the felsic class. Figure 2(e) provides a closer look at the clipped satellite image of the focus area with the shapefile (Figure 2(d)) superimposed on it. Next, the images were renamed according to their band combination, natural color or infrared. All six images were then imported into QGIS in order to begin image segmentation and analysis.

4.2 Preparing and Processing the Data

In this step, the images were prepared and processed one at a time. To this end, a Python script was created. This script is available online⁴ and is shown in Listing 1. Line 1 of Listing 1 shows that the "colombia_fn" variable stores the file name of the cropped natural color image of our area of interest. The format used was TIFF, because the files can be large, may contain high detail, and have their image layers merged when they are saved. Lines 2 and 3 define the "segments_fn" variable, which stores the path of the final segmented and clipped image in .tif format. Notice that the file extension is spelled differently in lines 1 and 2.
However, there is no difference between these formats other than the spelling of the extension. In line 4, GDAL is used to create the "driverTiff" variable, which holds the GeoTIFF driver needed to read and write the aforementioned .tiff and .tif files. Next, the dataset is opened as read-only using the GDAL "Open" function and loaded into the "colombia_ds" variable in line 5. The number of bands found in the image is then stored in the "nbands" variable in line 6. In line 7, an empty list named "band_data" is created to store the data for all the bands found within the image. Lines 8 to 10 print the number of bands, rows, and columns. In line 11, the parameters for the loop that reads the band data are established. Since the bands in "colombia_ds" are numbered from 1 and not 0, the range begins at 1 and ends at nbands + 1 in order to avoid an indexing error. Lines 12 to 14 create a variable within the for-loop called "band", which stores the data from each band as a NumPy array, and append it to the "band_data" list. A NumPy array is used because each band may hold a large amount of data, and NumPy arrays are faster and consume less memory than Python lists. Line 15 stacks the arrays in this list depth-wise and reassigns the result to the "band_data" variable. This step is important because it arranges the data into a single data structure containing rows, columns, and bands.

⁴ https://github.com/SBosq/MUFRocks/blob/main/OSGDAL/main.py.

In Listing 1, line 16 creates a new variable, "img", to store the data from "band_data" rescaled to values from 0 to 1 using the function "rescale_intensity". To keep track of the time that the segmentation of our image takes, line 18 creates the "seg_start" variable, which stores the time just before the segmentation starts. In line 19, image segmentation is performed using Simple Linear Iterative Clustering (SLIC) and the result is stored in the "segments" variable. SLIC is a modern approach for segmenting an image into superpixels, and it requires very little computing power. In this same call, the parameters passed are "img", "n_segments", and "compactness". The variable "img" contains the normalized data from "band_data". The number of segments, "n_segments", was assigned the value of 68,250 after trial and observation. During this step it was realized that the image spanned more than the focus area, so the image was clipped to an appropriate size. Once the image was clipped, it was stored in a local file, whose path is held in the "segments_fn" variable from line 2. The final parameter in this call is "compactness", which balances color proximity against spatial proximity and therefore controls how regular the shapes of the segments are; this was set at 0.1. Once an image was segmented, the amount of time that the segmentation took is displayed on screen, as observed in line 21.
Listing 1: Image Processing

    # Imports assumed by the listing (unnumbered; not shown in the original):
    #   from osgeo import gdal
    #   import numpy as np
    #   import time
    #   from skimage import exposure
    #   from skimage.segmentation import slic
 1  colombia_fn = 'Cropped_Colombia_Area_3.tiff'
 2  segments_fn = \
 3      'C:/temp/eosImages/segments_final.tif'
 4  driverTiff = gdal.GetDriverByName('GTiff')
 5  colombia_ds = gdal.Open(colombia_fn)
 6  nbands = colombia_ds.RasterCount
 7  band_data = []
 8  print('bands', colombia_ds.RasterCount, 'rows',
 9        colombia_ds.RasterYSize, 'columns',
10        colombia_ds.RasterXSize)
11  for i in range(1, nbands + 1):
12      band = \
13          colombia_ds.GetRasterBand(i).ReadAsArray()
14      band_data.append(band)
15  band_data = np.dstack(band_data)
16  img = exposure.rescale_intensity(band_data)
17
18  seg_start = time.time()
19  segments = slic(img, n_segments=68250,
20                  compactness=0.1)
21  print('segments_complete', time.time() - seg_start)
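The stacking and rescaling steps of Listing 1 (lines 15 and 16) can be illustrated without GDAL or scikit-image. The following is a minimal, dependency-free sketch, not the authors' code: the helper names `stack_bands` and `rescale_to_unit` and the two tiny bands are hypothetical, and the rescaling assumes a plain min-max mapping to [0, 1], as the text describes (the output range of the real `rescale_intensity` call depends on the input data type).

```python
def stack_bands(bands):
    """Stack 2-D bands depth-wise into a rows x columns x bands structure."""
    rows, cols = len(bands[0]), len(bands[0][0])
    return [[[band[r][c] for band in bands] for c in range(cols)]
            for r in range(rows)]


def rescale_to_unit(image):
    """Min-max rescale every value of a rows x cols x bands image to [0, 1]."""
    values = [v for row in image for pixel in row for v in pixel]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a constant image
    return [[[(v - lo) / span for v in pixel] for pixel in row]
            for row in image]


# Two hypothetical 2 x 2 bands standing in for the raster bands.
red = [[0, 50], [100, 200]]
green = [[10, 20], [30, 40]]

stacked = stack_bands([red, green])   # each pixel now holds [red, green]
img = rescale_to_unit(stacked)        # same shape, values in [0, 1]
print(stacked[1][1])
print(img[1][1])
```

Each pixel of `stacked` holds one value per band, which is the rows, columns, and bands arrangement that the segmentation step expects.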
Listing 2 presents the creation of the "segments_ds" dataset, which has the same dimensions, geotransform, and projection as "colombia_ds". In lines 1 to 4, "segments_ds" is created as a new .tif file using the "Create" function of the GDAL driver obtained with "GetDriverByName". The parameters passed to "Create" are the "segments_fn" variable (defined in line 2 of Listing 1), the raster x size and the raster y size of the original .tiff file (named in line 1, Listing 1), the number of bands, and the data type. The data type utilized was GDT_Float32 to accommodate varying formats and element sizes found within the dataset. In line 5, the GDAL "SetGeoTransform" function copies the geotransform of "colombia_ds", that is, the affine coefficients that relate pixel coordinates to map coordinates, to the new file. Line 7 uses the GDAL "SetProjection" function to assign the projection information of "colombia_ds" to the newly created .tif file. Line 9 writes the data from "segments" into the first band of "segments_ds" as an array using GDAL's "WriteArray" function; line 10 closes the newly created file by setting the variable to None.
Listing 2: Image Processing

    # Continues from Listing 1 (driverTiff, segments_fn, colombia_ds, segments)
 1  segments_ds = driverTiff.Create(
 2      segments_fn, colombia_ds.RasterXSize,
 3      colombia_ds.RasterYSize, 1,
 4      gdal.GDT_Float32)
 5  segments_ds.SetGeoTransform(
 6      colombia_ds.GetGeoTransform())
 7  segments_ds.SetProjection(
 8      colombia_ds.GetProjectionRef())
 9  segments_ds.GetRasterBand(1).WriteArray(segments)
10  segments_ds = None
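To make the role of the geotransform copied in Listing 2 more concrete, the sketch below applies the six-coefficient affine transform used by GDAL to convert a pixel position into map coordinates. It is a minimal illustration in plain Python; the function name `pixel_to_map` and the numeric coefficients are hypothetical and are not taken from the study's images.

```python
def pixel_to_map(geotransform, col, row):
    """Convert a (col, row) pixel position to map (x, y) coordinates.

    The tuple layout follows the GDAL convention:
    (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height).
    """
    x_origin, pixel_width, row_rot, y_origin, col_rot, pixel_height = geotransform
    x = x_origin + col * pixel_width + row * row_rot
    y = y_origin + col * col_rot + row * pixel_height
    return x, y


# Hypothetical north-up raster: 60 m pixels, upper-left corner at (350000, 420000).
# pixel_height is negative because row numbers grow downwards (southwards).
gt = (350000.0, 60.0, 0.0, 420000.0, 0.0, -60.0)

print(pixel_to_map(gt, 0, 0))    # upper-left corner of the raster
print(pixel_to_map(gt, 10, 5))   # 10 columns east, 5 rows south
```

Because the segmented raster receives the same six coefficients as the source image, every segment label lands on the same ground location as the pixel it was computed from.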
Listing 3 presents the creation of a dataframe using the feature names and values from the created shapefiles, as well as the splitting of the data into two files, one for training and the other for testing the ML models. In Listing 3, line 1 creates the "gdf" variable and assigns it the truth data gathered from the shapefile; this data is read into a geodataframe. In line 3, the unique values from the geodataframe column 'RockTypes' are assigned to the "class_names" variable. The purpose of line 5 is to assign an integer value to each entry of "class_names", because rasters are not able to store strings. The NumPy function used here is "arange", which creates an array from zero to n - 1, where n is the size passed. In our work, the size of "class_names" is used and the result is increased by 1, so that the class IDs start at 1 rather than 0. In line 7, a Pandas dataframe is assigned to the "df" variable. The dataframe is built from a dictionary that contains the columns 'RockTypes' and 'id'. This dataframe is then converted to a CSV file and saved under the file name 'RType_lookup.csv' in line 9. Line 11 adds an 'id' column to the geodataframe by mapping each rock type through dict(zip(class_names, class_ids)), which contains the IDs saved in the CSV file. Lines 15 and 16 split the geodataframe loaded from the "Rock_data.shp" shapefile into two subsets, one reserved for training and the other used to test the models. Specifically, line 15 draws a random sample containing 70% of the data, which is used to train the models. Line 16 is assigned the remaining 30% of the data, which is used to evaluate the models. Lines 19 and 20 save these two subsets into separate shapefiles, train.shp and test.shp, respectively.
Listing 3: DataFrame Creation

    # Imports assumed by the listing (unnumbered; not shown in the original):
    #   import geopandas as gpd
    #   import numpy as np
    #   import pandas as pd
 1  gdf = gpd.read_file(
 2      'C:/Users/saulo/Documents/Rock_data.shp')
 3  class_names = gdf['RockTypes'].unique()
 4  print('class_names', class_names)
 5  class_ids = np.arange(class_names.size) + 1
 6  print('class_ids', class_ids)
 7  df = pd.DataFrame(
 8      {'RockTypes': class_names, 'id': class_ids})
 9  df.to_csv('C:/temp/eosImages/RType_lookup.csv')
10  print('gdf_without_ids', gdf.head())
11  gdf['id'] = gdf['RockTypes'].map(
12      dict(zip(class_names, class_ids)))
13  print('gdf_with_ids', gdf.head())
14
15  gdf_train = gdf.sample(frac=0.7)
16  gdf_test = gdf.drop(gdf_train.index)
17  print('gdf_shape', gdf.shape, 'training_shape',
18        gdf_train.shape, 'test', gdf_test.shape)
19  gdf_train.to_file('C:/temp/eosImages/train.shp')
20  gdf_test.to_file('C:/temp/eosImages/test.shp')
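The two ideas in Listing 3, integer class IDs starting at 1 and a random 70/30 split, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions rather than the authors' script: the helper names, the fixed seed, and the ten sample labels are hypothetical, and the pandas and geopandas objects are replaced with plain lists.

```python
import random


def assign_class_ids(labels):
    """Map each distinct label to an integer ID starting at 1 (first-seen order)."""
    class_names = list(dict.fromkeys(labels))  # unique values, order preserved
    return {name: index + 1 for index, name in enumerate(class_names)}


def split_rows(rows, train_frac=0.7, seed=0):
    """Randomly split rows into a training part and a disjoint test part."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    n_train = round(train_frac * len(rows))
    train_idx = set(rng.sample(range(len(rows)), n_train))
    train = [row for i, row in enumerate(rows) if i in train_idx]
    test = [row for i, row in enumerate(rows) if i not in train_idx]
    return train, test


# Ten hypothetical ground-truth labels standing in for the 'RockTypes' column.
rock_types = ["mafic", "felsic", "mafic", "ultramafic", "felsic",
              "mafic", "ultramafic", "felsic", "mafic", "felsic"]

ids = assign_class_ids(rock_types)
rows = [(label, ids[label]) for label in rock_types]
train, test = split_rows(rows)
print(ids)
print(len(train), len(test))
```

Listing 3 does not fix a random seed, so each run of the original script can produce a different split; passing a seed, as assumed here, makes the split reproducible.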
5 Results

In this research, we used the following supervised learning algorithms to classify ultramafic, mafic, and felsic rocks: RF, KNN, SVM, LR, and MLP. The source code of the algorithms that were used for training and evaluating the models is available online⁵. Table 2 summarizes the results in terms of accuracy values for the ultramafic, mafic, and felsic rock types. Accuracy is the ratio of the number of correct predictions to the number of all predictions made:

$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Predictions Made}}$$

In the KNN model, the number of neighbors was set to four after several trials. In the SVM model, the RBF kernel was applied. In the RF model, 100 trees were used in the random forest. In the LR model, the "liblinear" solver was chosen because of the small training dataset. In the MLP model, the "lbfgs" solver was chosen because it tends to converge faster and perform better than the other MLP solvers on small datasets.

⁵ https://github.com/SBosq/MUFRocks/tree/main/neighbortest.
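As an illustration of what the choice of four neighbors means, the following dependency-free sketch classifies a sample by majority vote among its four nearest training samples. It is not the implementation used in this work; the function name `knn_predict`, the two-feature samples, and their values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=4):
    """Return the majority label among the k training points nearest to query."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical samples with two normalized features each.
points = [(0.10, 0.20), (0.15, 0.22), (0.20, 0.18),   # mafic
          (0.80, 0.90), (0.85, 0.88), (0.90, 0.95),   # felsic
          (0.50, 0.10), (0.55, 0.12)]                 # ultramafic
labels = ["mafic", "mafic", "mafic",
          "felsic", "felsic", "felsic",
          "ultramafic", "ultramafic"]

print(knn_predict(points, labels, (0.12, 0.20)))
print(knn_predict(points, labels, (0.82, 0.90)))
```

With an even number of neighbors such as four, two classes can tie; this sketch then returns the tied class whose neighbor is closest, which is a simplification and may differ from how a library breaks ties.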
Table 2. Class accuracy, listed as [ultramafic, mafic, felsic]

| Model | Natural color image | Infrared color image |
|---|---|---|
| RF | [76%, 87%, 83%] | [82%, 80%, 81%] |
| KNN | [86%, 95%, 93%] | [46%, 59%, 52%] |
| SVM | [56%, 70%, 79%] | [0%, 47%, 40%] |
| LR | [58%, 44%, 71%] | [0%, 48%, 0%] |
| MLP | [78%, 81%, 83%] | [26%, 51%, 36%] |
As one can see from Table 2, natural color images were classified successfully and yielded better results, whereas there are instances where the models created were not able to process the infrared rock datasets. Precision, recall, and F-score values were also evaluated for each model. In the definitions below, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.

Precision is the number of correct positive results divided by the number of positive results predicted:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall is the number of correct positive results divided by the number of all relevant samples (all the samples that should be classified as positive):

$$\text{Recall} = \frac{TP}{TP + FN}$$

The F-score is the harmonic mean of precision and recall. This number, which is in the [0, 1] range, indicates how precise the classifier is (precision) and how robust it is (recall). The greater the F1 score, the better the overall performance of the model:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
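The three definitions above translate directly into code. The sketch below is a minimal illustration with hypothetical function names and hypothetical counts; returning 0 when a denominator is zero is an assumption made here so that a class that is never predicted, as in some infrared rows of Table 3, does not raise an error.

```python
def precision(tp, fp):
    """Correct positive results divided by all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Correct positive results divided by all samples that are truly positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one class: 8 true positives, 2 false positives,
# and 1 false negative.
p = precision(8, 2)
r = recall(8, 1)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))

# Consistency check against a reported pair: precision 0.77 and recall 0.89
# (RF, natural color image, ultramafic) give an F1 score of about 0.83.
print(round(f1_score(0.77, 0.89), 2))
```

Applying `f1_score` to the rounded precision and recall values of a table can differ from the reported F1 score by about 0.01, because the reported values were presumably computed before rounding.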
Table 3 presents the classification report in terms of precision, recall, and F1 score for natural color images and infrared color images. The best overall results across both image types were obtained with the RF model, while the best results for each image type were obtained with KNN for natural color images and RF for infrared images.
Table 3. Classification report of the models in terms of precision, recall, and F1 score

| Model | Class | Natural color: Precision | Natural color: Recall | Natural color: F1 score | Infrared color: Precision | Infrared color: Recall | Infrared color: F1 score |
|---|---|---|---|---|---|---|---|
| RF | Ultramafic | 0.77 | 0.89 | 0.83 | 0.83 | 0.71 | 0.77 |
| RF | Mafic | 0.90 | 0.73 | 0.81 | 0.79 | 0.82 | 0.81 |
| RF | Felsic | 0.79 | 0.82 | 0.81 | 0.82 | 0.91 | 0.86 |
| KNN | Ultramafic | 0.86 | 1.00 | 0.92 | 0.30 | 0.36 | 0.33 |
| KNN | Mafic | 0.94 | 0.71 | 0.81 | 0.50 | 0.60 | 0.54 |
| KNN | Felsic | 0.82 | 0.93 | 0.87 | 0.36 | 0.22 | 0.27 |
| SVM | Ultramafic | 0.47 | 0.26 | 0.33 | 0.00 | 0.00 | 0.00 |
| SVM | Mafic | 0.34 | 0.56 | 0.42 | 0.46 | 0.99 | 0.63 |
| SVM | Felsic | 0.86 | 0.76 | 0.81 | 0.36 | 0.01 | 0.01 |
| LR | Ultramafic | 0.54 | 1.00 | 0.70 | 0.00 | 0.00 | 0.00 |
| LR | Mafic | 0.67 | 0.06 | 0.12 | 0.50 | 1.00 | 0.66 |
| LR | Felsic | 0.85 | 0.89 | 0.87 | 0.00 | 0.00 | 0.00 |
| MLP | Ultramafic | 0.78 | 0.96 | 0.86 | 0.32 | 0.05 | 0.08 |
| MLP | Mafic | 0.94 | 0.61 | 0.74 | 0.51 | 0.90 | 0.65 |
| MLP | Felsic | 0.87 | 1.00 | 0.93 | 0.38 | 0.14 | 0.20 |
6 Discussion

According to Table 2, for natural color images the KNN model performed best, with an accuracy score of 87.04%. Table 3 presents the classification report of the natural color image using the KNN model. From this it can be observed that the precision scores for each individual class are 86% for ultramafic, 94% for mafic, and 82% for felsic rock types. The recall scores for the KNN model using the natural color image are 100% for ultramafic, 71% for mafic, and 93% for felsic rock types. The F1 score values for the KNN model using the natural color image are 92% for ultramafic, 81% for mafic, and 87% for felsic rock types. Focusing on the F1 score of each class for the KNN model using the natural color image, it can be determined that KNN was in fact the best performing model for natural color image classification.

According to Table 2, for infrared images the RF model was the best performing one, with an accuracy score of 82%. Table 3 presents the classification report of the infrared image using the RF model. From this, it can be observed that the precision scores for each individual class are 83% for ultramafic, 79% for mafic, and 82% for felsic rock types. The recall scores for the RF model using the infrared image are 71% for ultramafic, 82% for mafic, and 91% for felsic rock types. The F1 score values for the RF model using the infrared image are 77% for ultramafic, 81% for mafic, and 86% for felsic rock types. Upon further observation of the F1 score of each class, it can be determined that RF was in fact the best performing model for infrared image classification.
7 Conclusions and Future Work

We proposed the use of five machine learning models for the classification of felsic, mafic, and ultramafic rocks, using satellite images of the Palmira and La Victoria area in Colombia. The best overall results across both image types were obtained with the RF model, while the best results for natural color and infrared images were obtained with the KNN and RF models, respectively. In future work, we expect to utilize more images and shapefiles obtained from future visits to the area of study. We also expect to test our classifier in the field to corroborate the validity of the results obtained in the lab. In order to test our results, fieldwork, remote sensing techniques, and the study of the spectral signature of the rocks should be considered. The combination of macro results (i.e., obtained with the analysis of satellite images) with micro results (i.e., obtained with the analysis of samples in the field) can prove useful for future work. Finally, we expect to analyze hyperspectral images of the area of study obtained using a drone, in order to have more detailed data.
References

1. Earth Observing Data Analytics Inc.: EOS LandViewer: browse real-time Earth observation (2021). https://eos.com/products/landviewer/
2. Asokan, A., Anitha, J.: Machine learning based image processing techniques for satellite image analysis - a survey. In: 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon) (2019)
3. Baghdadi, N., Zribi, M., Mallet, C.: QGIS and Generic Tools. ISTE LTD. ISBN: 978-1-78630187-1 (2018)
4. Bérubé, C.L., et al.: Predicting rock type and detecting hydrothermal alteration using machine learning and petrophysical properties of the Canadian Malartic ore and host rocks, Pontiac Subprovince, Québec, Canada. Ore Geol. Rev. 96, 130–145 (2018)
5. Chakouri, M.: Geological and mineralogical mapping in Moroccan central Jebilet using multispectral and hyperspectral satellite data and machine learning. Int. J. Adv. Trends Comput. Sci. Eng. 9(4), 5772–5783 (2020)
6. Lu, D., Weng, Q.: A survey of image classification methods and techniques for improving classification performance. Int. J. Remote Sens. 28(5), 823–870 (2007). https://www.tandfonline.com/doi/pdf/10.1080/01431160600746456?needAccess=true
7. Dhingra, S., Kumar, D.: A review of remotely sensed satellite image classification. Int. J. Electr. Comput. Eng. 9, 1720 (2019)
8. IBM Cloud Education: What is Random Forest? (2021). https://www.ibm.com/cloud/learn/random-forest
9. Encyclopedia Britannica: Felsic and mafic rocks - igneous rock (2021). https://www.britannica.com/science/felsic-rock
10. Garosi, Y., Sheklabadi, M., Conoscenti, C., Pourghasemi, H.R., Oost, K.V.: Assessing the performance of GIS-based machine learning models with different accuracy measures for determining susceptibility to gully erosion. Sci. Total Environ. 664, 1117–1132 (2019)
11. Gavade, A.B., Rajpurohit, V.S.: Systematic analysis of satellite image-based land cover classification techniques: literature review and challenges. Int. J. Comput. Appl. 43(6), 514–523 (2019)
12. GDAL/OGR: GDAL/OGR Geospatial Data Abstraction Software Library, Open Source Geospatial Foundation (2021). https://gdal.org
13. Geologyin: How to classify igneous rocks into (ultramafic, mafic, intermediate and felsic)? (2021). https://www.geologyin.com/2014/12/how-to-classify-igneous-rocks-into.html
14. Geron, A.: Hands-on Machine Learning With SciKit-Learn and TensorFlow Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media Inc, Sebastopol, CA (2017)
15. Grosse, R.: Lecture 5: Multilayer perceptrons (2018). http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/readings/L05%20Multilayer%20Perceptrons.pdf
16. Liao, X., Huang, X., Huang, W.: ML-LUM: a system for land use mapping by machine learning algorithms. J. Comput. Lang. 54, 100908 (2019)
17. Nelson, D.: What is a KNN (k-nearest neighbors)? (2020). https://www.unite.ai/what-is-knearest-neighbors/
18. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 3426 (2011)
19. Korstanje, J.: The k-nearest neighbors (kNN) Algorithm in Python (2021). https://realpython.com/knn-python/
20. Rodríguez Ramos, B.P.: Estudio metalgenético de las mineralizaciones auríferas del área de Ginebra y zonas aledañas, Valle del Cauca. Universidad Nacional de Colombia (2012). https://repositorio.unal.edu.co/handle/unal/21336
21. Tangthaikwan, K., Keeratipranon, N., Agsornintara, A.: Multiclass support vector machine for classification spatial data from satellite image (2017)
22. Pennsylvania State University: 12.1 - logistic regression | stat 462 (2021). Available: https://online.stat.psu.edu/stat462/node/207/
23. Vera Torres, J.A.: RACEFN Glosario de Geología (2009). http://www.ugr.es/~agcasco/personal/rac-geologia/0-rac.htm
SHAQ: Single Headed Attention with Quasi-recurrence
Sangeet Dandona, Warren Kushner, Nashwin Bharwani, and Ben Schreiber
Georgia Institute of Technology, Atlanta, USA
[email protected]
Abstract. Natural Language Processing research has recently been dominated by large-scale transformer models. Although they achieve state-of-the-art results on many important language tasks, transformers often require expensive compute resources and days to weeks of training. This is feasible for researchers at big tech companies and leading research universities, but not for scrappy start-up founders, students, and independent researchers. Stephen Merity’s SHA-RNN, a compact, hybrid attention-RNN model, is designed for consumer-grade modeling, as it requires significantly fewer parameters and less training time to reach near state-of-the-art results. We analyze Merity’s model here through an exploratory model analysis over several units of the architecture, considering both training time and overall quality in our assessment. Ultimately, we combine these findings into a new architecture which we call SHAQ: Single Headed Attention Quasi-recurrent Neural Network. With our new architecture we achieved accuracy similar to that of the SHA-RNN while accomplishing a 4x speed boost in training.
Keywords: Natural Language Processing · NLP · RNN · QRNN · Attention
1 Introduction
This paper presents a model analysis of the Single Headed Attention Recurrent Neural Network (SHA-RNN) architecture popularized by Stephen Merity in his paper Single Headed Attention RNN: Stop Thinking With Your Head [11]. In that paper, Merity reports near state-of-the-art performance on the ENWIK8 character modeling task using a single GPU (Titan V) within one day. When SHA-RNN was published, the state of the art (SOTA) in language modeling was increasingly dominated by transformer models. Merity’s work demonstrated that it was possible to achieve near-SOTA results without relying on large, complex transformer architectures. Our aim here is to extend Merity’s research on the SHA-RNN by investigating whether the original network can be enhanced to increase training speed and/or accuracy while minimizing the memory footprint.
Although Merity’s results were widely celebrated in the practitioner community, the paper itself has limited detail on how the author actually arrived at the final architecture. This is not a problem for those who merely want to use the model. However, for those who want to understand and further optimize the model, it is important to know what each component’s purpose is and how it impacts performance. To address this need, we performed an ablation study over several components of the model, both by removing them and by substituting in new components. More specifically, we examined the recurrent layer of the network, the Boom layer, the custom attention head layer, and the number of layer blocks in the network. By applying SHA-RNN to the ENWIK8 dataset we measure both the quality of the solutions that our ablation studies produce and the training time needed to achieve them. Additionally, from the results of our experimentation, we introduce a new variation on SHA-RNN called the Single Headed Attention Quasi-recurrent Neural Network (SHAQ), which has been shown to outperform the SHA-RNN model in terms of both speed and accuracy.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 548–563, 2022. https://doi.org/10.1007/978-3-031-10464-0_37
1.1 Background and Motivation
The application of interest here is character prediction, which falls under the broader category of Natural Language Processing (NLP), a field in which transformer networks have dominated. These networks are able to make use of ordered information through positional embeddings without the need to run sequentially. This allows for parallelization at both training and inference time. Additionally, transformers develop a strong sense of context awareness through their attention mechanism, which estimates a distribution over the importance of preceding tokens to inform the prediction of the next token [13]. Transformers have achieved state-of-the-art results on many problems including machine translation, text prediction and conversation. One of the most notable transformer architectures is BERT, a bidirectional encoder model, meaning that it uses both left and right context to predict tokens in a sentence [3]. It has achieved record-breaking performance on important language problems including GLUE, a composite score over an array of natural language tasks, and SQuAD, which involves predicting the answer to a question given a Wikipedia passage containing the necessary information. Since BERT was released, it has become a common pretrained foundation for many practical language applications and has served as a baseline and fountain of inspiration for other cutting-edge research projects. One example of this is RoBERTa, where the team started with the BERT model architecture and made several adjustments during training, including training on a 10x larger input dataset, increasing the batch size, increasing the vocabulary size, and dynamically changing the masks applied to the training examples [7]. This achieved an even better GLUE score than the original BERT, and outperformed it on each of the 9 tasks. One important thing to note about models like BERT and RoBERTa is that they are developed by abundantly funded organizations and institutions.
While there is merit in striving for the absolute maximum performance that can be achieved with massive-scale transformers, it is often infeasible for startups and individuals on a bootstrap budget to build on or even apply this research, limiting the audience to which it is relevant. These limitations bottleneck innovation, preventing many researchers from building upon previous advancements and limiting the influx of ideas. In particular, outside the realm of research and academia there are constraints of budget, training time, and machine resources which make it difficult to engineer something like BERT, which requires 16 TPUs and 4 days to train and results in a 340M parameter model that is not trivial to store and serve at scale in production. If successful, this paper will show how SHA-RNN can be customized for use by researchers with limited computing resources. The introduction of SHAQ reinforces Merity’s argument that you do not need to build extremely large-scale models to reach near-SOTA results.
Stephen Merity notes that there is a significant bias in Big Tech and academia towards enormous transformer models, remarking that researchers are trending towards this approach like “moths to a light-bulb” [11]. On top of this, Merity also reminds us that there has not been a definitive answer as to whether RNN models are inferior to transformers, especially in low-resource situations. It turns out that the door really was not shut for RNNs. Using a multi-layer RNN architecture that employs an attention head mechanism, Merity achieves excellent results on a single GPU with one day of training and a 50M parameter model. Our goal in this endeavor is to understand and build on what Merity has started with SHA-RNN. Our ablation study shows how some components are critical to success while others can be swapped out. Furthermore, SHA-RNN can be scaled in such a way that it becomes much leaner and faster to train at the cost of slight performance degradation. This kind of information is useful for people who are tight on resources so that they can make the best decision possible when it comes to training, serving, and storing their models.
1.2 ENWIK8 Dataset
We employed the ENWIK8 dataset, which is the main focus of the Hutter Prize competition. Following the aspects mentioned in [4], we consider motivation, composition, collection and preprocessing. Computer scientist Marcus Hutter created the dataset to ignite AI innovation through the equivalent task of compression. He chose Wikipedia as a corpus since it is a good proxy for a snapshot of human knowledge at a given moment in time. The ENWIK8 dataset consists of 100M UTF-8 encoded bytes taken from an XML dump of the English version of Wikipedia. In terms of preprocessing, we took full advantage of the utilities in the Merity code base, which split the data into train, validation and test sets of 90M, 5M, and 5M bytes respectively, and cut the data into variable-length character sequences at least 5 characters long. The user has some control over these sequence lengths by providing the center of the distribution from which the lengths are drawn at random. With respect to collection, this dataset has the same flaws concerning sensitivity and privacy that are present in Wikipedia pages.
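As an illustration of the 90M/5M/5M split described above, the following is a minimal sketch rather than the code base's actual utility. The function name `split_corpus` is hypothetical, and the sketch assumes the held-out sets are taken contiguously from the end of the corpus; a 100-byte toy corpus stands in for the real 100M-byte file.

```python
def split_corpus(data: bytes, n_valid: int, n_test: int):
    """Split a byte string into contiguous train / validation / test parts.

    Assumption (not confirmed by the text): validation and test data are
    taken from the end of the corpus, in that order.
    """
    if n_valid < 0 or n_test < 0 or n_valid + n_test > len(data):
        raise ValueError("held-out sizes must be non-negative and fit inside the corpus")
    n_train = len(data) - n_valid - n_test
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]
    return train, valid, test


# Toy stand-in for ENWIK8: 100 bytes instead of 100M, same 90/5/5 proportions.
toy_corpus = bytes(range(100))
train, valid, test = split_corpus(toy_corpus, n_valid=5, n_test=5)
print(len(train), len(valid), len(test))  # prints: 90 5 5
```

For the real dataset the same call would use `n_valid=5_000_000` and `n_test=5_000_000` on the raw bytes of the file.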
2 Architecture Overview
2.1 SHA-RNN
SHA-RNN was inspired by transformers and adopts an attention head and fully connected layers to support a recurrent cell. The network itself is composed of an embedding layer followed by four SHA-RNN blocks. As Fig. 1 shows, each block contains an LSTM cell, followed by a custom attention head and a modified fully-connected layer which Merity refers to as a ‘Boom layer’ [11]. Both the attention head and the Boom layer have a residual connection.
Fig. 1. A single SHA-RNN block. Multiple blocks are chained together in sequence to train the full network. In addition to the LSTM, Merity includes his custom attention head and Boom layer in order to process the output of the recurrent cell.
The custom attention head differs from standard attention heads primarily by omitting the linear pre-processing step when preparing the key and value parameters. It incorporates three gating vectors, called qs, ks and vs, which are multiplied elementwise across the entire set of the corresponding query, key and value vectors. It also adds Dropout and Normalization layers in several places to help regularize the network. This attention head is approximated visually in Fig. 2.
The Boom layer is a simplification of the feed-forward fully-connected cell commonly used in transformer architectures, which consists of a linear layer that expands the dimension from 1024 to 2048, a GeLU activation function, and a linear layer that goes from 2048 back to 1024. Merity’s approach replaces the second linear layer with an operation that splits the activation function output into two equally sized chunks of 1024 each. The first chunk comprises dimensions 1 to 1024 and the second chunk comprises dimensions 1025 to 2048. These two chunks are then summed together, producing a final output of size 1024. This has been shown empirically to be a good approximation of the linear layer: it produces slightly worse results but significantly reduces the number of parameters and increases speed, leading to a more efficient architecture.
The SHA-RNN model is trained using the LAMB optimizer. This optimizer often works significantly better than standard SGD optimizers on transformer models, and in practice has reduced the training time of BERT from 3 days to 76 min [14]. It is a modification of the Layer-wise Adaptive Rate Scaling (LARS) optimizer that makes a few changes to the trust ratio and adopts the ADAM update rule [9]. The network is optimized with the Cross Entropy Loss applied after a softmax layer. Network performance is measured in bits-per-character (bpc), the average number of bits necessary to represent a character [5]. A smaller number of bits means a more efficient encoding [5].
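The Boom layer described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not Merity's implementation: the helper names (`gelu`, `linear`, `boom`) are hypothetical, the exact erf-based GeLU is assumed, and a toy width of 4 replaces the 1024 used in the paper. The parameter counts at the end use the 1024 and 2048 sizes quoted in the text and include biases.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GeLU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute W x + b, with the weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def boom(x, weight, bias):
    """Expand d -> 2d, apply GeLU, then sum the two d-sized halves.

    The sum replaces the second (2d -> d) linear layer of a standard
    feed-forward cell, so it adds no parameters.
    """
    d = len(x)
    hidden = [gelu(v) for v in linear(x, weight, bias)]  # length 2d
    first_half, second_half = hidden[:d], hidden[d:]
    return [a + b for a, b in zip(first_half, second_half)]


# Toy example with d = 4 (the paper uses d = 1024).
random.seed(0)
d = 4
weight = [[random.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(2 * d)]
bias = [0.0] * (2 * d)
y = boom([0.5, -1.0, 2.0, 0.0], weight, bias)
print(len(y))  # prints: 4

# Parameter counts per layer at the sizes quoted in the text, biases included.
boom_params = 1024 * 2048 + 2048              # one expanding linear layer
fc_params = boom_params + 2048 * 1024 + 1024  # plus the projection back down
print(boom_params, fc_params)  # prints: 2099200 4197376
```

Under these sizes, dropping the second linear layer roughly halves the parameters of the feed-forward cell, which is the saving the text refers to.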
Fig. 2. S. Merity’s custom attention head. This differs from traditional attention since it does not apply a linear layer to the k and v inputs, thereby reducing the number of parameters to train.
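The gating idea behind this attention head can be sketched as follows. This is a simplified illustration and not the original implementation: it assumes each gate is a trainable raw vector passed through a sigmoid, and it omits the dropout and normalization layers as well as any linear layer on the query. The function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate(vector, raw_gate):
    """Scale each dimension by sigmoid(raw_gate), a value between 0 and 1."""
    return [v * sigmoid(g) for v, g in zip(vector, raw_gate)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_single_head_attention(query, keys, values, qs, ks, vs):
    """Single-head dot-product attention with elementwise gates on Q, K and V.

    No linear layers are applied to the keys or values; the only trainable
    quantities in this sketch are the raw gate vectors qs, ks and vs.
    """
    q = gate(query, qs)
    gated_keys = [gate(k, ks) for k in keys]
    gated_values = [gate(v, vs) for v in values]
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in gated_keys]
    weights = softmax(scores)
    dim = len(gated_values[0])
    output = [sum(w * v[j] for w, v in zip(weights, gated_values)) for j in range(dim)]
    return output, weights


# Toy example: two positions, two dimensions, all raw gates at 0 (sigmoid = 0.5).
zeros = [0.0, 0.0]
output, weights = gated_single_head_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
    qs=zeros, ks=zeros, vs=zeros,
)
```

Removing the gates, as in one of the experiments reported later, corresponds to skipping the three `gate` calls and using the query, keys and values directly.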
3 Approach
We performed a series of experiments based on Merity’s original implementation of the SHA-RNN [10]. We first ran Merity’s original code to capture bpc and time-per-epoch metrics and used these as our baseline for comparison. Since we used Merity’s exact implementation, we reasoned that any deviation from the results reported in the original paper was due to slight differences in training and hyperparameter tweaks, not mistakes in our implementation of the architecture. In the paper, Merity implemented multiple novel modifications to individual layers (like self-attention) and added his own unique layer, which he called Boom. He achieved very impressive scores, but it was not clear how much influence each innovation had, and he did not report an ablation study on many aspects of the architecture. To determine which layers and modifications had an impact, we performed our own ablation study. In addition, we wanted to see if we could improve upon the original architecture. Using the results of the ablation study, along with hypotheses inspired by the literature on optimizing sequential networks, we made further custom additions that we thought could improve bpc and time efficiency. We then compared our modifications with the baseline findings to see if our additions were fruitful. Finally, we combined the modifications from these studies that reduced the parameter count, training time, and/or bpc into a final architecture and recorded the results. In designing these experiments, we believed they would reveal interesting properties of SHA-RNN because ablation studies are a prevalent and proven method for analyzing existing deep learning architectures. We also felt we could improve SHA-RNN because, as Merity notes, the SHA-RNN architecture was put together rather quickly without a complete analysis of tradeoffs. 
This suggested that there was an opportunity for optimization and refinement of the existing model. In SHAQ, our proposed architecture, we made use of the convolutional nature of Quasi-recurrent Neural Networks (QRNN) in addition to other optimizations to provide a faster and more robust model.
4 Experiment and Results
Specific experiments were chosen in order to investigate the purpose and performance of the individual components that make up the SHA-RNN architecture. As such, experiments were divided into Boom analysis, Attention analysis, Layer analysis, and Recurrent Unit analysis sections. All experiments were run for 32 epochs and were compared against a baseline, Merity’s SHA-RNN model using the default parameters published to his GitHub repository [10]. For consistency, all comparisons were performed against baselines run on the same GPU type. An experiment was considered successful if it was able to outperform the baseline in either time or performance without degradation in the other. This was explicitly defined as offering a better bpc on the
test set given a similar amount of training time per epoch, or offering a similar bpc on the test set with a smaller training time per epoch. Experiments were performed using Merity’s PyTorch implementation from his GitHub repository [10] as a starting point, with minor modifications. These updates can be found in our code repository (see footnote 1).
Table 1. Experiment results on a Tesla V100. All experiments were conducted with the same parameters while varying elements of the architecture. The accuracy metrics (i.e. Loss and bpc) are from the test set after 32 epochs.
Experiment               Avg. time/Epoch   Params   Loss   bpc
Baseline                 2.16 h            54 M     0.84   1.208
Removed Boom             1.84 h            37 M     0.78   1.120
Replace Boom with FC     2.33 h            71 M     0.82   1.158
QRNN (w = 2)             0.92 h            45 M     0.78   1.126
Mean attention           2.03 h            53 M     0.84   1.208
Removal of Qs, Ks, Vs    2.12 h            51 M     0.78   1.13
4.1 Boom
In order to determine the effectiveness and contribution of the Boom layer, we performed two experiments. The first replaced the Boom layer with its analogous Fully Connected layer, and the second removed the Boom layer entirely. We hypothesized that the fully-connected layer would eventually converge to a better value after all 32 epochs, but that the Boom layer would outperform it after the same amount of elapsed time. Additionally, we believed that having no Boom layer would cause the network to converge faster, but with a worse bpc. We found that the baseline model using Merity’s Boom layer outperformed the fully connected model until approximately the 26 h mark. This showed that removing complexity from the model allowed the network to optimize more effectively when trained under significant time pressure. It also demonstrated that if training time were not a factor, the Fully Connected layer would perform better (see Fig. 3). Surprisingly though, without any fully connected component to process the results of the attention head, the network performed better and trained faster than the baseline. This proved to be a successful experiment, as it surpassed our expectations and demonstrated that adding a component after the attention head actively made the network worse. This is likely because the network does not generalize well enough to compensate for the additional complexity.
1 https://github.gatech.edu/deep-learning-gatech.
Fig. 3. Experiments to determine the value of a Boom layer on a Tesla V100.
4.2 Alternate-Attention Head Experiments
Fig. 4. Experiments altering Merity’s attention layer, charting validation bpc over training time (in hours) for 32 epochs on a Tesla V100.
Merity used a custom self-attention head, illustrated in Fig. 2. In his implementation, the query, key and value vectors are each accompanied by a vector Qs, Ks and Vs, respectively. These are trainable parameters passed through a sigmoid activation to bound their values to [0, 1]. Figure 2 shows that they are multiplied elementwise with their corresponding Q, K, and V vectors. Merity reasoned that certain dimensions in the LSTM hidden vectors contain local information while others encapsulate global information. As the network trains these parameters, Qs, Ks and Vs decide which dimensions to exclude and which to incorporate [11]. This idea, however, was never tested in the original paper. For the first attention experiment, Qs, Ks and Vs were removed, making the attention layer much simpler. This resulted in smoother training and a much steeper learning curve (implying faster convergence) than the baseline, as can be seen in Fig. 4. Along with the faster convergence, it went through 32 epochs
of training 1.5 h faster than the baseline on a Tesla V100. This was due to fewer operations in the attention layer. Finally, it had a significantly better bpc score than the baseline on the test set, at 1.131 bpc.
Merity also cached the memory state vectors originating from the LSTM in prior windows of the training loop. For example, if the sequence length of the LSTM is $p$, the LSTM will emit outputs $[X_i, \ldots, X_{i+p}]$ along with cell states $[M_i, \ldots, M_{i+p}]$, where each $X$ is an output vector, each $M$ is a memory vector, and the subscript $i$ is the index of the token within the document. In the training loop all the $M$’s are cached and used for the next forward pass. Since the training loop only increments 1 token at a time, the memory is stored and becomes $M'$, where $M'$ is composed of the vectors $[M'_{i-1}, \ldots, M'_{i+p-1}]$ in relation to position $i$ (the current LSTM position). This cached sequence is concatenated with the current memories produced, and the cached memory tensor grows until it reaches a maximum sequence length of $5000 - p$. In the attention layer these memories are also concatenated with the current hidden state $H$ along the sequence dimension to make the key and value vectors [10]. As a result of the larger sequence size, the dot product in self-attention involves many more operations. The modified mean attention condenses all of these concatenated memory vectors to their mean along the sequence dimension, thus reducing the computation for the downstream dot product, as follows:
\[ M^{q} = [M^{q}_{i-q}, \ldots, M^{q}_{i+p-q}], \quad \ldots, \quad M'' = [M''_{i-2}, \ldots, M''_{i+p-2}], \quad M' = [M'_{i-1}, \ldots, M'_{i+p-1}], \]
\[ C = \mathrm{Concat}([M^{q}, \ldots, M'', M']) \tag{1} \]
\[ W = \frac{1}{p \ast q} \sum_{j=0}^{p \ast q} C_j \tag{2} \]
\[ Z = [W, H_i, \ldots, H_{i+p}] \tag{3} \]
\[ \mathrm{key}, \mathrm{val} \approx Z, Z \tag{4} \]
where all $H$, $M$, and $W$ vectors have a size of 1024 ($\approx$ denotes downstream operations for key/val). The above idea was inspired by another simplified attention implementation, which replaced the fully connected operations for Q, K and V with an elementwise multiplication with trainable parameters summed over time [8]. As can be seen in Fig. 4, this did have much smoother convergence compared to the baseline. It also completed training in 64 h, a 4 h speedup over the baseline. However, it scored nontrivially worse than the baseline, with 1.219 bpc on the test set.
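The mean attention of Eqs. (1) to (4) can be sketched as below. This is a hedged, pure-Python illustration with hypothetical function names and toy 2-dimensional vectors in place of the 1024-dimensional ones; it takes a plain average over all cached vectors, and the downstream key/value operations denoted by ≈ are left out.

```python
def mean_vector(vectors):
    """Elementwise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[j] for vec in vectors) / count for j in range(dim)]


def mean_attention_inputs(cached_windows, hidden_states):
    """Build the shared key/value sequence Z from cached memories and hidden states.

    cached_windows: list of cached memory windows, each a list of vectors.
    hidden_states:  list of current hidden-state vectors H_i, ..., H_{i+p}.
    """
    concatenated = [m for window in cached_windows for m in window]  # Eq. (1)
    summary = mean_vector(concatenated)                              # Eq. (2)
    z = [summary] + list(hidden_states)                              # Eq. (3)
    return z, z                                                      # Eq. (4)


# Toy example: two cached windows of two vectors each, plus two hidden states.
cached_windows = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
hidden_states = [[0.0, 0.0], [1.0, 1.0]]
key, val = mean_attention_inputs(cached_windows, hidden_states)
print(key[0])    # prints: [4.0, 5.0]
print(len(key))  # prints: 3
```

In this toy case the key/value sequence has 3 entries instead of the 6 that full concatenation would produce, which is where the reduction in dot-product work comes from.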
4.3 Layer-Attention Analysis
Fig. 5. Experiments to understand layer scaling on an RTX 3090.
This line of experiments began by considering why Merity chose to put an attention head specifically on the third layer of the network. We considered that putting it on a different layer could provide some benefit, and tested this hypothesis by measuring performance after placing a single attention head on each layer in a separate experiment. Table 2 shows the results: putting the head on either of the first two layers performed very poorly, while attention on layers 3 and 4 performed about the same. As to why this happened, we posit that the attention signal was probably being lost through the earlier layers and the gradients were too small to provide any meaningful update. This was not overfitting; the network simply was not learning. Since putting the attention head on either of the first two layers was unsuccessful, we wondered if these layers were worth the time and memory they consume. We proposed that removing one layer and adding a second attention head in the fourth layer would be a reasonable tradeoff, because the extra attention head could compensate for the inferential power lost by removing the layer, as more attention heads can jointly attend to information in different representation subspaces [13]. The results showed a trade-off in favor of removing the layer, in that model quality was stable but training time was greatly reduced. We then decided to take off another layer without adding another attention head, which resulted in another speedup without much bpc increase. Lastly, we wondered how much worse the two-layer model would do without its second head. To our surprise, this model performed better than even the baseline model, and achieved these results in about 40% less time (see Fig. 5). This may have been a case of Occam’s Razor, where we found the right balance between model complexity and efficiency. However, given that in other research
endeavors on ENWIK8, giant transformer models have done well [1], it could also be a case of not fine-tuning the more complex model enough. Overall, we deem these experiments a success because we discovered a model architecture which could outperform the baseline in bpc, and require less training time and fewer parameters.
4.4 QRNN
The original implementation of Merity’s code made use of LSTMs as the recurrent unit in each block. We wanted to explore what would happen if the recurrent unit was altered, and to see if there would be any improvement in the overall speed or performance of the model on ENWIK8. While researching different types of recurrent units, we came across the quasi-recurrent neural network (QRNN) proposed by Bradbury et al. [2]. The QRNN makes use of two layers, a convolutional layer and a quasi-recurrent pooling layer, as shown in Fig. 6.
Fig. 6. Convolution feeding into fo-pooling [2]
The basic idea is that the convolutional layer performs a causal 1d convolution over the token embeddings of a sequence. The causal 1d convolution pads the start of the sequence with zeros placed to the left of the first token, with the amount of padding given by the size of the filter (or kernel) minus 1 [12]. Because of the offset introduced by the padding, the filter only looks at the current and prior tokens as it slides across the sequence. Running filters over these embeddings produces results that emphasize which aspects of the embeddings carry the most information while only looking at the past [12]. Afterwards, the results of the convolutions are fed into a pooling layer that preserves the ordering of the tokens. The following equations summarize how the internals of a QRNN work mathematically:
\[ Z = \tanh(W_z \ast X) \tag{5} \]
\[ F = \sigma(W_f \ast X) \tag{6} \]
\[ O = \sigma(W_o \ast X) \tag{7} \]
\[ c_t = f_t \odot c_{t-1} + (1 - f_t) \odot z_t \tag{8} \]
\[ h_t = o_t \odot c_t \tag{9} \]
Equations (5), (6), and (7) represent the convolutional operations of the cell. Three different filter types are convolved over the sequence and fed into activations. Z, F, and O are different gates of the cell, representing an input gate, forget gate, and output gate respectively [6]. Equations (8) and (9) represent the quasi-recurrent pooling layer, which preserves ordering and calculates the cell states and hidden states respectively. Equation (8) preserves the ordering by calculating a moving average using the previous cell state [6]. The main advantage of using a QRNN is its speed. QRNNs achieve comparable performance to LSTMs but are much faster. The limitation of LSTMs is their sequential nature: they need to read one token at a time. The advantage of the QRNN is the parallelizability of the model due to its convolutions. Each convolution can be performed in parallel on different CUDA cores of a GPU, which greatly increases the speed of the model [6]. The lack of parameters to learn in the pooling layer allows for faster sequential calculations as well [6]. As a result, in our experiments we found a noticeable improvement in the training time of the QRNN compared to the baseline, as shown in Fig. 7. The QRNN takes less than half the time to train over 32 epochs compared to the LSTM in the baseline. Both Figs. 7 and 8 indicate that there is an improvement in bpc at the end of epoch 31, with the LSTM reaching a value of 1.227 on validation while the QRNN reaches a value of 1.152. As stated earlier, the speedup in training time is a result of being able to parallelize the convolutions over the embeddings as well as having to learn fewer parameters. Moreover, performing the simple pooling calculations shown in (8) and (9) improves speed. The decrease in bpc is likely a result of the convolutions focusing on more important aspects of the character embeddings, which helps in predicting the next byte. 
Table 1 shows that the QRNN achieved a bpc of 1.126 on the test set which is much lower than the value of 1.208 achieved by the LSTM.
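To make the QRNN computation concrete, the sketch below implements Eqs. (5) to (9) for a single scalar channel in pure Python. It is an illustration under simplifying assumptions, not the implementation used in the experiments: real QRNN layers operate on multi-channel embeddings, the function names are hypothetical, the initial cell state is assumed to be zero, and the last kernel weight is assumed to multiply the current token.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output t only sees x[t-k+1], ..., x[t].

    The sequence is left-padded with k - 1 zeros, so no future token leaks in.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k)) for t in range(len(x))]


def qrnn_layer(x, wz, wf, wo):
    """Single-channel QRNN layer: three causal convolutions plus fo-pooling."""
    z = [math.tanh(v) for v in causal_conv1d(x, wz)]  # Eq. (5)
    f = [sigmoid(v) for v in causal_conv1d(x, wf)]    # Eq. (6)
    o = [sigmoid(v) for v in causal_conv1d(x, wo)]    # Eq. (7)
    c = 0.0  # assumed initial cell state
    hidden = []
    for z_t, f_t, o_t in zip(z, f, o):
        c = f_t * c + (1.0 - f_t) * z_t               # Eq. (8)
        hidden.append(o_t * c)                        # Eq. (9)
    return hidden


# The convolution with a width-2 kernel looks one step back.
print(causal_conv1d([1.0, 2.0, 3.0], [10.0, 1.0]))  # prints: [1.0, 12.0, 23.0]

# Changing a later input leaves earlier hidden states untouched (causality).
wz, wf, wo = [0.5, -0.3], [0.2, 0.1], [-0.4, 0.6]
h_a = qrnn_layer([0.1, 0.2, 0.3], wz, wf, wo)
h_b = qrnn_layer([0.1, 0.2, 9.0], wz, wf, wo)
print(h_a[:2] == h_b[:2])  # prints: True
```

The three convolutions have no dependence between time steps and can be computed for all positions at once; only the short loop for Eqs. (8) and (9) is sequential, and it contains no learned parameters.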
5 SHAQ
Combining the findings from the experiments discussed earlier in the paper, we arrived at a new architecture that we call the single headed attention quasi-recurrent neural network, or SHAQ. The basic building blocks are illustrated in Fig. 9, which shows that SHAQ is much simpler than SHA-RNN (Fig. 1). The input is fed into a layer normalization (LN) block, which rescales the input to have zero mean and a variance of 1. The output of the LN goes directly into a QRNN cell whose output is then passed through a dropout layer. The purpose of the dropout layer is to set some of the QRNN outputs to 0 to help regularize the network and thus improve the ability of the model to generalize. There is a skip connection from the output of the first LN to the output of the dropout. The purpose of the skip connection is to provide more effective gradients during backpropagation. The result is then fed into two additional LNs and into the simplified self-attention, which helps the model focus on certain characters. This result is fed through another dropout for further regularization, and there is another skip connection to help with gradient flow.
Fig. 7. LSTM/QRNN validation bpc vs Time on a Tesla V100.
Fig. 8. LSTM/QRNN validation bpc vs Epoch on a Tesla V100.
Table 2. Experiments where the location of the attention head was moved to different layers, indicated within parentheses. These tests were performed on an RTX 3090 GPU.
Experiment           Time/Epoch   Params   Loss   bpc
4 Layer (1)          1.40 h       54 M     3.52   5.07
4 Layer (2)          1.41 h       54 M     3.52   5.07
4 Layer (3) Base     1.51 h       54 M     0.79   1.146
4 Layer (4)          1.41 h       54 M     0.79   1.141
3 Layer (3,3)        1.15 h       41 M     0.80   1.16
2 Layer (2,2)        0.84 h       29 M     0.81   1.164
2 Layer (2)          0.78 h       29 M     0.79   1.135
Table 3. SHAQ results on a Tesla V100.
Training time   Params   Test loss   Test bpc
18.5 h          26 M     0.82        1.180
Fig. 9. SHAQ block diagram
Figures 10 and 11 show the results of SHAQ on the validation set of ENWIK8. In Fig. 10, SHAQ reaches a final validation bpc of 1.212. This is comparable to the baseline validation bpc of 1.208 shown in Fig. 7. However, SHA-RNN took ∼69 h to train while SHAQ took ∼18.5 h, nearly 1/4 of the time. Figure 11 shows a fairly smooth and steady decline in the bpc value as the number of epochs increases. We were able to reduce the number of parameters by half, to 26 M from the original 53 M of SHA-RNN. On the test set, SHAQ reached 1.18 bpc vs 1.208 for SHA-RNN. Table 3 summarizes the number of parameters of SHAQ and its results on the test set.
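The loss and bpc figures reported for SHAQ are related by a change of logarithm base. The sketch below assumes the reported loss is a mean cross-entropy measured in nats (natural logarithm), so dividing by ln 2 gives bits per character; the function name is hypothetical.

```python
import math


def nats_to_bpc(loss_nats: float) -> float:
    """Convert a mean per-character cross-entropy from nats to bits."""
    return loss_nats / math.log(2.0)


shaq_test_loss = 0.82  # test loss reported for SHAQ, rounded to two decimals
shaq_bpc = nats_to_bpc(shaq_test_loss)
print(round(shaq_bpc, 3))  # prints: 1.183
```

The result agrees with the reported test bpc of 1.180 up to the rounding of the two-decimal loss.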
Fig. 10. SHA-RNN/SHAQ validation bpc vs Time on a Tesla V100.
Fig. 11. SHA-RNN/SHAQ validation bpc vs Epoch on a Tesla V100.
6 Conclusion
Through our study of SHA-RNN, we discovered that the network still appeared to be too complicated in many aspects and would benefit from simplification. Some aspects of the network could be removed entirely or simplified, resulting in a less complex network that was better able to generalize over the dataset. In particular, we found that replacing the LSTM with a QRNN, removing the Boom layer, removing the attention parameters (Qs/Ks/Vs), and reducing the depth of the network all improved results over the baseline. These considerations were combined into a new network called SHAQ, which was able to slightly improve upon the results of the SHA-RNN in 25% of the training time. There are still other improvements that could be made to SHAQ over time. Some examples include more robust hyperparameter tuning, such as changing the learning rate or number of epochs, or trying new variations on the Boom layer. Section 4.3 explored the idea of reducing the number of blocks used in the model. Both SHA-RNN and SHAQ used 4 blocks connected in sequence to train. An alteration to SHAQ could simply be reducing the number of blocks
while increasing the number of QRNN layers in each block. Since attention is all you need these days, another area to explore is increasing the number of heads per block as well.
References
1. Al-Rfou, R., Choe, D., Constant, N., Guo, M., Jones, L.: Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444 (2018)
2. Bradbury, J., Merity, S., Xiong, C., Socher, R.: Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576 (2016)
3. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2019)
4. Gebru, T., et al.: Datasheets for datasets. arXiv preprint arXiv:1803.09010 (2020)
5. Huyen, C.: Evaluation metrics for language modeling (2019)
6. Jagtap, R.: QRNN: a potential competitor to the transformer (2020)
7. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
8. Luo, H., Zhang, S., Lei, M., Xie, L.: Simplified self-attention for transformer-based end-to-end speech recognition. arXiv preprint arXiv:2005.10463 (2020)
9. Mann, B.: An intuitive understanding of the lamb optimizer (2019)
10. Merity, S.: Sha-rnn. https://github.com/Smerity/sha-rnn (2019)
11. Merity, S.: Single headed attention RNN: stop thinking with your head. arXiv preprint arXiv:1911.11423 (2019)
12. Pascual, S.: D2l3 recurrent neural networks ii
13. Vaswani, A., et al.: Attention is all you need. arXiv:1706.03762 (2017)
14. You, Y., et al.: Large batch optimization for deep learning: training BERT in 76 minutes. arXiv preprint arXiv:1904.00962 (2020)
Dynamic Topic Modeling Reveals Variations in Online Hate Narratives
Richard Sear1, Nicholas Johnson Restrepo2, Yonatan Lupu1, and Neil F. Johnson1
1 The Dynamic Online Networks Lab, The George Washington University, Washington, DC 20052, USA {searri,neiljohnson}@gwu.edu
2 ClustrX, LLC, Washington, DC 20007, USA
Abstract. Online hate speech can precipitate and also follow real-world violence, such as the U.S. Capitol attack on January 6, 2021. However, the current volume of content and the wide variety of extremist narratives raise major challenges for social media companies in terms of tracking and mitigating the activity of hate groups and broader extremist movements. This is further complicated by the fact that hate groups and extremists can leverage multiple platforms in tandem in order to adapt and circumvent content moderation within any given platform (e.g. Facebook). We show how the computational approach of dynamic Latent Dirichlet Allocation (LDA) may be applied to analyze similarities and differences between online content that is shared across social media platforms by extremist communities, including Facebook, Gab, Telegram, and VK between January and April 2021. We also discuss characteristics revealed by unsupervised machine learning about how hate groups leverage sites to organize, recruit, and coordinate within and across such online platforms.
Keywords: Latent Dirichlet Allocation · Online hate · Hate speech · Machine learning · Topic modeling
1 Introduction Online hate speech is a very worrying societal problem that is attracting significant attention not only among academics, but also among policy makers because of its highly negative impact on victims [1–5]. Arguments continue to rage around the trade-off between the need to moderate such content and to regulate or punish social media companies that do not comply, versus the need to protect online users’ free speech. The presence of hate speech raises a plethora of issues, perhaps most importantly that it can precipitate offline acts of violence. Better-moderated social media platforms such as Facebook have been stuck in a fight against the spread and proliferation of hate speech for years, with efforts increasing in early 2021 after the riot at the U.S. Capitol on January 6. Despite efforts to curtail it, hate speech continues to be a problem. Its resilience is partly a result of the adaptive, multi-platform network that carries hate speech throughout the internet between both moderated and unmoderated platforms. A better understanding of © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 564–578, 2022. https://doi.org/10.1007/978-3-031-10464-0_38
this network, and the narratives it carries, is important for academics and policymakers looking to gain a better picture of the online battlefield across which online hate evolves. In particular, an automated approach could help social media companies to better police their own platforms in light of the sheer volume of new content that appears on each one daily. Indeed, Facebook already uses artificial intelligence to help it with moderation [6]. In short, online social media platforms with built-in community features are known to be popular fora in which producers of hate speech congregate. Unfortunately, social media companies have an uphill battle containing it due to the enormous amount of fresh material combined with its frequent virality across multiple social networks. Our study here is prompted by the following questions: (A) how are different social media platforms used to spread hate narratives? (B) Can an automated technique be developed in order to overcome the practical problem that human moderators cannot sift through such enormous amounts of content quickly enough every day across multiple platforms? The procedure used in this study is by no means a complete solution to these issues, but it provides a useful framework to be built upon in future work. We show here that a machine learning model like Latent Dirichlet Allocation (LDA) can provide useful insights into the ways that hate groups utilize different social media platforms for different purposes, both in the spreading of narratives and their attempts to coordinate and organize. Our study does not use Twitter data since Twitter tends to be used more as a “broadcast” medium, whereas narratives tend to be nurtured on platforms that have community spaces specifically built around fostering discussion (e.g. Facebook’s “Page”). Twitter is in the early stages of developing such community spaces, but has not yet made the feature widely available [7]. 
In the present methodology, we obtain the material from community content that is publicly available on Facebook Pages, VKontakte, Telegram, and Gab. All pages or groups used in this study were categorized by our team of subject matter experts (SMEs) as being “hateful” according to well-established criteria as discussed in Sect. 2. We stress that our analysis does not require individual or personal information to gain useful insights, similarly to how understanding conversations in a crowded environment does not need information about the individual people who make up the crowd. Details of our approach are provided in Sect. 2 of this paper. Though our study would benefit from further improvement and refinement, it represents one of the first attempts at a highly automated yet transparent model of hate speech analysis across multiple social media platforms.
2 Data and Machine Learning Methods
We start by briefly describing the online ecosystem in which hate manages to thrive, and how the online audience aggregates itself within this ecosystem. The global social media universe comprises several billion users who operate within and often across multiple social media platforms. Most of these platforms have an in-built community feature that allows online users to aggregate around a topic of interest. Each platform uses its own term to describe such online communities; for example, Telegram uses “Group” or “Channel” whereas Facebook uses “Page” or “Group” [8]. Typically, these communities feature relatively benign narratives around sports or lifestyle choices, but
some generate or focus on more extreme content which can be regarded as ‘hateful’. This subset of hateful communities and their narratives can survive a long time if the platform has lower levels of moderation. Any such in-built community can feature links (hyperlinks such as URLs) into other communities whose content is of interest to them, within the same social media platform and between different ones. This can help these communities keep their members away from moderator pressure [4, 5, p. 202]. Hence, the online ecosystem and its audience comprise a highly complex network of interconnected communities within and across platforms, through which hate narratives can evolve and move. Between 2019 and 2021, Facebook developed new content moderation policies designed to counter violent extremism and reduce hate speech [9, 10]. By contrast, Gab and Telegram have largely grown their user base by positioning themselves as unmoderated (or less moderated) free-speech alternatives to major platforms like Facebook and Twitter [11, 12]. VKontakte (VK) is a social network with many features similar to Facebook's, but based and hosted in Russia. While VK is subject to more content moderation policies than unmoderated alternatives like Gab, past research has shown that American and European white nationalists have ‘migrated’ there after being banned from Facebook and Discord [13]. In this study we look across multiple social networks to capture and measure the publicly available text in posts that were shared in hateful communities. To label a community as ‘hateful’, two SMEs who focus on right-wing extremism manually reviewed each community’s most recent 25 posts. When both SMEs agreed that two of these posts exhibited hate against protected groups referenced in the “hate crimes” description from the FBI, we labeled that community as hateful and included it in this study [14].
Reviewers also drew on Mann’s characterization of ethnically and racially motivated violence as “cleansing nation-statism through paramilitarism” [15]. As a result of these definitions, the hateful communities in this study include organized, well-known real-world hate groups like the KKK, as well as decentralized movements like certain Boogaloo groups. Our study uses only English text as identified by Google’s Compact Language Detector. However, our methodology and implementation can easily be extended to other languages. Our collection of hateful communities was carried out irrespective of their geographical location. All posts used in the study were created between January 1 and April 30, 2021 (inclusive). We perform standard preprocessing on the text to remove emojis, URLs, and stopwords. Notably, we leave in domains by converting them to recognizable tokens with the following procedure: “domain.com” becomes “domain__com”. Within the LDA model, such a token will be treated like any other word. We do this so that if there is a useful signal related to social media posts’ domain usage, it remains visible upon manual inspection of the output topics for LDA. Additionally, during this preprocessing phase, we unroll contractions (e.g. “don’t” becomes “do not”) and lemmatize and stem words using the Natural Language Toolkit (https://www.nltk.org/). The goal of this preprocessing is to reduce the “noise” present in the text; generic articles and commonly used words are not good indicators of topic, and therefore the LDA models will achieve a better fit without them. This setup is based on a similar preprocessing setup in previously published work [16].
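The preprocessing steps described above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' code (which is in their repository): the tiny contraction and stopword tables are hypothetical stand-ins, full URLs are assumed to be reduced to their host name before the domain-token conversion, and the lemmatization and stemming step done with NLTK is omitted.

```python
import re

# Hypothetical, deliberately tiny lookup tables; a real pipeline would use
# fuller resources such as a complete stopword list.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "it", "to", "of", "and", "at", "on", "this"}

# Assumption: a full URL is reduced to its host (without a leading "www.").
URL_RE = re.compile(r"https?://(?:www\.)?([^/\s]+)\S*")
# A bare domain such as "domain.com" or "news.example.org".
DOMAIN_RE = re.compile(r"\b([a-z0-9-]+(?:\.[a-z0-9-]+)*)\.([a-z]{2,})\b")


def preprocess(post: str) -> list[str]:
    """Lowercase, unroll contractions, tokenize domains, drop stopwords."""
    text = post.lower().replace("’", "'")
    text = URL_RE.sub(r" \1 ", text)
    # Naive substring replacement; good enough for a sketch.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # "domain.com" becomes "domain__com" so it survives as one token.
    text = DOMAIN_RE.sub(lambda m: m.group(0).replace(".", "__"), text)
    # Keeping only [a-z0-9_] runs also discards emojis and punctuation.
    tokens = re.findall(r"[a-z0-9_]+", text)
    return [t for t in tokens if t not in STOPWORDS]


example = preprocess(
    "Don't watch this https://www.youtube.com/watch?v=abc at domain.com"
)
```

Because the domain token contains only letters, digits and underscores, a bag-of-words model such as LDA treats it like any other vocabulary item.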
We processed the text content by aggregating it for each platform (Telegram, Facebook, Gab, and VK). We then analyzed it using the machine learning tool LDA, which is an unsupervised learning algorithm [17]. This algorithm detects the emergence and evolution of topics by regarding documents as distributions of topics and topics as distributions of words. It learns how to fit these distributions to the dataset during training. We then employ a dynamical version of LDA, which also accounts for the timestamp when the post was created, to extract the evolution of the emergent topics over time [18]. We employ the Gensim implementation for both standard and dynamic LDA.2 This is a completely unsupervised process: all we need to input is the “number of topics” (n_topics), which is a parameter that designates how many groups the model should cluster text into. Having carried out this process, we then use CV coherence as an evaluation technique (see [19] for details). There are many types of coherence score which provide a quantitative method for measuring the alignment of words within an identified topic and can be used as a “goodness of fit” measurement for the topic modeling process. This CV coherence quantity is generated by a separate algorithm that analyzes the set of topics (coherence is not specific to LDA). Coherence analyzes the entire vocabulary of words in a corpus, ranked according to the word distribution in each topic. The CV coherence score for a single model is obtained by calculating the arithmetic mean of the scores obtained for each topic. Specifically, CV is calculated using a sliding window, one-set segmentation of the top words. This comprises collections of probability measures on how often top words in topics co-occur with each other in the corpus. The CV formula also incorporates cosine similarity as an indirect confirmation measure for the context vectors generated by the one-set segmentation. 
A full description and explanation of CV is given elsewhere [19]. Our manual review of the top words in each topic’s word distribution reveals that they do indeed relate to separate conversation topics. Sophisticated automation could be used to address the problem of troublesome content in the sea of new material appearing every day on social media platforms; specifically, the combination that we present here of both standard and dynamic LDA approaches. Many standard LDA models can be trained and then their topic keywords quantified using CV to determine the best number of topics discussed in particular platforms (i.e. the highest value of CV ). We can then seed this parameter into a dynamic LDA model that over a longer time period can automatically track the evolution of topics in terms of their highest-probability keywords. In the next section, we illustrate the output of this method [16], where we train multiple standard LDA models and then average their coherence scores to determine an optimal fit. All code used in these experiments is open-sourced and documented at the following repository: https://github.com/gwdonlab/topic-modeling. It can be used to run similar experiments on arbitrary text datasets.
2 https://radimrehurek.com/gensim/.
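The model-selection step described above (train several standard LDA models per candidate number of topics, average their CV coherence, keep the value with the highest average) can be sketched independently of any particular library. In the sketch below, `coherence_of_run` is a hypothetical callable standing in for "train one LDA model and return its CV score"; the toy scorer and the candidate range 5 to 30 are invented purely for illustration.

```python
from statistics import mean
from typing import Callable, Iterable


def best_n_topics(
    candidates: Iterable[int],
    coherence_of_run: Callable[[int, int], float],
    runs_per_candidate: int = 10,
) -> tuple[int, dict[int, float]]:
    """Average coherence over repeated runs and return the best n_topics."""
    averages = {
        n: mean(coherence_of_run(n, seed) for seed in range(runs_per_candidate))
        for n in candidates
    }
    best = max(averages, key=averages.get)
    return best, averages


def toy_score(n_topics: int, seed: int) -> float:
    # Hypothetical stand-in for "train LDA with this seed, compute CV":
    # peaks at 9 topics, with a small seed-dependent wobble.
    return 0.6 - 0.01 * abs(n_topics - 9) + 0.001 * (seed % 3 - 1)


best, averages = best_n_topics(range(5, 31), toy_score)
```

Averaging over several runs is what makes the peak meaningful, since a single LDA fit depends on its random initialization.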
3 Results and Discussion
Here we show the results of data collection and analysis. We split the data into nine two-week time frames to provide a reasonable balance between the following competing issues: having a large enough time frame such that there is sufficient data within each to get a good fit for the topic model, while also having a sufficiently small time frame to robustly identify the evolution of topics over time. Table 1 shows the quantity of data in our study. We first train multiple standard LDA models to determine the best number of topics for each platform. After training 10 standard LDA models for each value of n_topics, we evaluate the average CV coherence score, producing the coherence plot shown in Fig. 1. We then look at the coherences (CV) to identify the value of the best fit n_topics per platform. This turns out to be 12 for Telegram, 9 for Facebook, 25 for VK, and 8 for Gab. Specifically, we do this by finding the peak in the average coherence scores which typically precedes their decay for large values of n_topics. We note that the platform Telegram has the most available data by far: this likely explains why the coherence scores for models trained on this data are so high relative to other platforms.

Table 1. Data quantities. Each date indicates the start date of its two-week time frame. Even though the amount of posts in each time frame is not uniformly distributed, we believe each has enough data for the models to achieve a good fit.

Date     Facebook      Gab    Telegram      VK
1/1         8,689    5,659     114,488     692
1/15        9,493    2,458     188,108     237
1/29        7,985   20,022      99,747   1,109
2/12        8,207   15,104     104,142   1,095
2/26        3,778    3,290      90,436     824
3/12        3,722   13,202      78,006     731
3/26        6,357   12,696      65,807     742
4/9         3,936   14,070      62,504     816
4/23        2,343   10,120      32,688     540
Total      54,510   96,621     835,926   6,786
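The start dates in Table 1 follow from stepping through the study period in 14-day increments. The short sketch below reproduces them and assigns a post to its frame; the function names are illustrative and not taken from the authors' repository. Note that under this scheme the final frame, starting 4/23, is cut short by the end of the study period on April 30.

```python
from datetime import date, timedelta

STUDY_START = date(2021, 1, 1)
STUDY_END = date(2021, 4, 30)  # inclusive
FRAME_LENGTH = timedelta(days=14)


def frame_starts() -> list[date]:
    """Start date of every two-week frame that begins within the study."""
    starts = []
    current = STUDY_START
    while current <= STUDY_END:
        starts.append(current)
        current += FRAME_LENGTH
    return starts


def frame_index(created: date) -> int:
    """Zero-based index of the frame containing a post's creation date."""
    if not STUDY_START <= created <= STUDY_END:
        raise ValueError("post falls outside the study period")
    return (created - STUDY_START).days // FRAME_LENGTH.days


labels = [f"{d.month}/{d.day}" for d in frame_starts()]
```

A dynamic topic model typically needs only the number of documents per frame, which can be obtained by counting the `frame_index` values of the time-sorted posts.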
Fig. 1. For different numbers of topics (horizontal axis), the average coherence score is shown (vertical axis) for standard LDA models used to analyze the content of hateful communities within four separate social media platforms.
Figures 2, 3, 4 and 5 show the resulting coherence scores for each platform, disaggregated by topic, after performing dynamic LDA using the aforementioned n_topics values. Due to the implementation of dynamic LDA and our own computing constraints, we train only one dynamic LDA model for each platform. This is why the prior step of training standard LDA models to determine an optimal n_topics value is important: it is an attempt to avoid over- or underfitting.
Fig. 2. Individual topics’ coherence scores within a 9-topic Facebook dynamic LDA model
Of particular note is discussion of the U.S. 2020 Presidential Election on several platforms. In itself, this is perhaps not surprising (even though the study period ranges well past January 2021) given the longevity that this election had in the U.S. and to some extent world media. However, our topic modeling reveals the variations between platforms in which this event was discussed. The topics relevant to the election were Topic 10 on Telegram and Topic 5 on Gab. The keywords and their probabilities for these topics are shown in Figs. 6 and 7. Posts which contained these topics tended to discuss events related to individual states’ recount efforts and generally the “stop the steal” narrative; this is also evident from analysis of the topics’ keyword evolution through all time frames (the word “military” appears in the topics during mid-March). Telegram’s Topic 10 was the most coherent of all topics anywhere: this suggests that Telegram acted as the primary platform where this narrative was prominently featured.
Fig. 3. Individual topics’ coherence scores within an 8-topic Gab dynamic LDA model
Facebook, on the other hand, had far less of such election content: no particular topic featured keywords related to the “stop the steal” narrative or to the 2020 election more generally. Our data suggests that during our study period, fewer English-speaking white nationalists/white supremacists were active on Facebook. This is likely because of a 2019 policy introduced by Facebook concerning hate speech and violent extremism, together with increased scrutiny within the U.S. For example, there was a major deplatforming event in the summer of 2020. Our data comes from clusters of users that identify as white nationalist; the communities that persisted on Facebook concentrated on “softer,” more peripheral hate narratives like white motherhood, white beauty, children’s defense, and political topics like immigration. These communities have survived on Facebook because they make a point of avoiding explicit hate. By contrast,
Fig. 4. Individual topics’ coherence scores within a 12-topic Telegram dynamic LDA model
communities on the less moderated platforms were free to blend these “soft-hate” topics with more explicit narratives including (but not limited to) “stop the steal.” This self-censorship is likely the reason the 2020 election is not as prominent among topics discovered in Facebook groups. Interestingly, increases and decreases in coherence score can also prove useful in analyzing when communities are increasing or decreasing their interest in some broad narrative or set of conversation topics – and hence aggregating towards or fragmenting away from these things. On Telegram, for example, the most significant decrease in coherence score over our study period is shown by Topic 6. Topic 6 features discussions around getting banned or censored as well as mentions of other platforms like Parler and Twitter. There are peaks in its coherence score in January and February, which coincides with the aftermath of the January 6 Capitol riot when more mainstream, better-moderated
Fig. 5. Individual topics’ coherence scores within a 25-topic VK dynamic LDA model
platforms like Facebook or Twitter, and web hosts like Amazon were removing communities [20]. After people were banned from these mainstream, moderated platforms, many of them migrated to Telegram [21]. The evolution of this topic demonstrates how dynamic LDA can be leveraged to detect coordination within and across platforms at the macroscopic movement level. Notably, the coherence score then decreases over March and April as users settle into their new platforms. Finally, it is noteworthy that multimedia content is a key part of the narratives in the less-moderated platforms; specifically, videos on external websites which can help reinforce hateful narratives being expressed. On Gab, Telegram, and VK, our LDA approach found a topic that included the “youtube__com” signal, indicating links into YouTube. Topic 3 on Gab also included frequent use of video platforms Rumble and BitChute, indicating the wide variety of platforms employed to host these narratives, as well as the frequent linkage between them.
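The coherence-trend reading discussed above (for example, Telegram's Topic 6 showing the largest decrease over the study period) can be automated once per-topic coherence scores are available for each time frame. The sketch below measures the change simply as the last frame minus the first, which is one possible definition among several (a fitted slope would be another); the scores are invented for illustration and are not values from this study.

```python
def coherence_change(series_by_topic: dict[int, list[float]]) -> dict[int, float]:
    """Change in coherence from the first to the last time frame, per topic."""
    return {
        topic: scores[-1] - scores[0]
        for topic, scores in series_by_topic.items()
    }


def largest_drop(series_by_topic: dict[int, list[float]]) -> tuple[int, float]:
    """Topic whose coherence fell the most, with the size of the change."""
    changes = coherence_change(series_by_topic)
    topic = min(changes, key=changes.get)
    return topic, changes[topic]


# Hypothetical per-frame coherence scores keyed by topic id.
toy_series = {
    5: [0.50, 0.52, 0.51],
    6: [0.60, 0.62, 0.40],
    10: [0.70, 0.71, 0.72],
}
topic, change = largest_drop(toy_series)
```

A first-minus-last difference ignores peaks in the middle of the period, so an analyst interested in events such as the January and February peaks would want to inspect the whole series as well.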
Fig. 6. Word probabilities for Gab LDA, topic 5, during the first and last time frames
Fig. 7. Word probabilities for Telegram LDA, topic 10, during the first and last time frames
4 Limitations of the Study Of course, much work remains to be done. It would be interesting to directly address the question of external agents or entities. Specifically, it would be useful to try to gauge how much influence such forces exert on these networks [22]. We note, however, that troll or bot-like behavior tends to be weeded out by self-policing within these online communities. We also know that more granular analysis of the types of content could prove fruitful, as well as incorporating more platforms. Shorter time frames would allow analysts to study with greater precision the ways in which these narratives evolve. Ideally, this analysis will go beyond just the use of LDA algorithms and analysis of pure text. This would be of interest since multimedia posts are very common. Further research is also required to derive actionable results for social media moderators and policymakers. Another open question is whether the structure of the network itself could aid the analysis of these narratives, or whether the topic modeling presented here could aid network analysis.
5 Conclusion
We have shown that the application of a simple unsupervised topic model architecture like LDA can provide significant insights into the online hate ecosystem, in particular the style of narratives users share in these communities and the ways different platforms are employed. Our methodology and machinery can potentially be used at scale to help moderation efforts across platforms and hence reduce the spread of hateful material. Specifically, we showed that a machine learning algorithm (LDA) can identify word distributions within posts from historically hateful online communities which are both plausible as distinct conversation topics and useful for gaining insights into the structure of narratives in these communities. Algorithms like LDA can not only handle huge quantities of data, but also deliver results quickly. These techniques are significantly less costly than relying on human labeling.

Acknowledgments. We acknowledge Rhys Leahy and Nicolás Velásquez for their help finding and downloading the data used in this study. We are grateful for funding for this research from the U.S. Air Force Office of Scientific Research under award numbers FA9550-20-1-0382 and FA9550-20-1-0383. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.
References
1. Velásquez, N., et al.: Online hate network spreads malicious COVID-19 content outside the control of individual social media platforms. Sci. Rep. 11(1), 11549 (2021). https://doi.org/10.1038/s41598-021-89467-y
2. Hate crime: abuse, hate and extremism online – Home Affairs Committee – House of Commons. https://publications.parliament.uk/pa/cm201617/cmselect/cmhaff/609/60902.htm. Accessed 5 October 2021
3. Cullors, P.: Online hate is a deadly threat. When will tech companies finally take it seriously? CNN. https://www.cnn.com/2018/11/01/opinions/social-media-hate-speech-cullors/index.html. Accessed 5 October 2021
4. The year in hate and extremism 2020. Southern Poverty Law Center. https://www.splcenter.org/news/2021/02/01/year-hate-2020. Accessed 5 October 2021
5. The Daily 202: hate crimes are a much bigger problem than even the new FBI statistics show. Washington Post. [Online]. https://www.washingtonpost.com/news/powerpost/paloma/daily-202/2018/11/14/daily-202-hate-crimes-are-a-much-bigger-problem-than-even-the-new-fbi-statistics-show/5beba5bd1b326b39290547e2/. Accessed 5 October 2021
6. Vincent, J.: Facebook is now using AI to sort content for quicker moderation. The Verge, Nov. 13, 2020. https://www.theverge.com/2020/11/13/21562596/facebook-ai-moderation. Accessed 27 September 2021
7. Communities: talk about your thing with people who get you. https://blog.twitter.com/en_us/topics/product/2021/testing-communities. Accessed 27 September 2021
8. Facebook. https://www.facebook.com/policies_center/pages_groups_events. Accessed 3 September 2021
9. Removing new types of harmful networks. About Facebook, Sep. 16, 2021. https://about.fb.com/news/2021/09/removing-new-types-of-harmful-networks/. Accessed 30 September 2021
10. Combating hate and extremism. About Facebook, Sep. 17, 2019. https://about.fb.com/news/2019/09/combating-hate-and-extremism/. Accessed 30 September 2021
11. Roose, K.: On Gab, an extremist-friendly site, Pittsburgh shooting suspect aired his hatred in full. The New York Times, Oct. 28, 2018. [Online]. https://www.nytimes.com/2018/10/28/us/gab-robert-bowers-pittsburgh-synagogue-shootings.html. Accessed 30 September 2021
12. Schwirtz, M.: Telegram, pro-democracy tool, struggles over new fans from far right. The New York Times, Jan. 26, 2021. [Online]. https://www.nytimes.com/2021/01/26/world/europe/telegram-app-far-right.html. Accessed 30 September 2021
13. Johnson, N.F., et al.: Hidden resilience and adaptive dynamics of the global online hate ecology. Nature 573(7773), 261–265 (2019). https://doi.org/10.1038/s41586-019-1494-7
14. Hate Crimes. Federal Bureau of Investigation. https://www.fbi.gov/investigate/civil-rights/hate-crimes. Accessed 30 September 2021
15. Grand, A.D.: Michael Mann, Fascists. J. Mod. Hist. 78(2), 473–475 (2006). https://doi.org/10.1086/505814
16. Sear, R.F., et al.: Quantifying COVID-19 content in the online health opinion war using machine learning. IEEE Access 8, 91886–91893 (2020). https://doi.org/10.1109/ACCESS.2020.2993967
17. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
18. Blei, D.M., Lafferty, J.D.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning – ICML ’06, Pittsburgh, Pennsylvania, pp. 113–120 (2006). https://doi.org/10.1145/1143844.1143859
19. Syed, S., Spruit, M.: Full-text or abstract? Examining topic coherence scores using latent Dirichlet allocation. In: 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), October 2017, pp. 165–174. https://doi.org/10.1109/DSAA.2017.61
20. Trump and his allies are banned from these platforms. The Washington Post. https://www.washingtonpost.com/technology/2021/01/11/trump-banned-social-media/. Accessed 30 September 2021
21. Far-right groups move online conversations from social media to chat apps – and out of view of law enforcement. Washington Post. [Online]. https://www.washingtonpost.com/technology/2021/01/15/parler-telegram-chat-apps/. Accessed 30 September 2021
22. Broniatowski, D.A., et al.: Weaponized health communication: Twitter bots and Russian trolls amplify the vaccine debate. Am. J. Public Health 108(10), 1378–1384 (2018). https://doi.org/10.2105/AJPH.2018.304567
An Improved Bayesian TRIE Based Model for SMS Text Normalization
Abhinava Sikdar1(B) and Niladri Chatterjee2
1 Columbia University, New York, USA [email protected]
2 Indian Institute of Technology Delhi, New Delhi, India [email protected]
Abstract. Normalization of SMS text, commonly known as texting language, has been pursued for more than a decade. A probabilistic approach based on the Trie data structure was proposed in the literature and was found to perform better than earlier HMM-based approaches in predicting the correct alternative for an out-of-lexicon word. However, the success of the Trie-based approach depends largely on how correctly the underlying probabilities of word occurrences are estimated. In this work we propose a structural modification to the existing Trie-based model along with a novel training algorithm and probability generation scheme. We prove two theorems on statistical properties of the proposed Trie and use them to claim that it is an unbiased and consistent estimator of the occurrence probabilities of the words. We further fuse our model into the paradigm of noisy channel based error correction and provide a heuristic to go beyond a Damerau-Levenshtein distance of one.
Keywords: SMS text normalization · Noisy channel · Trie · Estimation theory
1 Introduction
SMS text normalization focuses on translating texting language, often filled with abbreviations and marred by typing errors, into plain English text. With the smartphone revolution and massive popularisation of social media platforms, users often transmit messages consisting of up to thousands of words in a single day. However, these text messages contain numerous abbreviations and errors. These often arise from a lack of formality between users, human error, and in more severe cases disabilities. With the increase in screen sizes, this is becoming more of a concern, especially when the user resorts to one-handed typing. Thus, shorter message length and semantic unambiguity end up working antagonistically, which gives shape to a compressed, non-standard form of language called NetSpeak [3], commonly known as the texting language. Unfortunately, traditional NLP methods perform rather poorly when dealing with these kinds of texts [7]. As described in [10], texting language contains many nonstandard abbreviations, has unique characteristics and behaves quite differently, which may be the reason for the poor performance of traditional NLP methods. © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 579–593, 2022. https://doi.org/10.1007/978-3-031-10464-0_39
In [11] the problem was approached by leveraging phonetic information, but this required a dictionary for phonetic transcription. [5] used a dictionary for erroneous words but relied on an ambiguous ranking mechanism. [9] used a completely heuristic rule-based model and also employed a dictionary for typos. In all three cases the time complexity grows with the size of the dictionary. [13] compared machine translation models such as SMT and NMT and showed their critical dependence on high-quality training data. In [2], a Hidden Markov Model based approach was used for each word present in the corpus to model all possible normalized or corrected texts and their probabilities of occurrence. This was further used to rank all the possible normalised words, and the highest-ranking word was selected for output as the semantically correct word. In [1], a Trie based probability generation model was proposed, and was shown to outperform HMM-based models whenever the incorrect word was within an edit distance of one after preprocessing for phonetic errors. However, in some cases the target word did not end up having the highest rank. For example, ‘mate’, the intended target word, was ranked 4th in the suggestions list for the word ‘m8’. Through this work we address the limitations of this Trie based probability generation model. We make a set of improvements over the scheme proposed in [1], which we consider to be the baseline system for the present work. Firstly, unlike [1], where a new word is added to the Trie only once, when it is encountered for the first time during the training phase, the proposed model uses a dynamic training environment. In the proposed model the Trie is initially constructed with a certain corpus of most commonly used words and is then deployed. After deployment, the Trie learns dynamically from the texting habits of the users over time, i.e. it keeps adding the same words repeatedly as and when they are encountered.
We show that the new model in the described environment is able to estimate the real probability of occurrence of a word through its own new probability generating algorithm. Hence it facilitates 0-1 loss minimisation in a decision theoretic setup. This in turn minimizes the chances of the target word not being ranked highest. Additionally, it is well known that generating all the words which are at an edit distance of two from a given word is computationally infeasible. This limited the spelling correction abilities of the baseline model. In this work we propose some novel heuristics that help in overcoming this shortcoming significantly. The contributions of the present work are:
1. We suggest a structural modification to the Trie-based reasoning scheme to improve model accuracy and performance.
2. We prove mathematically that in the described dynamic environment, the expectation of the Trie probability thus generated equals the occurrence probability for each word.
3. We further prove that the Trie probability of each word in the corpus almost surely (a.s.) converges to its occurrence probability.
4. We develop a set of heuristics for error correction.
5. We provide empirical results through simulations to support the presented theorems and highlight the superiority of the new model.
2 Background
From existing literature on the behaviour of users while using texting language [6], one can classify most of the linguistic errors as:
1. Insertion Error: Insertion of an extra letter in the word. E.g. ‘hoarding’ → ‘hoardingh’
2. Deletion Error: Deletion of a letter from the word. E.g. ‘mobile’ → ‘moble’
3. Replacement Error: Replacement of a letter by another in the word. E.g. ‘amazing’ → ‘anazing’
4. Swap Error: Swapping of two adjacent characters in the word. E.g. ‘care’ → ‘caer’
5. Phonetic Error: A group of letters is replaced by another group of letters or numbers having the same sound. E.g. ‘tomorrow’ → ‘2morrow’
To deal with these errors, an elaborate error correction algorithm has already been proposed in the baseline model wherein, given an erroneous/non-lexical word, a suggestion list of possible target words was prepared. The probability of occurrence for each word in the list was generated using a Trie based algorithm as described in Sect. 2.2, and was used for ranking the words in the suggestion list. The highest ranking word was given out as the likely intended word.
2.1 Design of the TRIE
A Trie is a memory-efficient data structure primarily used for storing words. The search complexity in the data structure is O(M), where M is the number of characters in the longest word stored in the Trie, which makes it a computationally efficient data structure for the problem. Each node contains an integer variable Count and a Boolean variable End-Of-Word. The Trie is set up by initialising it with a number of English words. Count for each node is initialised to zero and incremented by one each time a word is inserted through that node. End-Of-Word indicates whether a word ends at the given node and is set to True at the ending node of each word. Each time a new word is inserted, the Count variables of the nodes it passes through are updated, and if at any point the next node is not already present for the insertion of a character, a new node is created and added as a child. At the end of the new word, End-Of-Word is switched from False to True.
2.2 TRIE Probability of a Word
Assigning a probability of occurrence to a word through the Trie is a task of prime importance in the model. In the baseline model this was done in a simplistic manner, as described below. The probability of choosing the ith child node was estimated by dividing its Count variable, referred to as Count(i), by Total(i), which was defined as:
$$\mathrm{Total}(i) = \mathrm{Count}(i) + \sum_{j \in \mathrm{Sibling}(i)} \mathrm{Count}(j) \tag{1}$$
A. Sikdar and N. Chatterjee
Fig. 1. Trie: the model has been trained with 8 words each once. Green represents a True value for End-Of-Word for the coloured node.
where Sibling(i) refers to the set of all the other child nodes of the common parent node. Hence the probability of the ith child node was set to be:
$$P(i) = \frac{\mathrm{Count}(i)}{\mathrm{Total}(i)} \tag{2}$$
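The node layout of Sect. 2.1 and Eqs. (1) and (2) can be illustrated with a minimal Python sketch. It is not the baseline authors' implementation: the names `Node`, `insert`, `child_probability` and `baseline_probability` are chosen here, the eight training words are hypothetical (they are not the words of Fig. 1), and returning 0 for a string that is not stored as a word is an assumption. The probability of a whole string is taken as the product of the per-node probabilities along its path.

```python
from fractions import Fraction


class Node:
    """One Trie node: a pass-through counter and an end-of-word flag."""

    def __init__(self):
        self.children = {}        # character -> Node
        self.count = 0            # Count: words inserted through this node
        self.end_of_word = False  # End-Of-Word


def insert(root, word):
    """Insert one training word, incrementing Count on every node passed."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
        node.count += 1
    node.end_of_word = True


def child_probability(parent, ch):
    """Eq. (2): Count(i) / Total(i), with Total(i) summed over all children."""
    total = sum(child.count for child in parent.children.values())  # Eq. (1)
    return Fraction(parent.children[ch].count, total)


def baseline_probability(root, word):
    """Multiply the per-node probabilities along the path spelled by `word`."""
    node, prob = root, Fraction(1)
    for ch in word:
        if ch not in node.children:
            return Fraction(0)
        prob *= child_probability(node, ch)
        node = node.children[ch]
    # Assumption: a path that does not end at a stored word gets probability 0.
    return prob if node.end_of_word else Fraction(0)


# Hypothetical eight-word training set, each word inserted once.
root = Node()
for w in ["bird", "bill", "bills", "ball", "cat", "car", "dog", "do"]:
    insert(root, w)

print(baseline_probability(root, "bird"))   # 1/8
print(baseline_probability(root, "bill"))   # 1/4
print(baseline_probability(root, "bills"))  # 1/4, same as its prefix "bill"
```

Exact fractions are used so that the products can be compared without rounding. With this hypothetical word set, a word and its one-letter extension receive the same probability, whatever their individual frequencies.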
Further, to get the probability of a string $s_1 s_2 \ldots s_n$ from the Trie, the chain rule of conditional probability was used:
$$P(s_1 s_2 \ldots s_n) = P(s_1)\,P(s_2 \mid s_1) \cdots P(s_n \mid s_1 s_2 \ldots s_{n-1})$$
For example, consider the string ‘bird’. Then
$$P(\text{bird}) = P(b)\,P(i \mid b)\,P(r \mid bi)\,P(d \mid bir)$$
Hence, from Fig. 1 we get
$$P(\text{bird}) = \frac{1}{2} \times \frac{3}{4} \times \frac{1}{3} \times \frac{1}{1} = \frac{1}{8}.$$
3 The Improved Trie
We consider the user to be modelled as a corpus of words denoted by $S$, where each word $w_i \in S$ has an occurrence probability $p_i$. The training and probability calculations proposed in the baseline Trie do not account for an important case: the corpus $S$ contains two words, say $w_1$ and $w_2$, with occurrence probabilities $p_1$ and $p_2$, $p_1 \neq p_2$, such that $w_1$ is a prefix of $w_2$ or vice versa. For example, consider the two words ‘bill’ and ‘bills’ in Fig. 1. No matter what the occurrence probabilities of those two words are, they will always end up with the same Trie probability, even after training the Trie a large number of times by random sampling from $S$. (Generally, in more branched Tries, whenever $w_1$ is a prefix of $w_2$, the Trie probability of $w_2$ is at most the Trie probability of $w_1$.) While the deviation from the occurrence probabilities might not matter much in this example, since one word happens to be the plural of the other, it will matter much more for pairs such as ‘lips’ & ‘lipstick’ and ‘dear’ & ‘dearth’. To overcome this we propose the introduction of a dummy node. This dummy node is added as a child of every node which has End-Of-Word set to True and has at least one non-dummy child. The Count variable of the dummy node is set to the Count of its parent minus the Counts of all the siblings of the dummy node:
$$\mathrm{Count}(D) = \mathrm{Count}(P) - \sum_{j \in \mathrm{Siblings}(D)} \mathrm{Count}(j) \tag{3}$$
where $D$ denotes the dummy node and $P$ denotes the parent of the dummy node. Algorithm I in this section outlines the procedure for training the new Trie as described above. In addition to the training algorithm, we also propose Algorithm II, which is used to generate the Trie probability of a given string. Further, to justify mathematically the use of the new Trie, the proposed training algorithm and the algorithm for Trie probability generation, we present Theorem 3.1, Theorem 3.2 and Corollary 3.3.

Algorithm I: The Training
Input: w, root
Initialization: str ← w, m ← str.length, parent ← root, Flag ← 0
For j ← 1 to m
  1. If parent.children does not contain str.charAt[j] node
     (a) Insert new node corresponding to str.charAt[j]
     (b) Flag ← 1
  2. parent ← child corresponding to str.charAt[j]
  3. parent.Counter ← parent.Counter + 1
parent.End-Of-Word ← True
If a dummy node D is already a child of parent
  1. Increment Counter of D by 1
Else if Flag is 0 and parent has at least 1 non-dummy child
  1. Insert new dummy node as a child of parent
  2. Set Counter of dummy as in Eq. (3)
End
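Algorithm I can be sketched in Python as follows, together with a probability routine in the spirit of Algorithm II so that the effect of the dummy node can be checked. This is a minimal sketch under stated assumptions, not the authors' code: the dummy node is stored under a reserved key `DUMMY` that is assumed never to occur in a word, the function names are chosen here, and the final update of Algorithm II is read as applying to the dummy child of the last node.

```python
from fractions import Fraction

DUMMY = "$"  # hypothetical key for the dummy child; assumed absent from all words


class Node:
    def __init__(self):
        self.children = {}        # character (or DUMMY) -> Node
        self.counter = 0
        self.end_of_word = False


def train(root, word):
    """Algorithm I: insert `word`, maintaining the dummy node of Eq. (3)."""
    parent, created = root, False
    for ch in word:
        if ch not in parent.children:
            parent.children[ch] = Node()
            created = True                      # Flag <- 1
        parent = parent.children[ch]
        parent.counter += 1
    parent.end_of_word = True
    real_children = [n for c, n in parent.children.items() if c != DUMMY]
    if DUMMY in parent.children:
        parent.children[DUMMY].counter += 1
    elif not created and real_children:
        dummy = Node()
        # Eq. (3): Count(D) = Count(P) - sum of the siblings' counts
        dummy.counter = parent.counter - sum(n.counter for n in real_children)
        parent.children[DUMMY] = dummy


def share(parent, key):
    """Eqs. (1)-(2): the child's share of the counts among all children."""
    total = sum(n.counter for n in parent.children.values())
    return Fraction(parent.children[key].counter, total)


def trie_probability(root, word):
    """Algorithm II (as interpreted above): Trie probability of `word`."""
    parent, prob = root, Fraction(1)
    for ch in word:
        if ch not in parent.children:
            return Fraction(0)
        prob *= share(parent, ch)
        parent = parent.children[ch]
    if not parent.end_of_word:
        return Fraction(0)
    if DUMMY in parent.children:
        prob *= share(parent, DUMMY)
    return prob


root = Node()
for w in ["lips", "lipstick", "lips", "lips"]:
    train(root, w)
print(trie_probability(root, "lips"))      # 3/4
print(trie_probability(root, "lipstick"))  # 1/4
```

In this small run the two probabilities match the relative training frequencies of the two words (3 of 4 and 1 of 4), which is the behaviour the dummy node is meant to provide for prefix pairs.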
Algorithm II: The Trie Probability Generation
Input: w, root
Output: Trie probability of w
Initialization: str ← w, m ← str.length, parent ← root, probab ← 1
For j ← 1 to m
  1. If parent.children does not contain str.charAt[j] node
     (a) return 0
  2. child ← child corresponding to str.charAt[j]
  3. Update probab using Eq. (1) & Eq. (2) for child
  4. parent ← child
If parent.End-Of-Word is True
  1. If parent has dummy node
     (a) Update probab using Eq. (1) & Eq. (2) for the dummy node of parent
  2. return probab
return 0
End

Theorem 3.1: Let $S$ denote a finite corpus of words. Let $w_i \in S$ be a word of the corpus with an occurrence probability $p_i$ and let $\hat{p}_i$ denote the word probability generated by the Trie. Let the Trie be trained $n$ times by random sampling (with replacement) from $S$ such that the Trie is trained with each $w_i \in S$ at least once. Then $E[\hat{p}_i] = p_i$.
Proof: We use strong induction over the number of words present in the corpus to prove the theorem.
Base Case: Consider a corpus $S$ consisting of two words $w_1$ and $w_2$ with occurrence probabilities $p_1$ and $p_2$. Then three cases are possible, as shown in Fig. 2.
Case I: The starting characters of $w_1$ and $w_2$ are different. After using the random sample from $S$ to train the Trie $n$ times, the Trie, along with the expected values of the Count variables, is as shown in Fig. 2(a). Clearly,
$$E[\hat{p}_1] = \frac{n p_1}{n p_1 + n p_2} = p_1 \tag{4}$$
and
$$E[\hat{p}_2] = \frac{n p_2}{n p_1 + n p_2} = p_2 \tag{5}$$
Case II: Both w1 and w2 have a common string of characters at the front. For illustration, ‘more’ and ‘most’ have ‘mo’ in common at the front. Hence, similar to Case I, it can be argued from Fig. 2(b) that Eqs. (4) and (5) hold true.
Case III: w1 can be obtained by appending a string of characters at the end of w2 or vice versa. Here, as described earlier we introduce a dummy node, seen as the blue box in Fig. 2(c). Similar to previous arguments, it can be seen that Eqs. (4) and (5) hold. Hence, for the base case when |S| = 2, the theorem is true.
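The argument for Case III is only stated by analogy. A short sketch of the counting behind it follows, under these assumptions: $w_2$ is a proper prefix of $w_1$, the dummy node $D$ below the last node $P$ of $w_2$ has already been created, the at-least-once conditioning of the theorem is set aside, and $N_1$, $N_2$ (symbols introduced here, not in the original) denote the number of times each word was drawn among the $n$ training samples, so that $E[N_i] = n p_i$.

```latex
\begin{align*}
\mathrm{Count}(P) &= N_1 + N_2 = n, \qquad
\mathrm{Count}(D) = \mathrm{Count}(P) - N_1 = N_2 && \text{by Eq. (3)} \\
\hat{p}_2 &= \frac{\mathrm{Count}(D)}{\mathrm{Count}(D) + N_1} = \frac{N_2}{n}, \qquad
\hat{p}_1 = \frac{N_1}{\mathrm{Count}(D) + N_1} = \frac{N_1}{n} \\
E[\hat{p}_1] &= \frac{n p_1}{n p_1 + n p_2} = p_1, \qquad
E[\hat{p}_2] = \frac{n p_2}{n p_1 + n p_2} = p_2
\end{align*}
```

Every node on the shared prefix, and every node of $w_1$ below the branching point, has a single child and contributes a factor of one, so only the split at $P$ between the dummy node and the continuation of $w_1$ matters.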
Fig. 2. Induction base cases
Induction: Assume that the theorem holds whenever $|S| \le k$ for some positive integer $k$. We now show that the theorem holds for any corpus of size $k+1$. Hence assume that we are given the words $w_1, w_2, \ldots, w_{k+1}$ with occurrence probabilities $p_1, p_2, \ldots, p_{k+1}$, respectively.
Case I: Similar to Case I and Case II of the base case, all the words branch out at the same node together, as in Fig. 3. The figure also shows the expected values of the Counter variables after training the Trie $n$ times. Hence we can easily conclude that for each $i = 1, 2, \ldots, k+1$,
$$E[\hat{p}_i] = \frac{n p_i}{n p_1 + \ldots + n p_{k+1}} = p_i \tag{6}$$
since $\sum_{i=1}^{k+1} p_i = 1$.
Fig. 3. Induction step Case I
Case II: In this case, consider all possible Tries which are not covered by Case I. These Tries are in a way “more branched” than the ones considered in Case I. Consider $T$ to be any such Trie and denote its root by $R$. Notice that there must exist a subtree $T'$ which contains strictly fewer than $k+1$ and at least two nodes which have End-Of-Word set to True. It is easy to see that if this were not so, then $T$ would lie in Case I, which is a contradiction. Let us denote the root of $T'$ by $R'$. Let us assume that the nodes in $T'$ for which End-Of-Word is set to True represent the words $w_{(1)}, w_{(2)}, \ldots, w_{(r)}$ with probabilities $p_{(1)}, p_{(2)}, \ldots, p_{(r)}$ respectively, where $2 \le r < k+1$. Now suppose that $T$, instead of having the words $w_{(1)}, w_{(2)}, \ldots, w_{(r)}$, has a word whose characters are defined by travelling from $R$ to $R'$ in the Trie, as shown in Fig. 4. Let this prefix word be $w_\alpha$, with probability $p_\alpha = p_{(1)} + p_{(2)} + \ldots + p_{(r)}$. Hence $T$ now effectively contains strictly fewer than $k+1$ words. By the induction hypothesis, when we train $T$ such that each one of the $k+2-r$ words is used for training at least once, the expectation of the Trie probabilities matches the occurrence probabilities. This means that for each $w_i \in T$ whose last node does not lie in $T'$, Eq. (6) holds true. Also, at the end of the training,
$$E[\mathrm{Count}(R')] = n \times (p_{(1)} + p_{(2)} + \ldots + p_{(r)}) = \alpha \times \sum_{i=1}^{r} \frac{p_{(i)}}{p_{(1)} + p_{(2)} + \ldots + p_{(r)}} \tag{7}$$
where $\alpha = n \times (p_{(1)} + p_{(2)} + \ldots + p_{(r)})$. We can now consider $T'$ to be a standalone Trie consisting of the words $w_{(1)}, w_{(2)}, \ldots, w_{(r)}$, truncated from the front so as to remove the characters of $w_\alpha$. Let these truncated words in $T'$ be denoted by $w'_{(1)}, w'_{(2)}, \ldots, w'_{(r)}$, with occurrence probabilities $p'_{(1)}, p'_{(2)}, \ldots, p'_{(r)}$. Now notice that $T'$ has been trained $\alpha$ times and, since $r < k+1$, the expectation of the Trie probabilities of $T'$ will be the same as the occurrence probabilities of the truncated words, which from $T$ are given by:
$$p'_{(i)} = P_{\text{occurrence}}(w_{(i)} \mid w_\alpha) = \frac{p_{(i)}}{p_{(1)} + p_{(2)} + \ldots + p_{(r)}} \tag{8}$$
Fig. 4. Induction step Case II
Hence, as described earlier, using induction, for each $w'_{(i)}$, $i = 1, \ldots, r$,
$$E[\hat{p}'_{(i)}] = \frac{p_{(i)}}{p_{(1)} + p_{(2)} + \ldots + p_{(r)}} \tag{9}$$
Let us consider the entire tree $T$ with its original words $w_{(1)}, w_{(2)}, \ldots, w_{(r)}$ and not $w_\alpha$. Note that for each $w_{(i)}$, $i = 1, 2, \ldots, r$,
$$\hat{p}_{(i)} = \hat{p}'_{(i)} \times \hat{p}_\alpha \tag{10}$$
One may further notice that this is the same as $P_{Trie}(w_{(i)}) = P_{Trie}(w_{(i)} \mid w_\alpha) \times P_{Trie}(w_\alpha)$, which implies that $E[P_{Trie}(w_{(i)})] = E[P_{Trie}(w_{(i)} \mid w_\alpha)] \times E[P_{Trie}(w_\alpha)]$, since both of the random variables on the RHS are independent by definition. Also, since $E[\hat{p}_\alpha] = p_{(1)} + p_{(2)} + \ldots + p_{(r)}$, we can use Eqs. (9) and (10) to get that, for each $w_{(i)}$, $i = 1, \ldots, r$,
$$E[\hat{p}_{(i)}] = E[\hat{p}'_{(i)}] \times E[\hat{p}_\alpha] = \frac{p_{(i)}}{p_{(1)} + p_{(2)} + \ldots + p_{(r)}} \times (p_{(1)} + p_{(2)} + \ldots + p_{(r)}) = p_{(i)}$$
Hence for each $w_i \in T$, $E[\hat{p}_i] = p_i$ when $|S| = k+1$.
While Theorem 3.1 justifies the use of the Trie through the equality of the occurrence probability and the expected Trie probability at any stage of training, the next theorem provides a more desirable result.
Theorem 3.2: For each $w_i \in S$, $\hat{p}_i \xrightarrow{a.s.} p_i$ as $n \to \infty$.
Fig. 5. Base case with node labels
Proof: Let $\eta_j$ denote a node of the Trie such that each time any one of the words $w_{(1)}, \ldots, w_{(\eta_j)}$ from $S$ is sampled and used for training, its Counter variable is incremented by 1. Let $W_{\eta_j} = \{w_{(1)}, \ldots, w_{(\eta_j)}\}$. Define a random variable $\mathbf{1}_n^{\eta_j}$ such that
$$\mathbf{1}_n^{\eta_j} = \begin{cases} 1 & \text{if the } n\text{th training word} \in W_{\eta_j} \\ 0 & \text{otherwise.} \end{cases}$$
For a fixed node $\eta_j$, the $\mathbf{1}_n^{\eta_j}$ are i.i.d. random variables for $n \in \mathbb{N}$. Also, for each $n \in \mathbb{N}$, $E[\mathbf{1}_n^{\eta_j}] = p_{(1)} + \ldots + p_{(\eta_j)}$. Note that the Counter variable of $\eta_j$ is actually $\sum_{i=1}^{n} \mathbf{1}_i^{\eta_j}$. Hence, by the Strong Law of Large Numbers,
$$\frac{\mathrm{Counter}(\eta_j, n)}{n} = \frac{\sum_{i=1}^{n} \mathbf{1}_i^{\eta_j}}{n} \xrightarrow[n \to \infty]{a.s.} p_{(1)} + \ldots + p_{(\eta_j)} \tag{11}$$
With the above in mind, we proceed by induction over the number of words in $S$.
Base Case: Let $S$ contain two words $w_1$ and $w_2$ with occurrence probabilities $p_1$ and $p_2$. Then, as in the previous theorem, we get three cases, which for our purposes can be depicted by Fig. 5. It is possible for $\eta_0$ to be the root node as well. It is also possible for exactly one of $\eta_1$ or $\eta_2$ to be a dummy node. This covers all three cases. Now for $w_1$,
$$\hat{p}_1 = \frac{\mathrm{Counter}(\eta_1, n)}{\mathrm{Counter}(\eta_1, n) + \mathrm{Counter}(\eta_2, n)} = \frac{\mathrm{Counter}(\eta_1, n)/n}{\mathrm{Counter}(\eta_1, n)/n + \mathrm{Counter}(\eta_2, n)/n}$$
Note that $\mathrm{Counter}(\eta_1, n)/n \xrightarrow[n \to \infty]{a.s.} p_1$ and that $\mathrm{Counter}(\eta_2, n)/n \xrightarrow[n \to \infty]{a.s.} p_2$. Hence, by using the Continuous Mapping Theorem [12], we get that
$$\hat{p}_1 \xrightarrow[n \to \infty]{a.s.} \frac{p_1}{p_1 + p_2} = p_1$$
Fig. 6. Almost sure convergence to occurrence probabilities
We can similarly show that $\hat{p}_2 \xrightarrow[n \to \infty]{a.s.} p_2$. Hence our base case, i.e. $|S| = 2$, holds true. The proof of the induction step is similar to that of Theorem 3.1.
From Theorem 3.1 and Theorem 3.2, from the perspective of the theory of estimation and using the fact that almost sure convergence is stronger than convergence in probability, we get the following corollary.
Corollary 3.3: For each $w_i \in S$, the Trie probability $\hat{p}_i$ is an unbiased and consistent estimator of the occurrence probability $p_i$.
Through the results above, we can infer that the new Trie can learn the occurrence probabilities for any set of words through sufficient training, which in turn implies that the new Trie can adapt to the texting and typing style of any user when deployed on a mobile phone, laptop or any other such device.
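The two-word base case of Theorem 3.2 can be checked numerically. The sketch below only simulates the two counters of the base case rather than a full Trie; the function name `estimate_p1`, the seed and the value $p_1 = 0.3$ are choices made here for illustration.

```python
import random


def estimate_p1(p1, n, seed=0):
    """Return Counter(eta1, n) / (Counter(eta1, n) + Counter(eta2, n))."""
    rng = random.Random(seed)
    counter1 = counter2 = 0
    for _ in range(n):
        if rng.random() < p1:   # this training word is w1
            counter1 += 1
        else:                   # otherwise it is w2
            counter2 += 1
    return counter1 / (counter1 + counter2)


# The estimate should settle near p1 = 0.3 as n grows.
for n in (100, 10_000, 100_000):
    print(n, estimate_p1(0.3, n))
```

A single run of this kind illustrates the convergence for one sample path only; it does not replace the almost-sure statement of the theorem.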
4 Error Checking Algorithm
In this section, we build upon the error correction schemes used in the baseline model.
4.1 The Bayesian Approach
As mentioned earlier in Sect. 2, in [1] the words in the suggestion list were ranked solely based on the Trie probabilities. However, following the noisy channel model presented in [7], we use a Bayesian noisy channel approach to rank the words based on the type of error committed by the user. This requires the assignment of probabilities to the types of error, for which the confusion matrices in [8] can be used. Hence
$$P(w \mid \tilde{w}) \propto P(\tilde{w} \mid w)\,P(w)$$
where $w$ denotes a possible correction for $\tilde{w}$. $P(\tilde{w} \mid w)$ depends on the type of error committed to get $\tilde{w}$ from $w$, and $P(w)$ is the occurrence probability of $w$ estimated using the Trie.
Fig. 7. Equality of expectation to occurrence probabilities
4.2 Character Bigram
It was observed in [1] that the Trie could generate suggestions only up to a Damerau-Levenshtein (DL) distance of one. This is because generating an exhaustive list of words at a DL distance of two is computationally very expensive. However, 80% of human errors lie within a DL distance of one and almost all within a DL distance of two [4]. To partially overcome this, we propose a heuristic along the lines of the beam search algorithm, wherein we maintain a character bigram probability matrix denoted by $C$, where $C[i][j]$ represents the probability of the next letter being ‘j’ given that the current letter is ‘i’. At each stage of error correction, we assign a score to each word, say ‘$s_1 s_2 \ldots s_n$’, as:
$$\mathrm{score}(\text{word}) = \prod_{i=1}^{n-1} C[s_i][s_{i+1}] \tag{12}$$
and set a threshold, say $\Gamma$. If the score exceeds the threshold, the word is passed on to the next stage of error correction. This ensures that most of the intermediate low-probability words are discarded, while only the ones with a sufficiently high score are passed on.
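The bigram filter of Eq. (12) and the noisy-channel ranking of Sect. 4.1 can be sketched together in a few lines of Python. All numbers below are hypothetical: the bigram table `C`, the threshold `GAMMA`, the channel probabilities and the priors are made up for illustration and are not taken from the confusion matrices of [8] or from a trained Trie. Treating a missing bigram as probability 0 is also an assumption.

```python
import math


def bigram_score(word, C):
    """Eq. (12): product of C[s_i][s_{i+1}] over adjacent letters of `word`."""
    pairs = zip(word, word[1:])
    return math.prod(C.get(a, {}).get(b, 0.0) for a, b in pairs)


def rank_candidates(typed, candidates, channel, prior):
    """Sort candidates w by P(typed | w) * P(w), highest first."""
    def posterior(w):
        return channel[(typed, w)] * prior[w]
    return sorted(candidates, key=posterior, reverse=True)


# Hypothetical bigram probabilities C[i][j] = P(next letter = j | current = i).
C = {
    "t": {"h": 0.4, "r": 0.1},
    "h": {"a": 0.2},
    "r": {"a": 0.3},
    "a": {"n": 0.3, "i": 0.1},
    "i": {"n": 0.4},
}
GAMMA = 0.001  # hypothetical threshold

# Stage 1: discard intermediate strings whose bigram score is too low.
intermediates = ["than", "train", "tqxn"]
survivors = [w for w in intermediates if bigram_score(w, C) > GAMMA]

# Stage 2: rank the survivors with the noisy-channel posterior.
channel = {("tran", "train"): 0.02, ("tran", "than"): 0.001}  # P(typed | w)
prior = {"train": 0.0005, "than": 0.002}                      # P(w), e.g. from the Trie
print(rank_candidates("tran", survivors, channel, prior))
```

With these made-up numbers the implausible string is filtered out before ranking, and the likelihood of the error type outweighs the higher prior of the competing word.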
Fig. 8. Performance comparison of the two TRIE based models
5 Simulations
5.1 Almost Sure Convergence
In light of Theorem 3.2, a simulation was performed where we trained the new Trie using the training and probability generation algorithms defined in Sect. 3 with the eight words used in Sect. 2. The Trie probabilities, denoted by the bold lines, evidently converge to the occurrence probabilities, denoted by the dotted lines, in Fig. 6.
5.2 Equality of Expectation
Similar to the simulation in Sect. 5.1, we train 30 Tries and take the average of their probabilities. The bold lines can be seen to converge much faster to the dotted lines in Fig. 7 than in Fig. 6. This supports the theorem stating the equality of the expectation of the Trie probabilities and the occurrence probabilities.
5.3 Comparison
In this simulation, we consecutively train both the new Trie model proposed by us and the baseline Trie model using a corpus of twenty words. The occurrence probabilities assigned to these words are depicted in Fig. 8. The new Trie clearly outperforms the baseline Trie. An important observation is that the baseline Trie probabilities clearly sum to more than one and hence do not form a valid probability measure.
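The observation about the sum of the baseline probabilities can be reproduced on a small scale. The sketch below is not the simulation of Fig. 8: it computes both sets of probabilities directly from prefix counts instead of building a Trie, it assumes that a dummy node already exists at every end-of-word node that has children, and the four-word corpus and its frequencies are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def trie_probabilities(training_words, use_dummy):
    """Word probabilities of the baseline (use_dummy=False) or new Trie."""
    passes = Counter()              # prefix -> Count of the node for that prefix
    ends = Counter(training_words)  # word -> number of times it was trained
    for w in training_words:
        for j in range(1, len(w) + 1):
            passes[w[:j]] += 1

    def children_total(prefix):
        """Sum of the counts of all children of the node spelling `prefix`."""
        total = sum(c for p, c in passes.items()
                    if len(p) == len(prefix) + 1 and p.startswith(prefix))
        if use_dummy and total > 0:
            total += ends[prefix]   # Count of the dummy node, Eq. (3)
        return total

    probs = {}
    for w in ends:
        prob = Fraction(1)
        for j in range(1, len(w) + 1):
            prob *= Fraction(passes[w[:j]], children_total(w[:j - 1]))
        if use_dummy and children_total(w) > 0:
            prob *= Fraction(ends[w], children_total(w))  # dummy child's share
        probs[w] = prob
    return probs


# Hypothetical corpus: two prefix pairs with unequal frequencies.
corpus = ["lips"] * 3 + ["lipstick"] + ["dear"] * 4 + ["dearth"] * 2
baseline = trie_probabilities(corpus, use_dummy=False)
improved = trie_probabilities(corpus, use_dummy=True)
print(sum(baseline.values()))  # 2
print(sum(improved.values()))  # 1
print(improved["lips"], improved["lipstick"])  # 3/10 1/10
```

In this example each prefix pair is counted twice by the baseline, so the total is 2, whereas the probabilities with dummy nodes equal the relative training frequencies and sum to 1.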
5.4 Error Correction
We use a dictionary of the 20,000 highest-frequency English words² and 5000 TL words, similar to [1], and use Zipf's Law³ to fit⁴ a probability distribution over the ranked data. The comparative analysis with the baseline shown in Table 1 showcases the superiority of the new model. Further, we were able to achieve a top-1 accuracy of 81.97% as compared to a top-1 accuracy of 71.88% in [1].
Table 1. Comparison of output with baseline Trie
Impure    Target    Top 5 suggestions                  Rank   Rank in [1]
tran      train     than, train, ran, trap, tan        2      5
lng       long      long, lang, log, leg, an, gang     1      4
aple      apple     pale, alle, able, apple, ample     4      5
beleive   believe   believe, believed, believes        1      1
gost      ghost     host, lost, most, past, ghost      5      1
moble     noble     noble, nobler, nobles              1      2
cuz       cause     use, case, cause, cut, cup         3      8
cin       seen      in, can, sin, son, skin            13     17
dem       them      them, then, den, chem, the         1      11
m8        mate      mate, might, eight, ate, mare      1      6
thx       thanks    the, thy, tax, thanks, them        4      1
h8        hate      hate, height, hare, ate, haste     1      –
6 Conclusions
In this paper, we first pointed out a limitation of the Trie-based probability generating model proposed in existing literature. To overcome it, we proposed a structural modification, a training algorithm and a probability generating scheme. We further proved rigorously that the probabilities generated by the new Trie are an unbiased and consistent estimator of the occurrence probabilities. These occurrence probabilities vary from user to user, and the new Trie is capable of adapting to them. We performed simulations, the results of which strongly support both the presented theorems and demonstrate superiority in error correction.
² https://gist.github.com/h3xx/1976236.
³ Value of the exponent characterising the distribution was set to 0.25.
⁴ Not needed once deployed; the model learns the probabilities directly from the user.
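As an illustration of footnote 3, a Zipf distribution over ranked words can be generated as below. This is a sketch under the usual reading of Zipf's Law, in which the probability of the word at rank r is proportional to r raised to the power of minus the exponent; the paper does not spell out its fitting procedure, and the function name is chosen here.

```python
def zipf_probabilities(num_words, exponent=0.25):
    """Return probabilities proportional to rank ** -exponent, rank = 1..num_words."""
    weights = [rank ** -exponent for rank in range(1, num_words + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# 20,000 English words plus 5000 TL words, ranked by frequency.
probs = zipf_probabilities(25_000)
print(probs[0] / probs[15])  # ratio between rank 1 and rank 16
```

With an exponent of 0.25 the distribution is fairly flat: the word at rank 1 is only about twice as probable as the word at rank 16.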
References
1. Chatterjee, N.: A Trie based model for SMS text normalization. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) CompCom 2019. AISC, vol. 997, pp. 846–859. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22871-2_60
2. Choudhury, M., Saraf, R., Jain, V., Mukherjee, A., Sarkar, S., Anupam, B.: Investigation and modeling of the structure of texting language. IJDAR 10, 157–174 (2007)
3. Crystal, D.: Language and the Internet, 2nd edn. Cambridge University Press (2006)
4. Damerau, F.J.: A technique for computer detection and correction of spelling errors. Commun. ACM 7(3), 171–176 (1964)
5. Desai, N., Narvekar, M.: Normalization of noisy text data. Procedia Comput. Sci. 45, 12 (2015)
6. Grinter, R.E., Eldridge, M.A.: Y do tngrs luv 2 txt msg? In: Proceedings of the Seventh European Conference on Computer Supported Cooperative Work (ECSCW 2001) (2001)
7. Jurafsky, D., Martin, J.H.: Speech and Language Processing, 2nd edn. Prentice-Hall Inc., USA (2009)
8. Kernighan, M.D., Church, K.W., Gale, W.A.: A spelling correction program based on a noisy channel model. In: Proceedings of the 13th Conference on Computational Linguistics - Volume 2 (1990)
9. Khan, O.A., Karim, A.: A rule-based model for normalization of SMS text. In: 2012 IEEE 24th International Conference on Tools with Artificial Intelligence, vol. 1, pp. 634–641 (2012)
10. Kobus, C., Yvon, F., Damnati, G.: Normalizing SMS: are two metaphors better than one? In: COLING (2008)
11. Li, C., Liu, Y.: Normalization of text messages using character- and phone-based machine translation approaches. In: INTERSPEECH (2012)
12. van der Vaart, A.W.: Asymptotic Statistics. Cambridge University Press (2000)
13. Veliz, C.M., De Clercq, O., Hoste, V.: Comparing MT approaches for text normalization. In: RANLP (2019)
Dialog Act Segmentation and Classification in Vietnamese
Tho Chi Luong¹ and Oanh Thi Tran²(✉)
¹ FPT Technology Research Institute, FPT University, Hanoi, Vietnam [email protected]
² International School, Vietnam National University, Hanoi, Vietnam [email protected]
Abstract. Natural Language Understanding (NLU) is a critical component in building a conversational system. So far, most systems have processed user inputs at the utterance level and assumed a single dialog act (DA) per utterance. In fact, one utterance might contain more than one DA, each denoted by a different continuous text span inside it (a.k.a. functional segments). As a step towards achieving natural and flexible interaction between human and machine, especially in low-resource languages, this paper presents work on dialog segmentation (DS) and DA classification in Vietnamese. We first introduce the corpus and then systematically investigate different pipeline and joint learning approaches to deal with the two tasks. Experimental results show that the joint learning approach is superior in boosting the performance of both tasks. It outperforms the conventional pipeline approach, which looks at the two tasks separately. Moreover, to further enhance the final performance, this paper proposes a technique to enrich the models with useful DA knowledge. Compared to the standard models which do not use DA knowledge, we achieve considerably better results for the two tasks. Specifically, we achieved an F1 score of 86% in segmenting dialogues, and an F1-micro score of 74.75% in classifying DAs. This provides a strong foundation for future research in this field.
Keywords: Dialog segmentation · Dialog act · Deep learning · Vietnamese retail domain
1 Introduction
Nowadays, automatic conversational systems are emerging and receiving a lot of attention in the Natural Language Processing (NLP) research community, in both academia and industry. To develop such systems, the segmentation and classification of DAs hidden inside utterances are key components in understanding natural language. A DA is a representation of the communicative function of an utterance in a dialog [16]. Based on this information, the machine can make appropriate responses to specific user inputs. Currently, most systems process user inputs at the utterance level and assume one single DA per utterance. In fact, one utterance may contain more than one DA. An example of an utterance with three segments and their DAs is given in Fig. 1. DAs indicate the interaction between utterances inside each dialog. Therefore, it is necessary to provide contextual information when recognizing DAs. For instance, the utterance “What color do you like?” belongs to the DA class Question, specifically SetQuestion. However, in many situations, the identification of DAs is more challenging. For example, if the utterance “I like the red one” follows the previous Question utterance, it should be an Answer. Otherwise, it is likely to be an Inform or Statement. Generally speaking, to correctly determine DAs, it is very important to understand the utterance not only from semantic, pragmatic and syntactic aspects, but also from its dialog history.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 594–604, 2022. https://doi.org/10.1007/978-3-031-10464-0_40
Fig. 1. An example of dialog act segmentation and recognition. (English in Italics).
Researchers have proposed two main approaches to deal with the two tasks:
– Pipeline approach: The utterance is first segmented into different functional segments by a DA segmenter [5,8,9,13,16]. Then, these predicted segments are in turn fed into a DA recognizer to produce their DA labels.
– Joint learning approach: The joint methods [14,17–19] treat the two tasks in one go.
For popular languages such as English, German and Chinese, a lot of research and practice on this application has been carried out. In contrast, very little work has been done for low-resource languages like Vietnamese. To our knowledge, there is only one work dedicated to this research direction. Specifically, Ngo et al., 2018 [15] introduced the first Vietnamese corpus for DAs, including only 28 spoken dialogues annotated for DAs using the ISO 24617-2 standard [4]. Their work was limited to a smaller dataset of spoken texts and investigated only the pipeline approach using Maxent. In this paper, we introduce a larger task-oriented dataset which is built using written chat texts rather than spoken texts. Based on this corpus, we conduct a systematic study of these two tasks using both approaches with robust deep neural models such as CNN [12], LSTM [7] and a recent innovation in pre-trained language models, namely BERT [6]. Experimental results show that the joint learning approach is superior in boosting the performance of both tasks. It outperforms the conventional pipeline approach, which looks at the two tasks separately. Moreover, this work also proposes a technique to enrich the models with useful DA knowledge. Compared to the standard models without DA knowledge, we achieve considerably better results for the two tasks in both approaches. Specifically, we achieved an F1 score of 86% in segmenting dialogues, and an F1-micro score of 74.75% in classifying DAs. In conclusion, this paper makes the following contributions:
1. It systematically investigates the first work on DA segmentation and classification in the Vietnamese retail domain.
2. It introduces a new corpus labelled with both dialog segments and their DA types.
3. It proposes a technique to enrich the joint models with DA knowledge, which demonstrates a further improvement on both tasks.
The remainder of this paper is organized as follows. Related work is described in Sect. 2. Section 3 presents the models using the two approaches, together with a technique to integrate DA knowledge. Section 4 first shows the rigorous process of annotating the corpus. Then, experimental setups and results are reported, followed by some analysis and discussion. Finally, we conclude the paper and point out some future work in Sect. 5.
2 Related Work
Dialog Act Recognition
Early work on DA recognition mostly exploited traditional machine learning methods with hand-crafted features. For example, Stolcke et al., 2000 [16] treated the discourse structure of a conversation as an HMM and the individual DAs as observations emanating from the model states. Ivanovic, 2005 [8] developed a method which combines naive Bayes and DA n-grams to obtain better than 80% accuracy. Ang et al., 2005 [1] proposed using a maxent classifier with textual features. Then, they added the posterior probabilities given by the tree as features to recognize DAs. With recent innovations in deep learning, researchers have extensively exploited different deep neural architectures to segment and detect DAs in dialog systems. The advantage is that heavy manual feature engineering is not necessary, and performance also increases. Note also that most recent work on DAs using this approach has modeled the task at both the utterance and discourse levels. For example, Kalchbrenner et al., 2013 [9] modeled the sentence using novel CNNs to convert the sentence into a vector. Then, an RNN was used to model the discourse. Lee et al., 2016 [13] presented a model based on recurrent neural networks and convolutional neural networks that incorporates the preceding short texts. Ji et al., 2016 [11] presented a novel latent variable recurrent neural network architecture for jointly modeling sequences of words and discourse relations between adjacent sentences. Kumar et al., 2018 [10] built a hierarchical recurrent neural network using bidirectional LSTM as a base unit and the conditional random field (CRF) as the top layer to classify each utterance into its corresponding DA. Chen et al., 2018 [5] incorporated hierarchical semantic inference with a memory mechanism into utterance modeling at multiple levels. Then, a structured attention network on the CRF is utilized to dynamically separate the utterances into cliques. Bothe et al., 2018 [3] proposed a novel context-based learning method to classify DAs using a character-level language model utterance representation, and reported a significant improvement.
Joint Models of Segmentation and Recognition
Most work on DA recognition assumed one single DA per utterance or considered predicting DAs after obtaining the gold functional segmentation. In fact, it has been observed that there exist utterances consisting of multiple DAs. So, the prerequisite is to segment utterances into different functional segments, each of which corresponds to one DA. These two tasks are closely related, in that DA segmentation is a pre-processing step for DA recognition, and the recognition of DAs helps in detecting segment boundaries. Some researchers proposed another direction which combines the two tasks in one go. For example, Morbini and Sagae, 2011 [14] presented a data-driven approach for the identification of multiple DAs in single utterances in the context of dialogue systems with limited training data. Zhao and Kawahara, 2017 [17] introduced a unified architecture which can (1) incorporate contextual information in DA recognition, and (2) integrate models for tasks of different levels as a whole. Zhao and Kawahara, 2018 [18] proposed a unified architecture based on neural networks, which consists of a sequence tagger for segmentation and a classifier for recognition. Zhao and Kawahara, 2019 [19] proposed an encoder-decoder model featuring joint coding, which incorporates contextual information by means of an attentional mechanism.
Work in Low-Resource Languages Like Vietnamese
For low-resource languages like Vietnamese, we found only one work in this field. Specifically, Ngo et al., 2018 [15] presented a first dialog corpus annotated for DAs using the ISO 24617-2 standard [4]. The corpus is quite small, with only 28 dialogues, 2273 turns and 5065 functional segments. They then used Maxent to conduct experiments with the pipeline approach. In this paper, we aim at building a larger dataset¹ in the Vietnamese retail domain. Based on this corpus, a wider range of experiments using traditional and deep learning methods in both independent and joint learning settings is performed. A technique to enrich the prediction models is also introduced to further enhance the performance.
3 Proposed Methods
This section first describes the methods for segmentation, then presents the methods for DA classification using the pipeline approach. After that, the methods for jointly learning the two tasks in one go are described.
¹ The dataset and baseline models will be freely available online once the paper is accepted for publication.
3.1 Dialog Segmentation
This task can be cast as a sequence labelling problem. In this paper, we investigate the robust DS models proposed by Zhao and Kawahara [18]. In the utterance representation layer, rather than using only GRUs, we also exploit other robust deep learning methods such as CNN [12] and LSTM [7], as well as a very recent innovation in pre-trained language models, i.e. BERT [6].
3.2 Dialog Act Classification
In classifying DAs, contextual information can be used to inform the prediction. This work also investigates the robust DA classification models proposed by Zhao and Kawahara [18]. In comparison to [18], we extend our work to investigate other robust deep learning methods such as CNN [12], LSTM [7], GRU [2], and BERT [6].
3.3 Joint Learning Architecture
We implemented the joint encoder-decoder model proposed by Zhao and Kawahara [19]. To further improve the performance, we also propose a technique to enrich the word embeddings with DA knowledge in the encoder layer. Figure 2 shows the joint learning architecture, which consists of the following components.
Encoder: This layer uses a GRU [2] to encode utterances based on token embeddings. It outputs the hidden states $\{h_1, h_2, \ldots, h_n\}$. To further enrich the token representations, we propose to integrate knowledge about DA labels via probabilistic embeddings for each token. This is motivated by the intuition that certain keywords are associated with certain DA labels and therefore act as indicators for classifying each segment. For an utterance of $n$ tokens, a probability matrix $X$ of size $n \times m$ is created, where $m$ is the number of DAs. An element $x_{ij}$ of $X$ is the probability that the $i$-th word of the utterance appears in texts labelled with the $j$-th DA.
Decoder: A unidirectional GRU decodes the output $\{y_1, y_2, \ldots, y_n\}$ as follows:
$$z_t = h_t \oplus \mathrm{TagEmbedding}(\hat{y}_{t-1}),$$
$$r_t = \mathrm{GRU}(z_t, r_{t-1}),$$
$$y_t = \mathrm{FCLayer}(r_t),$$
$$\hat{y}_t = \mathrm{DecodingAlgorithm}(y_{1:t}),$$
where $z_t$, the $t$-th input to the GRU, is the concatenation ($\oplus$) of the encoder output $h_t$ and the embedding of the previously decoded label $\hat{y}_{t-1}$. The encoder-decoder is expected to model an output sequence more accurately because the prediction of $y_t$ relies not only on the encoding $h_t$ but also on the embedding of the last prediction $\hat{y}_{t-1}$. To model the history context, an attention mechanism [19] is adopted to integrate the contexts of the previous $k$ utterances $c_1, \ldots, c_k$.
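The DA-knowledge enrichment of the encoder can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each entry of the matrix is estimated from training counts as the fraction of a word's occurrences that fall in segments labelled with a given DA, and that unseen words receive a uniform row. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def da_probability_table(segments, da_labels):
    """Estimate a distribution over DA labels for every training word.

    segments: list of (tokens, da_label) pairs from the training data.
    Returns {word: [p_1, ..., p_m]}, one probability per entry of da_labels.
    Assumption: probabilities are relative frequencies of the word per DA.
    """
    counts = defaultdict(Counter)
    for tokens, label in segments:
        for tok in tokens:
            counts[tok][label] += 1
    table = {}
    for tok, per_label in counts.items():
        total = sum(per_label.values())
        table[tok] = [per_label[label] / total for label in da_labels]
    return table


def utterance_matrix(tokens, table, num_labels):
    """Build the n x m matrix for one utterance (n tokens, m DA labels).

    Assumption: words never seen in training get a uniform row.
    """
    uniform = [1.0 / num_labels] * num_labels
    return [table.get(tok, uniform) for tok in tokens]


# Toy data (hypothetical, English for readability).
DA_LABELS = ["thanking", "request", "inform"]
TRAIN = [
    (["thank", "you"], "thanking"),
    (["please", "send", "it"], "request"),
    (["thank", "you", "please"], "thanking"),
    (["it", "is", "shipped"], "inform"),
]
TABLE = da_probability_table(TRAIN, DA_LABELS)
X = utterance_matrix(["thank", "you", "shop"], TABLE, len(DA_LABELS))
```

In a model, each row of the matrix would be concatenated with (or projected and added to) the corresponding token embedding before the encoder GRU; which of these the paper uses is not stated.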
Dialog Act Segmentation and Classification in Vietnamese
Fig. 2. A general architecture of the joint learning approach.
4 Corpus Construction
This section describes the rigorous annotation process used to build the Vietnamese corpus. The process consists of the following four main steps.
4.1 Collecting Raw Dialogues
We logged typed, natural human-human conversations between clients and clerks of Sendo² (one of the largest e-commerce retailers and online commerce platforms in Vietnam). We gathered the chat logs from March to November 2019 for the subsequent processing steps. The dialogues cover various scenarios, ranging from requesting basic information about products or services to ordering a product or service, making payments, and arranging shipping.
² https://www.sendo.vn/.
4.2 Designing Label Sets
A team consisting of all the group members, two linguists from Vietlex³, and one sales staff member at Sendo was established to design the labels. We composed the guidelines and trained the annotators to apply them.
Dialog Segmentation. To determine the functional segments, we followed the definition of Bunt et al. (2010), according to which an utterance is split into smaller units called functional segments, which are minimal stretches of communicative behaviour. Normally, each segment expresses exactly one DA. This makes assigning communicative functions simpler than it would be for long and complex turns (see Fig. 1).
Dialog Acts. After investigating different annotation schemas, we chose the ISO 24617-2 standard as a guideline for designing a customized set of DAs for Vietnamese (see Table 1). This standard is domain-independent and applicable to any kind of dialogue.
4.3 Annotating the Corpus
We hired staff from Vietlex to annotate the dataset. After 9 months, we had completed 1,000 dialogues with 16,434 utterances. The Kappa coefficient was 0.91, which is interpreted as almost perfect agreement between annotators.
4.4 Double-Checking and Correcting the Corpus
Each time we received a batch of 100 dialogues, two staff members double-checked the data by randomly selecting 30 dialogues. If the error rate was greater than 5%, we returned the data and asked the Vietlex team to re-edit all the data. If it was less than or equal to 5%, we sent them only the detailed errors. We then held a meeting with the Vietlex team to clarify the problematic cases and agree on how to resolve them. Accordingly, we also updated the guidelines and corrected similar errors in the data that had already been annotated. Some statistics about the corpus are given in Table 1.
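The two quantitative checks mentioned in Sects. 4.3 and 4.4 (inter-annotator agreement and the 5% acceptance threshold) can be sketched as follows. The paper does not say which Kappa variant was used or what counts as one error, so this sketch assumes Cohen's kappa for two annotators and an error rate computed over checked items; all names are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


def batch_decision(num_errors, num_checked, threshold=0.05):
    """Apply the acceptance rule of Sect. 4.4 to one checked sample.

    Assumption: the error rate is errors divided by checked items.
    """
    rate = num_errors / num_checked
    return "re-annotate batch" if rate > threshold else "send error list"


ANNOTATOR_A = ["inform", "inform", "request", "thanking"]
ANNOTATOR_B = ["inform", "request", "request", "thanking"]
KAPPA = cohen_kappa(ANNOTATOR_A, ANNOTATOR_B)
```

With the toy labels above, observed agreement is 0.75 and chance agreement is 0.3125, which gives a kappa of 7/11 (about 0.64).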
5 Experiments
5.1 Experimental Setups
We conducted 5-fold cross-validation. In each fold, we held out 10% of the training data to tune hyperparameters such as the number of epochs, the batch size, and the learning rate. As in previous work, we report precision, recall, and F1 score. For the DA task, because the label distribution is highly skewed, we report both F1-micro and F1-macro scores to give a more complete picture of the performance.
³ http://www.vietlex.com.
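The difference between the two averaged scores reported for the DA task can be made concrete with a short sketch for single-label classification. The function name and the toy labels are hypothetical; the definitions of micro- and macro-averaged F1 are the standard ones.

```python
from collections import Counter


def f1_micro_macro(gold, pred):
    """Return (micro F1, macro F1) for single-label classification."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gets a false positive
            fn[g] += 1  # gold label gets a false negative

    def f1(t, f_pos, f_neg):
        denom = 2 * t + f_pos + f_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts of all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Skewed toy data: the rare class is never predicted.
GOLD = ["inform"] * 8 + ["threat"] * 2
PRED = ["inform"] * 10
MICRO, MACRO = f1_micro_macro(GOLD, PRED)
```

On this toy data the micro score is 0.8 while the macro score is only 4/9, because the rare class contributes an F1 of zero with equal weight. This is why macro F1 is informative for a label set as skewed as the one in Table 1.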
Table 1. The number of samples per DA type.
initialGreeting: 547
setAnswer: 1291
returnGreeting: 362
complain: 39
propositionalQuestion: 1012
correction: 7
propositionalAnswer: 1086
suggestion: 11
acceptSuggestion: 4
disagreement: 23
positiveAutoFeedback: 1399
selfCorrection: 6
inform: 2878
thanking: 1049
pausing: 11
expression: 523
contactCheck: 23
initialGoodbye: 34
choiceQuestion: 66
apology: 32
declineOffer: 18
choiceAnswer: 63
checkQuestion: 443
acceptApology: 7
request: 956
disconfirm: 51
instruct: 22
acceptRequest: 696
negativeAutoFeedback: 17
returnGoodbye: 15
declineRequest: 63
promise: 92
stalling: 1
offer: 390
acceptThanking: 365
threat: 5
acceptOffer: 279
confirm: 397
contactIndication: 1
setQuestion: 1114
agreement: 36
5.2 Network Training
All experiments were performed using PyTorch. For BERT, we used viBERT⁴, which is optimized for Vietnamese. Word embeddings were initialized with 100-dimensional GloVe vectors trained for Vietnamese. For training, we set the learning rate to 0.005, the weight decay to 0.01, and the batch size to 16, and used AdamW as the optimizer. We tried 20, 30, and 40 epochs, of which 30 yielded the best performance. For generating character embeddings with a CNN, we used 128 hidden units, 100 dimensions, and a dropout rate of 0.5 to prevent over-fitting.
5.3 Experimental Results
We report two types of experimental results, for the pipeline and joint learning approaches described in Sect. 3. Table 2 shows the results.
Pipeline Approach: For the DS task, BERT+CRF outperformed the other methods. It improved the F1 score by more than 3 points over the BiLSTM-based models and reached 83.43%. For the task of DA recognition, the BERT-based model again proved very effective. With or without history context⁵, BERT outperformed the GRU and CNN models, most clearly when history context was used.
⁴ https://github.com/fpt-corp/viBERT.
⁵ In experiments, we used five previous utterances as history dialog contexts.
Table 2. Experimental results of different models using both approaches (i.e. pipeline and joint learning).
                             Dialog Segmentation     Dialog Act Recognition
Methods                      Pre     Rec     F1      Pre     Rec     F1-micro  F1-macro
Pipeline Approach
BiLSTM                       79.80   80.2    79.99   –       –       –         –
BiLSTM + CRF                 78.09   80.58   79.31   –       –       –         –
BERT + CRF                   84.88   82.02   83.43   –       –       –         –
GRU                          –       –       –       60.57   60.57   60.50     36.67
CNN                          –       –       –       59.67   60.00   60.00     38.27
BERT                         –       –       –       61.49   62.00   61.49     40.14
GRU + context                –       –       –       64.20   64.20   64.20     38.01
CNN + context                –       –       –       63.60   63.63   63.62     38.61
BERT + context               –       –       –       69.84   69.84   69.84     51.49
Joint Learning Approach
ED                           87.65   84.62   85.90   67.46   67.46   67.46     45.25
ED+context                   87.74   84.46   85.97   74.84   74.84   74.84     50.35
ED+context + DA-knowledge    88.00   85.01   86.00   74.75   74.75   74.75     51.10
The performance of the DA models improved further when contextual information was integrated: both the F1-micro and F1-macro scores increased for the GRU, CNN, and BERT models. This confirms that history context is effective for predicting DAs. The best pipeline model, BERT+context, achieved 69.84% F1-micro and 51.49% F1-macro.
Joint Learning Approach: We implemented several variants of the joint learning architecture, with and without history context via the attention mechanism. We also report results of enriching the joint model with DA knowledge. The results confirm that contextual information is valuable for predicting DAs: adding context raised F1-micro from 67.46% to 74.84% and F1-macro from 45.25% to 50.35%. Adding DA knowledge slightly improved the segmentation scores and the F1-macro score (51.10% vs. 50.35%), while F1-micro stayed almost unchanged (74.75% vs. 74.84%). Compared with the best pipeline model using BERT, the ED+context+DA-knowledge model improved the segmentation F1 score (86.00% vs. 83.43%) and the F1-micro score (74.75% vs. 69.84%) by a large margin and obtained a competitive F1-macro score (51.10% vs. 51.49%). With this model, we achieved 86% F1 in segmenting dialogues and 74.75% F1-micro in classifying DAs.
6 Conclusion
This paper presented a study on segmenting and classifying DAs in Vietnamese. To our knowledge, this is the first work to present a systematic study of these tasks for Vietnamese. We first introduced an annotated corpus for a task-oriented target domain, the retail domain. Based on this corpus, we explored different robust architectures following the two main approaches, pipeline and joint learning. Experimental results on this dataset show that the joint learning models were superior, outperforming strong pipeline models by a large margin. We also proposed a technique that integrates DA knowledge to enrich the joint models. The experimental results showed the effectiveness of DA knowledge in enhancing the performance of both tasks. The joint model ED+context integrated with DA knowledge yielded the best overall performance: an F1 score of 86% in segmenting dialogues, and an F1-micro score of 74.75% and an F1-macro score of 51.1% in classifying DAs. In the future, we would like to extend the corpus and annotate richer information, such as slot-value information, to help build Vietnamese chatbots in this emerging retail application domain. Additionally, more effort should be spent on improving the final prediction performance.
References
1. Ang, J., Liu, Y., Shriberg, E.: Automatic dialog act segmentation and classification in multiparty meetings. In: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2005, pp. 1061–1064 (2005)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 (2014)
3. Bothe, C., Weber, C., Magg, S., Wermter, S.: A context-based approach for dialogue act recognition using simple recurrent neural networks. In: Proceedings of the 11th LREC, pp. 1952–1957 (2018)
4. Bunt, H., et al.: ISO 24617-2: a semantically-based standard for dialogue annotation. In: Proceedings of LREC 2012, pp. 430–437 (2012)
5. Chen, Z., Yang, R., Zhao, Z., Cai, D., He, X.: Dialogue act recognition via CRF-attentive structured network. In: Proceedings of the 41st SIGIR 2018, pp. 225–234 (2018)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL, Minnesota, pp. 1–16 (2019)
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. J. Neural Comput. 9(8), 1735–1780 (1997)
8. Ivanovic, E.: Dialogue act tagging for instant messaging chat sessions. In: Proceedings of the 43rd ACL, pp. 79–84 (2005)
9. Kalchbrenner, N., Blunsom, P.: Recurrent convolutional neural networks for discourse compositionality. In: Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pp. 119–126 (2013)
10. Kumar, H., Agarwal, A., Dasgupta, R., Joshi, S.: Dialogue act sequence labeling using hierarchical encoder with CRF. In: Proceedings of the 32nd AAAI, pp. 3440–3447 (2018)
11. Ji, Y., Haffari, G., Eisenstein, J.: A latent variable recurrent neural network for discourse-driven language models. In: Proceedings of the 2016 NAACL HLT, pp. 332–342 (2016)
12. LeCun, Y., Bengio, Y.: Convolutional networks for images, speech, and time-series. In: The Handbook of Brain Theory and Neural Networks, pp. 255–258. MIT Press, Cambridge (1998)
13. Lee, J.Y., Dernoncourt, F.: Sequential short-text classification with recurrent and convolutional neural networks. In: Proceedings of the 2016 NAACL HLT, pp. 515–520 (2016)
14. Morbini, F., Sagae, K.: Joint identification and segmentation of domain-specific dialogue acts for conversational dialogue systems. In: Proceedings of the 49th ACL, pp. 95–100 (2011)
15. Ngo, T.L., Pham, K.L., Takeda, H.: A Vietnamese dialog act corpus based on ISO 24617-2 standard. Proc. LREC 2018, 39997–40001 (2018)
16. Stolcke, A., et al.: Dialogue act modeling for automatic tagging and recognition of conversational speech. Comput. Linguist. 26(3), 339–373 (2000)
17. Zhao, T., Kawahara, T.: Joint learning of dialog act segmentation and recognition in spoken dialog using neural networks. In: Proceedings of 8th International Joint Conference on Natural Language Processing, pp. 704–712 (2017)
18. Zhao, T., Kawahara, T.: A unified neural architecture for joint dialog act segmentation and recognition in spoken dialog system. In: Proceedings of the 19th SIGDIAL, pp. 201–208 (2018)
19. Zhao, T., Kawahara, T.: Joint dialog act segmentation and recognition in human conversations using attention to dialog context. Comput. Speech Lang. 57, 108–127 (2019)
ConDef: Automated Context-Aware Lexicography Using Large Online Encyclopedias
Houjun Liu(B) and Zachary Sayyah
The Nueva School, 131 E. 28th Avenue, San Mateo, CA, USA
[email protected], [email protected]
Abstract. Current automated lexicography (term definition) techniques cannot include contextual or new term information as a part of their synthesis. Our work proposes a novel data-harvesting scheme that leverages lead paragraphs in Wikipedia to create a dataset used to train automated, context-aware lexicographical models. Furthermore, in order to validate the harvested dataset, we present ConDef, a fine-tuned BART model trained on the harvested data which defines vocabulary terms from a short context. ConDef is shown to be highly accurate in context-dependent lexicography as validated on ROUGE-1 and ROUGE-L measures in a 1000-item withheld test set, achieving scores of 46.40% and 43.26% respectively. Furthermore, we demonstrate that ConDef’s synthesis serves as a good proxy for term definitions by achieving a ROUGE-1 measure of 27.79% directly against gold-standard WordNet definitions.
Keywords: Lexicography · Terminology · Conditional generation · Vocabulary data · Educational data mining
1 Introduction
In comparison to the highly researched subfield of abstractive summarization, there is relatively little research on the automated, direct synthesis of word-level summative definitions [4]. In response, we seek to create a dataset to aid in the creation of terminology models for automated, context-aware lexicography.
Human-written lexicographical databases such as WordNet have long been used to provide highly accurate word definitions [10]. However, these databases are often unable to index the contextual knowledge of a term, relying instead on external sense-prediction systems that model the context of the passage and predict the best definition for multi-sense or otherwise contextually dependent terms. Such database-rooted techniques also fail entirely when defining new, never-indexed terms [7].
With the introduction of the Transformer network, the field of Natural Language Generation (NLG) has seen rapid growth due to its newfound ability to produce high-quality output [11]. BART, a variant of the original Transformer network, improves upon the language-modeling capabilities of the Transformer by combining bi-directional encoding (as seen in the categorical sequence-to-value model BERT) and auto-regressive decoding (as in the GPT series of networks) to produce language matching the current state of the art. Such generation could readily be leveraged to create natural language definitions, given a sizable body of training data [6].
Online encyclopedias such as Wikipedia provide a unique opportunity to harvest a large dataset of human-written, context- and sense-specific term definitions, owing to Wikipedia’s exacting conventions [5]. The definitional “lead sentence” (first sentence) of each article, which by the convention of the Wikipedia Manual of Style [12] includes both the term and a description of it, serves as a good proxy for a term definition. Additionally, the rest of the “lead section” (i.e. the first paragraphs or abstract) of the article, with the lead sentence removed for prediction, serves as a suitable context in which the title (the “term”) is defined.
We present a context-specific lexicography dataset harvested from Wikipedia in the manner described above. Furthermore, we introduce ConDef, a BART network fine-tuned on the proposed data which produces definitions of arbitrary terms given a brief long-form prose context in which the term is used. Below, we describe how the dataset on which we trained ConDef was built, and then demonstrate that ConDef produces context-specific term definitions that achieve high validation benchmark results when compared against human-written definitions. Finally, we also discuss the potential use cases for such a model in expediting both manual lexicography and knowledge extraction.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 605–612, 2022. https://doi.org/10.1007/978-3-031-10464-0_41
2 Lexicography Dataset Construction
The performance of ConDef can be primarily attributed to the novelty of the dataset. In this work, we create a novel dataset leveraging the fact that the lead sentence of each article will almost always [12] contain the page title (acting as the term to define) as the subject of the sentence. In addition, contextual information relating to the background and explanation of the subject of the article is also provided in the same lead sentence, which makes lead sentences well suited as proxies for term definitions [5]. An example of a dataset pair is given below:
Term: Colombians
Context: This connection may be residential, legal, historical or cultural. For most Colombians, several (or all) of these connections exist . . . languages and religions have combined to form the culture of Colombia and thus a modern Colombian identity.
Target: Colombians are people identified with the country of Colombia.
2.1 Data Harvesting
Prior to data augmentation and processing, non-page artifacts were removed using a collection of regular expressions, and cross-links were indexed for later use as per Sect. 2.3.
Automated Context-Aware Lexicography
Wikipedia, like other collaborative texts, contains a large number of unfinished and barely proofread “stubs” (the remnants of partial articles). Longer articles, by contrast, tend to be reviewed by multiple editors and are therefore of higher quality. Removing such stubs was considered necessary to preserve the integrity of our data. Accordingly, pages with lead sections containing fewer than two sentences, or bodies containing fewer than 20 sentences, were discarded.
2.2 Data Cleaning and Sampling
Basic data cleaning and regularization steps were performed on the harvested dataset to ensure compatibility with the model and to improve training performance. The house style guide discussed in Sect. 2 also dictates that metadata regarding the subject (e.g. birth years, locations, pronunciation information, etc.) be placed in a parenthetical, semicolon-separated list [12] after the subject of the article in the lead sentence. During exploratory training, such information caused the model to consistently attach fluent, though almost always incorrect, metadata to any suitable subject. The resulting predictions contained generated misinformation about the biographical details of the term to be defined, itself placed in parenthetical lists. Because no other widespread use of parentheses was found in the lead sentences, we elected to remove all parenthetical information from the target definition strings to promote more accurate and consistent training results.
We additionally regularize the “term” (the title of the article) to lower case and remove any parenthetical information within the term (e.g. “Bar (law)”). These two procedures ensure that the input term does not reveal any information regarding the “sense” (meaning variant) of the term, which is critical if the model is to predict the sense from the context rather than simply extract it from the input term itself.
After data regularization, each training input sample is formed by conjoining, via a BART token, the term to define and its corresponding lead section (with the lead sentence removed); the output sample is simply the lead sentence itself (serving as the “definition”). All textual elements are tokenized via Byte-Pair encoding [3] and wrapped with the usual start and end sample tokens prescribed by the original training task of BART.
2.3 Data Augmentation
“Off-Center” Definitions. Early exploratory training revealed a weakness in using the lead section of an article as the “context” for term definitions as described above. Most lead sections are quite “information-rich”: they contain a high density of extractable content about the target term, which nearly obviates the need for the model to generalize beyond abstractive summarization. During the data harvesting process, however, we elected to follow and harvest the most-linked (twice or more) articles within the lead section of each article, as per Sect. 2.1, and to store the captured link relationship. These linked articles, referred to as the “off-center” (OC) articles, are used as a source for data augmentation that mitigates the aforementioned problem. The dataset is augmented such that a portion of it (the selection of the ratio is described in Sect. 3) uses the lead section of the main article to predict the lead sentence not of the article itself but of an OC article. The model is thereby induced to generalize further, by learning to predict the definition of a term from a context that uses, but does not directly describe, the term to be defined.
Noise Entries. Early testing revealed an issue whereby the model, when asked for a definition not derivable from the given context, would generate a grammatically correct yet nonsensical and/or tautological definition. Such a result is not acceptable in the context of definition generation. In response, we elected to introduce a new token to the model’s output set, representing “possible noise”. As an augmentation strategy, the input term in a small portion of the training dataset (as with OC samples, the ratio is discussed in Sect. 3) is pseudo-randomly swapped so that the term to define appears with an unrelated context, and the model is expected in those cases to produce this token in recognition of the failure.
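The sample construction described in Sects. 2.1 to 2.3 can be sketched end to end as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the separator string, the name of the noise token, the page dictionary layout, and all function names are hypothetical, and the default ratios are the ones reported in Sect. 3.

```python
import random
import re

SEP = " </s> "      # assumed separator; the paper does not name the token
NOISE = "<noise>"   # hypothetical name for the "possible noise" token

_PAREN = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(text):
    """Remove parenthetical spans, including nested ones."""
    previous = None
    while previous != text:
        previous = text
        text = _PAREN.sub("", text)  # removes innermost spans each pass
    return text.strip()


def is_stub(lead_sentences, body_sentences):
    """Stub filter of Sect. 2.1: short lead section or short body."""
    return lead_sentences < 2 or body_sentences < 20


def normalize_term(title):
    """Lower-case the title and drop sense hints such as "(law)"."""
    return strip_parentheticals(title).lower()


def make_sample(term, context, target):
    return {"input": normalize_term(term) + SEP + context,
            "target": strip_parentheticals(target)}


def build_samples(pages, oc_ratio=0.21, noise_ratio=0.10, seed=0):
    """pages: {title: {"lead_sentence", "lead_rest", "links": [titles]}}."""
    rng = random.Random(seed)
    titles = sorted(pages)
    samples = []
    for title in titles:
        page = pages[title]
        linked = [t for t in page["links"] if t in pages]
        roll = rng.random()
        if roll < noise_ratio and len(titles) > 1:
            # Noise entry: a swapped-in term paired with this context.
            other = rng.choice([t for t in titles if t != title])
            samples.append(make_sample(other, page["lead_rest"], NOISE))
        elif roll < noise_ratio + oc_ratio and linked:
            # Off-center entry: define a linked article from this context.
            oc = rng.choice(linked)
            samples.append(make_sample(oc, page["lead_rest"],
                                       pages[oc]["lead_sentence"]))
        else:
            samples.append(make_sample(title, page["lead_rest"],
                                       page["lead_sentence"]))
    return samples


PAGES = {
    "Colombians": {
        "lead_sentence": "Colombians are people identified with the country of Colombia.",
        "lead_rest": "This connection may be residential, legal, historical or cultural.",
        "links": ["Colombia"],
    },
    "Colombia": {
        "lead_sentence": "Colombia (officially the Republic of Colombia) is a country in South America.",
        "lead_rest": "Its capital is Bogotá.",
        "links": [],
    },
}
```

A real pipeline would also have to make sure that a swapped-in noise term is genuinely unrelated to the context (for example, not among the page's links); the sketch omits that check for brevity.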
3 Training, Tuning, and Validation
Full exports of the English Wikipedia online encyclopedia are available in XML format as a dataset containing roughly 54,000,000 articles. After the data cleaning procedures described in Sect. 2.1, which removed most stub and brief entries, only around 537,600 sample pages of the original 54,244,773 articles were placed in the published dataset. Augmentation of the dataset as described in Sects. 2.2 and 2.3 was then performed, resulting in the final dataset with which ConDef was trained. As the novelty of this work lies in the process of data mining and augmentation, most of the hyperparameters tuned for training concern data preparation and, specifically, augmentation. After manually evaluating pre-training results, the final evaluated model was trained with a learning rate of 3.2 × 10⁻⁶, a warm-up of 6800 steps, and a batch size of two. Dataset entries are capped at 215 words (roughly 1505 characters), close to the median of the content as per Fig. 1, and the dataset has an off-center mix of 21% and a training noise ratio of 10%, as per the procedures outlined above.
The weights of the proposed model were seeded from the BART (large) network [6,13] (24 layers, 16 heads, 406M parameters) and fine-tuned on the proposed dataset. The learning rate was controlled via cosine annealing with soft warm-ups [9], with one cycle completed per epoch (i.e. the learning rate decays to 0 and resets every epoch). As with the original BART training task, token-level categorical cross-entropy of the model output against the desired output was used as the training objective. The reported benchmark model was trained for two epochs, over a span of 10 h, until convergence on a single NVIDIA Titan V GPU.
A withheld test set was created containing 1,000 utterance pairs, split equally between OC samples (as per Sect. 2.3) and regular samples, none of which were seen by the model during training. An additional 230-sample validation subset was created by sampling gold-standard definitions from WordNet. The final trained model labeled each sample in the withheld test set, and the machine-generated labels were then benchmarked against the corresponding human-written labels.
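The learning-rate schedule described above can be sketched as a plain function of the step index. The peak rate and warm-up length are the reported values; the number of steps per epoch, the linear shape of the warm-up, and the assumption that the warm-up is repeated at the start of every cycle are not given in the text and are assumptions of this sketch.

```python
import math


def learning_rate(step, steps_per_epoch, peak_lr=3.2e-6, warmup_steps=6800):
    """Cosine annealing with one cycle per epoch and a linear warm-up.

    steps_per_epoch is a hypothetical parameter; it must exceed warmup_steps.
    """
    position = step % steps_per_epoch  # restart at every epoch boundary
    if position < warmup_steps:
        # Assumed linear ramp from 0 up to the peak rate.
        return peak_lr * position / warmup_steps
    progress = (position - warmup_steps) / (steps_per_epoch - warmup_steps)
    # Cosine decay from the peak rate towards 0 over the rest of the epoch.
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


STEPS_PER_EPOCH = 100_000  # illustrative value only
SCHEDULE = [learning_rate(s, STEPS_PER_EPOCH) for s in (0, 6800, 53_400, 100_000)]
```

Under these assumptions the rate starts at 0, reaches the peak of 3.2 × 10⁻⁶ after 6800 steps, falls to half of the peak midway through the decay phase, and resets to 0 when the next epoch begins.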
Fig. 1. (a)(b) Distribution of ROUGE-1 and ROUGE-L Recall Values over Test Set, with a Grey Line Representing Mean Values Reported above (both p < 0.001), (c)(d) Box Plot of Context and Target Lengths used for Prediction, (e) Distribution of ROUGE-1 Recall Values, Compared between OC and Non-OC Samples, (f) Correlation between Length of Context and ROUGE-1 Precision, p < 0.001, r ≈ −0.173
4 Results
After fine-tuning a BART-large model on the data prepared as described above, the trained model is evaluated using abstractive summarization metrics on a test set entirely withheld from the training data. The performance of the model on the validation data (both the withheld 1000-item test set and 210 WordNet definitions) is placed in context against benchmark results from leading work on abstractive summarization. The evaluation metrics chosen are the mean ROUGE-1 and ROUGE-L values [8], standard summarization metrics that measure unigram co-occurrence and the longest common subsequence, respectively, between prediction and target. Results from similar approaches on abstractive summarization tasks are also shown as benchmarks. The results are reported in Table 1.
Table 1. Results on the withheld test set and against WordNet, compared with abstractive summarization benchmarks [1, 2, 14]
Benchmark         ROUGE-1  ROUGE-L
This Work (test)  46.40    43.26
This Work (wn)    27.79    24.14
BART-RXF          44.38    41.17
Muppet BART       52.34    42.80
ERNIE-GEN         33.75    31.35
Furthermore, the model’s performance on various subsets of the validation data is measured and reported in Fig. 1. The model appears to perform better on shorter contexts: the Pearson correlation coefficient between context length and ROUGE-1 value is ≈ −0.173, tested against the null hypothesis of no correlation. Additionally, as expected given the additional generalization required for OC samples, the model performs better on non-OC samples. An example prediction is shown below:
Target: Colombians are people identified with the country of Colombia.
Output: Colombians are people who share a common connection to the country of Colombia.
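To make the two metrics concrete, the following sketch scores the example prediction above against its target. It is a simplified re-implementation for illustration (lower-casing and alphanumeric tokenization, no stemming or stop-word handling), not the official ROUGE package used for the reported numbers; the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def _scores(overlap, pred_len, target_len):
    precision = overlap / pred_len if pred_len else 0.0
    recall = overlap / target_len if target_len else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def rouge_1(prediction, target):
    """Unigram overlap with clipped counts."""
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(target))
    overlap = sum((pred & ref).values())
    return _scores(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, target):
    pred, ref = tokenize(prediction), tokenize(target)
    return _scores(lcs_length(pred, ref), len(pred), len(ref))


TARGET = "Colombians are people identified with the country of Colombia."
OUTPUT = ("Colombians are people who share a common connection "
          "to the country of Colombia.")
R1 = rouge_1(OUTPUT, TARGET)
RL = rouge_l(OUTPUT, TARGET)
```

For this pair, seven of the nine target tokens are recovered in order, so both metrics give a recall of 7/9 (about 0.78) and a precision of 7/13 (about 0.54).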
5 Conclusion
In this work, a dataset was created to train a model to synthesize highly accurate, context-aware definitions of arbitrary English phrases by leveraging data harvested from Wikipedia. Recognizing that the lead sentence (first sentence) of each article serves as a suitable proxy for a term definition in most encyclopedias (and specifically Wikipedia), we built a dataset to aid in automated lexicography. Large-scale data augmentation of the harvested data was performed to enable better generalization and noise recognition. The resulting model was validated against a withheld test set and WordNet using both ROUGE-1 and ROUGE-L metrics. On the withheld test set, the trained model met or exceeded most of the benchmark results in Table 1, corroborating its ability to perform the prescribed lexicographical task effectively. Future work could address the model’s preference for shorter contexts and improve its ability to generalize to indirect, off-center (OC) contexts.
As ROUGE-L and ROUGE-1 are both lexical similarity metrics, the low overall lexical overlap between WordNet and Wikipedia (around 34.7 and 29.38 on ROUGE-1 and ROUGE-L respectively) was reflected in the lower performance of the model when validated against WordNet than on the withheld test set. Although WordNet serves as a reasonable comparison point for validating model outputs independently, the model’s performance on the withheld test set cannot be directly compared with its performance on WordNet. Nevertheless, the performance of the model on WordNet (27.79 and 24.14 on ROUGE-1 and ROUGE-L) demonstrates the baseline ability of ConDef to serve as a proxy for word definitions.
Automatic lexicography can enable new applications in education, law, and other highly technical fields that need domain-specific vocabulary to be indexed and explained. Furthermore, such models can serve as a basis for manual lexicography, easing the building of literature that depends on access to an actively refreshed vocabulary resource.
Acknowledgments. The authors would like to thank Mr. Albert Huang for providing computational resources in both exploratory and benchmark training; Mr. Barack Yedidia and Mr. Zen Simone for their time, expertise, and editing; Mr. Wes Chao and Dr. John Feland for general guidance relating to experiment design; and finally Dr. Klint Kanopka for providing us with valuable insight and guidance regarding training and validation strategies.
References
1. Aghajanyan, A., Gupta, A., Shrivastava, A., Chen, X., Zettlemoyer, L., Gupta, S.: Muppet: massive multi-task representations with pre-finetuning. arXiv preprint arXiv:2101.11038 (2021)
2. Aghajanyan, A., Shrivastava, A., Gupta, A., Goyal, N., Zettlemoyer, L., Gupta, S.: Better fine-tuning by reducing representational collapse. In: Conference Proceedings of ICLR 2021 (2020)
3. Gage, P.: A new algorithm for data compression. C Users J. 12(2), 23–38 (1994)
4. Gantar, P., Kosem, I., Krek, S.: Discovering automated lexicography: the case of the Slovene lexical database. Int. J. Lexicogr. 29(2), 200–225 (2016)
5. Gantar, P., Kosem, I., Krek, S.: Discovering automated lexicography: the case of the Slovene lexical database. Int. J. Lexicogr. 29, 200–225 (2016)
6. Lewis, M., et al.: BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. Association for Computational Linguistics (2020)
7. Li, W., Suzuki, E.: Hybrid context-aware word sense disambiguation in topic modeling based document representation. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 332–341. IEEE (2020)
8. Lin, C.-Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81 (2004)
9. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
10. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
11. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
12. Wikipedia Contributors: Wikipedia: Manual of Style (2021). Accessed 1 Oct 2021
13. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics (2020)
14. Xiao, D., et al.: ERNIE-GEN: an enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In: Bessiere, C. (ed.) Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pp. 3997–4003. International Joint Conferences on Artificial Intelligence Organization (2020). Main track
On Sensitivity of Deep Learning Based Text Classification Algorithms to Practical Input Perturbations
Aamir Miyajiwala¹,³, Arnav Ladkat¹,³, Samiksha Jagadale¹,³, and Raviraj Joshi²,³(B)
¹ Pune Institute of Computer Technology, Pune, Maharashtra, India
² Indian Institute of Technology Madras, Chennai, Tamilnadu, India
[email protected]
³ L3Cube, Pune, Maharashtra, India
Abstract. Text classification is a fundamental Natural Language Processing task that has a wide variety of applications, where deep learning approaches have produced state-of-the-art results. While these models have been heavily criticized for their black-box nature, their robustness to slight perturbations in input text has also been a matter of concern. In this work, we carry out a data-focused study evaluating the impact of systematic practical perturbations on the performance of deep learning based text classification models such as CNN, LSTM, and BERT-based algorithms. The perturbations are induced by the addition and removal of unwanted tokens like punctuation and stop-words that are minimally associated with the final performance of the model. We show that these deep learning approaches, including BERT, are sensitive to such legitimate input perturbations on four standard benchmark datasets: SST2, TREC-6, BBC News, and TweetEval. We observe that BERT is more susceptible to the removal of tokens than to the addition of tokens. Moreover, LSTM is slightly more sensitive to input perturbations than the CNN-based model. The work also serves as a practical guide to assessing the impact of discrepancies in train-test conditions on the final performance of models.

Keywords: Natural language processing · Input perturbations · Model robustness · Text classification · Input preprocessing

1 Introduction
Text classification is a fundamental task of NLP and is used in simplifying tasks like document classification, sentiment analysis of consumer feedback, and fake news detection [13,16,28]. Deep learning models have been used widely to tackle text classification as they tend to perform best compared to other machine learning techniques [15,27]. Some of the most popular methods used for this task are based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) [11,30,31]. Pre-trained language models such as Bidirectional Encoder Representations from Transformers (BERT), based on the transformer architecture, perform better than neural networks trained from scratch [3]. Although accuracy has always been the primary focus of evaluation, it might overestimate the performance of NLP models and fails to account for the robustness of a model to small changes in the input data [7,12,18,22]. The sensitivity of deep learning models to input data leads to practical challenges while deploying the stochastic models in a consumer environment. It is therefore important to study the sensitivity of the models to small input perturbations. In this work, we aim to highlight the limitations of deep learning based text classification models in different practical scenarios [9,21]. The analysis also highlights the need to keep deterministic checks in place while building real-time machine learning systems.

A. Miyajiwala, A. Ladkat, and S. Jagadale contributed equally.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 613–626, 2022. https://doi.org/10.1007/978-3-031-10464-0_42

Previous research concerning the robustness of models studied the vulnerabilities of DNNs by devising adversarial attacks [5,10,17,20,23]. These attacks were carried out by making small perturbations in the input sample, primarily in the form of word substitutions [19]. Our work, however, is not concerned with attacks on DNNs or random input perturbations. In this work, we take a black-box approach to determine the impact on the performance of DNNs by inducing input perturbations in the form of systematic addition and removal of tokens. A similar black-box study evaluating the impact of word order on model performance was carried out in [25]. We use standard pre-processing methods as a proxy to introduce simple valid perturbations. Our experimental setup can be described as the application of standard pre-processing on training data followed by evaluation of the performance on raw un-processed text data, and vice-versa.
We create discrepancies in the train and test data by introducing practical changes like the presence or absence of punctuation, stop-words, and out-of-vocabulary words. We show that deep learning based models are sensitive to this simple addition or removal of unwanted tokens. We alternate between trainable and static word embeddings during training to evaluate their impact. The three broad classes of models based on CNN, LSTM, and Transformer-based BERT are evaluated on the standard SST2, TREC, BBC News, and TweetEval datasets. The motivation behind this approach is that these conditions represent a real-world environment where one could unknowingly have such discrepancies. Also, this study is a step forward in determining whether any particular type of input configuration gives the best performance across models.

The paper is organized as follows. We present an overview of the effects of various pre-processing techniques and noisy data on text classification in Sect. 2. Our experimental setup, where we discuss the datasets, models, methodology, and proposed approach, is described in Sect. 3. Finally, the results and interpretations are summarized in Sect. 4, with the conclusion in Sect. 5.
2 Related Work
Our work is at the intersection of pre-processing methods and evaluating model robustness. This section briefly describes the previous works related to pre-processing techniques, noise present in the data, and their individual effects on text classification models.

Various studies on pre-processing techniques can be found in the literature, with the aim of finding the optimal combination of cleaning methods. More than 26 pre-processing methods were identified in [1] and were applied to data from Arabic-language social media. The results showed that a specific combination of four of these techniques (the removal of duplicate words and three variants of normalizing Arabic letters) improved the F1 score of the MNB classifier by two percent. A similar experiment was done in [8], which explored the impact of six basic pre-processing methods, i.e., spelling correction, conversion of upper-case to lower-case letters, reduction of repeated characters, HTML tag removal, stopwords removal, and punctuation removal, on four benchmark text corpora using three machine learning methods. It was seen that at least one combination of basic pre-processing methods could be recommended to improve the accuracy results. An extensive comparison of fifteen pre-processing methods on CNN, LSTM, and BERT in [4] across two Twitter datasets, namely SS-Twitter and SemEval, concludes that stemming, replacement of repetitions of punctuation, and removing numbers are the recommended techniques. The non-recommended techniques comprised spelling correction, replacing negations with antonyms, handling capitalized words, replacing slang, and removing punctuation. A study of stopwords removal on sentiment classification models was done in [6]. It found that, based on term weighting techniques, the traditional sentiment classifier displays a 9 percent rise in accuracy when stopwords are removed. Other approaches like ARTFSC, SentiTFIDF, and RTFSC did not vary much on the removal of stopwords.

The work in [14] compares accuracy on a clean dataset with cross-validation performance measured on the dirty training dataset, reporting the performance of CNN, fastText, and bag-of-words SVM models on 20-Newsgroups, 2016 Yelp Reviews, and a synthetically created dataset from five different document collections. It was observed that the clean dataset consistently outperformed the cross-validation results on the dirty training dataset. However, they do not report results on experiments in which clean and unclean datasets are combined across the training and testing data.

Sensitivity analysis of one-layer CNNs was performed in [30]. They report the results of experiments exploring different configurations of CNNs run over nine sentence classification datasets. The analysis covers the effects of input word vectors, filter region size, the number of feature maps for each filter region size, activation function, pooling strategy, and regularisation on the one-layer CNN model. That study focuses on the model hyper-parameters, whereas we look at sensitivity from the input data perspective.
Our work is along the lines of [19]. However, instead of using random character-level and word-level perturbations, we use input perturbations that are induced by the addition and removal of tokens in the form of stopwords and punctuation.
3 Experimental Setup

3.1 Dataset Description
We use the Stanford Sentiment Treebank (SST-2) [24], Text Retrieval Conference (TREC-6) [26], TweetEval [2], and BBC News¹ datasets for our study. These datasets cover both binary and multi-class classification.

The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. It uses the two-way (positive/negative) class split, with only sentence-level labels, and consists of a total of 8741 data samples, of which 6920 are training set samples and 1821 are test set samples, with an average of 19 tokens per sentence.

The TREC-6 dataset consists of open-domain, fact-based questions categorized into 6 classes: Abbreviation, Description and abstract concepts, Entities, Human beings, Locations, and Numeric values. This dataset has 5452 training samples and 500 test samples, with an average of 10 tokens per sentence.

The TweetEval dataset consists of seven heterogeneous tasks on Twitter data, all framed as multi-class tweet classification: irony, hate, offensive, stance, emoji, emotion, and sentiment. We perform text classification on the sentiment analysis task. It consists of 45615 sentences in the training data, 12284 in the test data, and 200 in the validation set. The data is classified into three labels.

The BBC News dataset consists of 2225 documents from the BBC news website, with stories that belong to five topical areas: business, entertainment, politics, sport, and tech.

3.2 Models
We use the most basic architectures from the CNN and LSTM families in order to avoid any bias towards architecture-specific modifications. The CNN model uses two convolutional layers with 512 and 256 filters in the first and second layer, respectively, each having a filter size of 5. This is followed by a GlobalMaxPool layer. Two linear layers are stacked on top of the GlobalMaxPool layer, one having 128 nodes and the other being the output layer. A dropout of 20% is also included before the output layer.

The second classifier is a uni-directional LSTM model having two LSTM layers with 512 nodes each, followed by a GlobalMaxPool layer. Two linear layers are further stacked on the GlobalMaxPool layer, one containing 256 nodes and the other being the output layer.

The third classifier we have used is the BERT base cased² model, pre-trained on the BookCorpus (800 M words) and English Wikipedia (2,500 M words) datasets [29]. It consists of 12 layers of bidirectional Transformer based encoder blocks, wherein each layer has a total of 12 self-attention heads [3]. We observe that the BERT base uncased model provides similar results, so we only focus on the cased model to avoid repetition.

¹ http://mlg.ucd.ie/datasets/bbc.html
² https://huggingface.co/docs/transformers/model_doc/bert

3.3 Methodology
The methodology covers the pre-processing methods, the training of the word embeddings, and the use of OOV tokens. Each of these is described in detail below.

Pre-processing Techniques. To clean the dataset, we use the following pre-processing techniques:

1. Expanding contractions. Contractions are words or combinations of words that are shortened by dropping letters and replacing them with an apostrophe. Here, we remove such contractions and replace them with the expanded words. For example, contractions like ‘can’t’ and ‘mightn’t’ are expanded to ‘can not’ and ‘might not’, respectively.
2. Removal of special characters. Non-word characters, i.e., all the characters which are not from [a to z], [A to Z], and [0 to 9], are removed. Thus, we get rid of characters like ‘\’ and ‘!’ in the data.
3. Removal of single characters. After the removal of special characters from ‘\t’ and ‘\n’, single characters like ‘t’ and ‘n’ are left behind. This step removes such single characters.
4. Substituting multiple spaces with a single space. The application of the above techniques tends to generate many runs of multiple spaces. We substitute each such run with a single space.
5. Lowercasing. The most common pre-processing technique is to convert all the words to lowercase. Suppose the word ‘Bold’ occurs twice, once as ‘BOLD’ and the second time as ‘Bold’. Both of them are represented as two different words in the vector space model, even though they mean the same. To avoid such unnecessary vectors and reduce the dimensions, lowercasing is used.
6. Removal of stopwords. Stopwords are a collection of the most frequently occurring words like ‘an’, ‘of’, ‘the’. Such words do not have a large impact on the model since they carry the least information required for training. By removing these stopwords, we tend to shift the focus from less important words to more important words, thus improving the accuracy of the classification models.

The first five steps are grouped together and termed basic pre-processing steps. Removal of stop-words is considered separately during the analysis since it leads to large-scale token removal or addition. We have used the data in different configurations as described below:

– Clean Data: Apply the first five basic pre-processing steps to the data.
– Clean Data - stop words: Further remove stop words as well after applying basic pre-processing.
– Unclean Data: No pre-processing is applied to this data, and non-alphanumeric tokens are separated from main tokens with a space.

These pre-processing steps are used to simulate the addition and removal of tokens during model evaluation. The addition of tokens can be simulated by training the model on pre-processed (clean) data and evaluating it on the un-processed (unclean) data. Similarly, the removal of tokens can be simulated by training the model on unclean data and evaluating it on clean data. These simple pre-processing steps, which affect only the unwanted tokens, can be used to induce perturbations for evaluating the sensitivity of models to input tokens. Since we are not affecting the important tokens, the drop in accuracy of the model signifies the sensitivity of the model to input tokens.

Word Embeddings. In this work, we use pre-trained FastText word embeddings, which are an extension of the popular Word2Vec embeddings. FastText does not learn a vector for each word directly; instead, each word is represented as n-grams of characters. The benefit of using FastText over GloVe or traditional Word2Vec is that it is able to generate an embedding even for an out-of-vocabulary word, as it splits the word up into character n-grams that can be found in the training data. The embedding layer of the CNN and LSTM models is initialized using pre-trained FastText word vectors. The embedding layer can be kept ‘trainable’ to fine-tune its weights for the downstream task. In our analysis, we have experimented with both trainable and static word embeddings.

OOV Tokens. Word-based text processing models suffer from the problem of the Out Of Vocabulary (OOV) token due to the limited size of the vocabulary. A special OOV token is reserved for the out-of-vocabulary words encountered during testing.
We drop 10% of the least frequent words in the training data from the vocabulary so that the model sees the OOV token during training as well. During testing or evaluation, a model may encounter OOV tokens in practice, so we evaluate the effect of OOV tokens on model performance. The input perturbations are simulated in such a way that they result in the addition of OOV tokens. To simulate this, the model is trained on pre-processed clean data and the vocabulary is created using the tokens remaining after cleanup. The unwanted or filtered tokens are not made part of the vocabulary. When the model is then tested on unclean data with all the tokens, the unwanted tokens are mapped to the OOV token, thus simulating the addition of OOV tokens.

3.4 Proposed Approach
We assess the robustness of the model by simulating perturbations in the form of discrepancies between train and test data. The proposed approach is shown in Fig. 1 and the configurations under consideration are shown in Fig. 2. Only one set is pre-processed at a time, thus leading to a train-test mismatch. These discrepancies fall into three categories:

– Addition of tokens: Here we train on clean data and test on unclean data. The vocabulary is created before any pre-processing of the data to retain all the tokens. The extra tokens in unclean data will not be mapped to OOV in this case. Note that the vocabulary and OOV token specifications are not relevant to the BERT model.
– Addition of OOV tokens: Here we train on clean data and test on unclean data. However, the vocabulary is created using the pre-processed data to leave out unwanted tokens from the vocabulary. Thus the extra tokens in unclean data will now map to the OOV token, as they are not part of the vocabulary.
– Removal of tokens: This corresponds to training on unclean data and testing on clean data. The vocabulary is created using unclean data.

Fig. 1. Proposed approach

Fig. 2. Prepared data sets

The clean data mentioned in the above three configurations is pre-processed in two ways:

– Basic pre-processing, which involves the first five steps of pre-processing defined previously.
– Basic pre-processing with the removal of stopwords. Removal of stop words results in larger-scale addition and removal of tokens than basic pre-processing alone.

A comparative analysis is performed with respect to the above configurations, and their impact on the performance of the models is discussed in the next section.
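The three train/test mismatches described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the contraction map, the stopword list, the `<OOV>` token name, and all function names are hypothetical; the 10% rare-word cut-off is omitted; and for the "addition of tokens" case the vocabulary is assumed to be the union of raw and cleaned training tokens, so that lowercased training tokens are covered as well.

```python
import re

# Illustrative subsets only: the paper does not list its contraction map or stopword list.
CONTRACTIONS = {"can't": "can not", "mightn't": "might not"}
STOPWORDS = {"a", "an", "of", "the", "is", "in"}
OOV = "<OOV>"  # hypothetical name for the reserved out-of-vocabulary token


def basic_clean(text):
    """Basic pre-processing: the five steps of Sect. 3.3, applied in order."""
    for short, full in CONTRACTIONS.items():            # 1. expand contractions
        text = re.sub(re.escape(short), full, text, flags=re.IGNORECASE)
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)           # 2. remove special characters
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)           # 3. remove single characters
    text = re.sub(r"\s+", " ", text).strip()            # 4. collapse multiple spaces
    return text.lower()                                 # 5. lowercase


def clean_tokens(text, drop_stopwords=False):
    """Clean data, optionally with stopwords removed as well."""
    tokens = basic_clean(text).split()
    return [t for t in tokens if not (drop_stopwords and t in STOPWORDS)]


def unclean_tokens(text):
    """Unclean data: no cleaning; non-alphanumeric characters become separate tokens."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)


def build_vocab(token_lists):
    return {tok for tokens in token_lists for tok in tokens}


def encode(tokens, vocab):
    """Map every token outside the vocabulary to the OOV token."""
    return [tok if tok in vocab else OOV for tok in tokens]


def make_split(train_texts, test_texts, mode, drop_stopwords=False):
    """Return (train_tokens, test_tokens, vocab) for one train/test mismatch."""
    clean_train = [clean_tokens(t, drop_stopwords) for t in train_texts]
    raw_train = [unclean_tokens(t) for t in train_texts]
    if mode == "addition":        # C-U 2: vocabulary created before pre-processing
        vocab = build_vocab(raw_train + clean_train)    # union is an assumption
        return clean_train, [unclean_tokens(t) for t in test_texts], vocab
    if mode == "addition_oov":    # C-U: vocabulary created from pre-processed data
        return clean_train, [unclean_tokens(t) for t in test_texts], build_vocab(clean_train)
    if mode == "removal":         # U-C: vocabulary created from unclean data
        clean_test = [clean_tokens(t, drop_stopwords) for t in test_texts]
        return raw_train, clean_test, build_vocab(raw_train)
    raise ValueError(f"unknown mode: {mode}")


train_texts = ["I can't fault the acting!"]
test_texts = ["I can't fault the plot!"]
for mode in ("addition", "addition_oov", "removal"):
    _, test, vocab = make_split(train_texts, test_texts, mode)
    print(mode, encode(test[0], vocab))
```

On this toy pair, the same unclean test sentence yields one OOV token when the vocabulary is built before pre-processing and five when it is built from the cleaned training data, which is the distinction between the first two configurations.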
Table 1. Number of unknown and total tokens in all four datasets; Total Tokens (TT); OOV Token - Unknown Token (OT) in %; RS (Removed Stopwords); C (Clean); C-U (CleanUnclean); U (Unclean); U-C (UncleanClean); C 2 (Clean, Generation of Vocabulary before Pre-processing); C-U 2 (CleanUnclean, Generation of Vocabulary before Pre-processing)

Dataset     Config   OT       TT        OT [RS]   TT [RS]
SST-2       C        1,998    30,277    1,995     17,629
            C-U      3,038    32,891    17,132    32,891
            C 2      2,018    30,277    2,005     17,629
            C-U 2    2,071    32,891    2,071     32,891
            U        2,031    32,891    2,031     32,891
            U-C      2,006    30,277    2,005     17,629
TREC-6      C        349      3,123     349       1,493
            C-U      443      3,309     443       3,309
            C 2      350      3,123     350       1,493
            C-U 2    357      3,309     357       3,309
            U        357      3,309     357       3,309
            U-C      350      3,123     350       1,493
Tweet Eval  C        16,053   179,281   16,051    118,452
            C-U      26,988   193,589   92,961    193,589
            C 2      16,139   179,385   16,139    118,452
            C-U 2    19,255   193,489   19,255    193,589
            U        16,091   179,381   16,091    118,452
            U-C      19,205   193,589   19,205    193,589
BBC News    C        4,273    170,845   4,269     100,423
            C-U      11,316   183,377   87,433    183,377
            C 2      4,574    170,854   4,574     100,423
            C-U 2    4,738    183,377   4,738     183,377
            U        4,531    179,854   4,531     100,423
            U-C      4,493    183,377   4,493     183,377

4 Results
In this section, we discuss the sensitivity of our models with respect to the defined perturbations. The accuracies of the models are compared with the clean (train) - clean (test) and unclean (train) - unclean (test) baselines. In the baseline models, both train and test data follow the same pre-processing steps. The results for the four datasets are given in Table 2. The statistics of out-of-vocabulary tokens for each configuration are given in Table 1. For the TREC dataset, all the models are very sensitive to the input perturbations, with a significant drop in performance when we remove stopwords during pre-processing. We analyzed the number of tokens after pre-processing with the removal of stopwords against the raw text. In the raw text, we have an average of 10 tokens per sample, but when we remove stopwords during pre-processing, only two tokens remain in the TREC-6 dataset. This explains the extreme sensitivity of all the models on TREC-6 data.

Table 2. Results for all four datasets; EmTr (Embeddings Trainable); Acc (Accuracy) in %; RS (Removed Stopwords); C (Clean); C-U (CleanUnclean); U (Unclean); U-C (UncleanClean); C 2 (Clean, Generation of Vocabulary before Pre-processing); C-U 2 (CleanUnclean, Generation of Vocabulary before Pre-processing)

                             SST-2            TREC-6           Tweet-eval       BBC News
Model      EmTr   Config     Acc    Acc [RS]  Acc    Acc [RS]  Acc    Acc [RS]  Acc    Acc [RS]
CNN        False  C          71.49  69.30     83.16  62.96     61.09  59.68     95.37  97.35
                  C-U        69.20  56.42     81.20  53.08     59.73  57.83     93.53  86.83
                  C 2        74.67  69.18     83.60  60.88     61.30  60.45     95.51  96.22
                  C-U 2      73.47  69.52     83.56  57.00     59.43  58.63     94.83  95.01
                  U          73.26  71.95     83.76  84.28     60.65  61.04     94.83  94.97
                  U-C        72.27  70.33     84.76  41.56     60.84  58.74     93.30  93.66
CNN        True   C          77.19  76.17     88.24  71.88     59.09  59.11     95.28  97.35
                  C-U        76.27  59.78     84.00  61.04     58.62  59.33     94.61  91.69
                  C 2        77.90  75.89     86.96  74.00     59.57  58.93     95.51  96.58
                  C-U 2      77.17  75.58     87.28  55.88     59.67  59.13     95.28  95.15
                  U          77.05  77.40     86.48  86.44     60.49  59.72     95.19  95.51
                  U-C        76.80  74.66     88.60  46.40     60.27  58.78     93.93  93.57
LSTM       False  C          76.34  72.62     82.84  72.52     64.19  63.62     95.60  95.73
                  C-U        71.30  51.73     82.44  45.20     63.76  62.56     91.51  65.98
                  C 2        77.02  74.49     83.12  69.84     64.01  62.48     94.74  94.79
                  C-U 2      75.29  74.06     81.72  60.48     64.78  63.59     93.89  88.54
                  U          76.83  74.50     83.48  85.08     64.58  64.75     93.21  94.47
                  U-C        74.94  72.50     83.60  63.64     64.40  62.67     95.46  94.88
LSTM       True   C          77.47  75.57     84.92  73.00     61.25  59.29     96.81  96.81
                  C-U        77.94  63.21     85.16  56.92     60.59  58.37     94.29  72.90
                  C 2        78.90  77.40     85.40  73.40     61.65  59.74     94.25  95.33
                  C-U 2      78.77  75.98     85.60  57.52     60.54  58.94     94.16  92.72
                  U          78.30  77.91     86.36  85.96     61.22  60.75     95.42  95.91
                  U-C        76.86  74.71     85.56  69.16     61.03  59.84     93.98  95.42
BERT-base  True   C          90.62  84.11     96.16  85.64     67.19  64.97     97.66  97.66
                  C-U        90.73  86.82     86.44  55.20     67.44  65.42     98.43  97.98
                  U          91.11  90.94     97.24  96.88     68.26  68.13     98.02  98.16
                  U-C        90.01  81.79     93.00  48.80     67.14  64.29     94.97  92.94

4.1 Addition of Tokens
The addition of tokens corresponds to the clean-unclean (C-U 2) case and is compared against the clean-clean (C 2) model, as visualized in Figs. 3 and 4. We notice that the addition of tokens does not have much effect on the CNN and BERT models. There is a slight decrease in accuracy for CNN; however, the decrease is more pronounced for the LSTM model. Also, the impact of addition is greater when the embeddings are static. Similarly, the addition of a large number of tokens in the form of stop words has more impact on the LSTM model.

Fig. 3. The accuracy difference for the trainable token embeddings configuration, without and with stopwords removal; addition of OOV tokens consists of the difference between the accuracy of Clean (C) and the accuracy of Clean-Unclean (C-U); addition of tokens consists of the difference between the accuracy of Clean 2 (C 2) and the accuracy of Clean-Unclean 2 (C-U 2); removal of tokens consists of the difference between the accuracy of Unclean (U) and the accuracy of Unclean-Clean (U-C)

4.2 Removal of Tokens
The removal of tokens corresponds to the unclean-clean (U-C) case and is compared against the unclean-unclean (U) model, as shown in Figs. 3 and 4. The CNN model is found to be quite robust to the removal of both small and large numbers of tokens, whereas the LSTM model shows a drop in performance even when a small number of tokens are removed for a couple of datasets. Finally, in the case of the BERT model, removing a small number of tokens does not have much impact, but when a large number of tokens are removed even BERT shows a significant drop in performance. Since BERT is originally a language model, such a drop is to be expected.
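As a worked example of the comparison described above, the drop for the removal-of-tokens case can be computed as the U accuracy minus the U-C accuracy. The figures below are copied from Table 2 (trainable embeddings; SST-2 and TREC-6 only, to keep the sketch short). The dictionary layout, the function name, and the sign convention (positive means accuracy was lost) are illustrative assumptions.

```python
# Accuracies (%) copied from Table 2, trainable embeddings, stored as (Acc, Acc [RS]).
TABLE2 = {
    ("CNN", "SST-2"):   {"U": (77.05, 77.40), "U-C": (76.80, 74.66)},
    ("LSTM", "SST-2"):  {"U": (78.30, 77.91), "U-C": (76.86, 74.71)},
    ("BERT", "SST-2"):  {"U": (91.11, 90.94), "U-C": (90.01, 81.79)},
    ("CNN", "TREC-6"):  {"U": (86.48, 86.44), "U-C": (88.60, 46.40)},
    ("LSTM", "TREC-6"): {"U": (86.36, 85.96), "U-C": (85.56, 69.16)},
    ("BERT", "TREC-6"): {"U": (97.24, 96.88), "U-C": (93.00, 48.80)},
}


def removal_drop(model, dataset):
    """Accuracy drop U - (U-C), without and with stopword removal."""
    row = TABLE2[(model, dataset)]
    return tuple(round(u - uc, 2) for u, uc in zip(row["U"], row["U-C"]))


for model, dataset in TABLE2:
    basic, with_rs = removal_drop(model, dataset)
    print(f"{model:5s} {dataset:7s} drop: {basic:6.2f}   drop [RS]: {with_rs:6.2f}")
```

For instance, BERT on SST-2 loses 1.10 points when only the basic pre-processing tokens are removed and 9.15 points when stopwords are removed as well.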
Fig. 4. The accuracy difference for the static token embeddings (trainable = False) configuration, without and with stopwords removal; addition of OOV tokens consists of the difference between the accuracy of Clean (C) and the accuracy of Clean-Unclean (C-U); addition of tokens consists of the difference between the accuracy of Clean 2 (C 2) and the accuracy of Clean-Unclean 2 (C-U 2); removal of tokens consists of the difference between the accuracy of Unclean (U) and the accuracy of Unclean-Clean (U-C)

4.3 Addition of OOV Tokens
The addition of OOV tokens corresponds to the clean-unclean (C-U) case and is compared against the clean-clean (C) model in Figs. 3 and 4. This case is specific to the CNN and LSTM models, as there is no OOV token in the BERT model. Both models are robust to small additions of OOV tokens. However, as we increase the number of OOV tokens by including stop words, the performance drop for both the CNN and LSTM models is significant. Again, the drop in performance is more pronounced when the embeddings are static.
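To relate the amount of added OOV tokens to the observed accuracy change, the SST-2 entries of Tables 1 and 2 can be put side by side. The sketch below is illustrative only: it treats the OT values in Table 1 as counts (their magnitudes suggest counts, although the caption labels them as %), it uses the static-embedding rows of Table 2, and the data layout and function names are assumptions.

```python
# (OOV tokens, total tokens) for SST-2, copied from Table 1.
OOV_STATS = {
    "C":   {"basic": (1998, 30277), "RS": (1995, 17629)},
    "C-U": {"basic": (3038, 32891), "RS": (17132, 32891)},
}

# Accuracies (%) for SST-2 with static embeddings, copied from Table 2.
ACC = {
    "CNN":  {"C": {"basic": 71.49, "RS": 69.30}, "C-U": {"basic": 69.20, "RS": 56.42}},
    "LSTM": {"C": {"basic": 76.34, "RS": 72.62}, "C-U": {"basic": 71.30, "RS": 51.73}},
}


def oov_share(config, variant):
    """Percentage of tokens mapped to the OOV token (assumes OT is a count)."""
    oov, total = OOV_STATS[config][variant]
    return 100.0 * oov / total


def accuracy_drop(model, variant):
    """Accuracy of C minus accuracy of C-U for the given pre-processing variant."""
    return ACC[model]["C"][variant] - ACC[model]["C-U"][variant]


for variant in ("basic", "RS"):
    print(f"{variant:5s} OOV share: C {oov_share('C', variant):5.2f}%  "
          f"C-U {oov_share('C-U', variant):5.2f}%")
    for model in ACC:
        print(f"      {model:4s} accuracy drop: {accuracy_drop(model, variant):5.2f}")
```

Under these assumptions, the share of OOV tokens in the C-U setting grows from roughly 9% with basic pre-processing to roughly 52% once stopwords are removed, while the static CNN's drop grows from 2.29 to 12.88 points.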
5 Conclusion
In this work, we analyze the impact of practical input perturbations on deep learning based text classification algorithms. The perturbations studied are in the form of the addition and removal of unwanted tokens. The unwanted tokens are characterized as punctuation and stop words, which are minimally important for the classification process. The models under consideration are CNN, LSTM, and BERT. We show that these models are robust to a small number of additions or removals. However, as we increase the number of perturbations, the performance of the models drops significantly, even for the BERT model. The models are more sensitive to removals than to additions. Also, the models are more robust with trainable word embeddings. We also find that CNN is more robust to such changes than the LSTM model. Overall, we show that these deep learning models are sensitive to such simple perturbations in input data.

Acknowledgments. This work was done under the L3Cube Pune mentorship program. We would like to express our gratitude towards our mentors at L3Cube for their continuous support and encouragement.
References
1. Albalawi, Y., Buckley, J., Nikolov, N.S.: Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media. J. Big Data 8(1), 1–29 (2021)
2. Barbieri, F., Camacho-Collados, J., Neves, L., Espinosa-Anke, L.: Tweeteval: unified benchmark and comparative evaluation for tweet classification. arXiv preprint arXiv:2010.12421 (2020)
3. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
4. Effrosynidis, D., Symeonidis, S., Arampatzis, A.: A comparison of pre-processing techniques for Twitter sentiment analysis. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L., Karydis, I. (eds.) TPDL 2017. LNCS, vol. 10450, pp. 394–406. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-67008-9_31
5. Gao, J., Lanchantin, J., Soffa, M.L., Qi, Y.: Black-box generation of adversarial text sequences to evade deep learning classifiers. In: 2018 IEEE Security and Privacy Workshops (SPW), pp. 50–56. IEEE (2018)
6. Ghag, K.V., Shah, K.: Comparative analysis of effect of stopwords removal on sentiment classification. In: 2015 International Conference on Computer, Communication and Control (IC4), pp. 1–6. IEEE (2015)
7. Goel, K., Rajani, N.F., Vig, J., Taschdjian, Z., Bansal, M., Ré, C.: Robustness gym: unifying the NLP evaluation landscape. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, pp. 42–55 (2021)
8. HaCohen-Kerner, Y., Miller, D., Yigal, Y.: The influence of preprocessing on text classification using a bag-of-words representation. PloS One 15(5), e0232525 (2020)
9. Jacovi, A., Goldberg, Y.: Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness? In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4198–4205 (2020)
10. Jin, D., Jin, Z., Zhou, J.T., Szolovits, P.: Is bert really robust? a strong baseline for natural language attack on text classification and entailment. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 8018–8025 (2020)
11. Joshi, R., Goel, P., Joshi, R.: Deep learning for Hindi text classification: a comparison. In: Tiwary, U.S., Chaudhury, S. (eds.) IHCI 2019. LNCS, vol. 11886, pp. 94–101. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-44689-5_9
12. Kitada, S., Iyatomi, H.: Attention meets perturbations: robust and interpretable attention with adversarial training. IEEE Access 9, 92974–92985 (2021)
13. Kowsari, K., Meimandi, K.J., Heidarysafa, M., Mendu, S., Barnes, L., Brown, D.: Text classification algorithms: a survey. Information 10(4), 150 (2019)
14. Andrew Kreek, R., Apostolova, E.: Training and prediction data discrepancies: challenges of text classification with noisy, historical data. In: Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-Generated Text, pp. 104–109 (2018)
15. Kulkarni, A., Mandhane, M., Likhitkar, M., Kshirsagar, G., Jagdale, J., Joshi, R.: Experimental evaluation of deep learning models for Marathi text classification. arXiv preprint arXiv:2101.04899 (2021)
16. Kulkarni, A., Mandhane, M., Likhitkar, M., Kshirsagar, G., Joshi, R.: L3CubeMahaSent: a Marathi tweet-based sentiment analysis dataset. In: Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 213–220 (2021)
17. La Malfa, E., Wu, M., Laurenti, L., Wang, B., Hartshorn, A., Kwiatkowska, M.: Assessing robustness of text classification through maximal safe radius computation. arXiv preprint arXiv:2010.02004 (2020)
18. Liu, J., et al.: Robustness testing of language understanding in task-oriented dialog. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2467–2480 (2021)
19. Moradi, M., Samwald, M.: Evaluating the robustness of neural language models to input perturbations. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1558–1570 (2021)
20. Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436 (2015)
21. Paleyes, A., Urma, R.-G., Lawrence, N.D.: Challenges in deploying machine learning: a survey of case studies. arXiv preprint arXiv:2011.09926 (2020)
22. Ribeiro, M.T., Wu, T., Guestrin, C., Singh, S.: Beyond accuracy: behavioral testing of nlp models with checklist. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912 (2020)
23. Singh, R., Joshi, T., Nair, V.N., Sudjianto, A.: Model robustness with text classification: semantic-preserving adversarial attacks. arXiv preprint arXiv:2008.05536 (2020)
24. Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642 (2013)
25. Taware, R., et al.: Shuftext: a simple black box approach to evaluate the fragility of text classification models. arXiv preprint arXiv:2102.00238 (2021)
26. Voorhees, E.M., Harman, D.: Overview of the sixth text retrieval conference (TREC-6). Inf. Process. Manag. 36(1), 3–35 (2000)
27. Wagh, V., Khandve, S., Joshi, I., Wani, A., Kale, G., Joshi, R.: Comparative study of long document classification. arXiv preprint arXiv:2111.00702 (2021)
28. Wani, A., Joshi, I., Khandve, S., Wagh, V., Joshi, R.: Evaluating deep learning approaches for Covid19 fake news detection. In: Chakraborty, T., Shu, K., Bernard, H.R., Liu, H., Akhtar, M.S. (eds.) CONSTRAINT 2021. CCIS, vol. 1402, pp. 153–163. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-73696-5_15
29. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45 (2020)
30. Zhang, Y., Wallace, B.: A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015)
31. Zhou, C., Sun, C., Liu, Z., Lau, F.: A c-lstm neural network for text classification. arXiv preprint arXiv:1511.08630 (2015)
A Survey of Artificial Intelligence Techniques for User Perceptions’ Extraction from Social Media Data
Sarang Shaikh(✉), Sule Yildirim Yayilgan, Erjon Zoto, and Mohamed Abomhara
Department of Information Security and Communication Technology, Norwegian University of Science and Technology (NTNU), Gjøvik, Norway
{sarang.shaikh,sule.yildirim,erjon.zoto,mohamed.abomhara}@ntnu.no
Abstract. Measuring and analyzing user perceptions and behaviors in order to make user-centric decisions has been a topic of research for a long time, even before the invention of social media platforms. In the past, the main approaches for measuring user perceptions were conducting surveys, interviewing experts and collecting data through questionnaires. The main challenge with these methods was that the extracted perceptions represented only a small group of people and not the whole public. This challenge was addressed when social media platforms like Twitter and Facebook were introduced and users started to share their perceptions about any product, topic or event on these platforms. As these platforms became popular, the amount of data shared on them started to grow exponentially, and this growth led to another challenge: analyzing this huge amount of data to understand or measure user perceptions. Computational techniques are used to address this challenge. This paper briefly describes artificial intelligence (AI) techniques, one of the types of computational techniques available for analyzing social media data. Along with brief information about the AI techniques, this paper also presents state-of-the-art studies which use these techniques for measuring user perceptions from social media data.
Keywords: Social data analysis · Social media data · NLP · Machine learning · Deep learning · Transfer learning · User perceptions · Sentiment analysis · Topic modelling · Perception extraction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 627–655, 2022. https://doi.org/10.1007/978-3-031-10464-0_43
1 Introduction
User perceptions about events, topics, policies, etc. have always attracted the attention of policy and decision makers. These perceptions are considered strong evidence for making and adjusting user-centric decisions [1–3]. The traditional method of analyzing user perceptions is usually based on data collection from survey polls and questionnaires. The collected data is then analyzed using traditional qualitative and quantitative methods [4, 5]. However, some researchers have argued that these approaches are more likely to represent the perceptions of a small group of individual users than those of the public [6, 7]. Furthermore, due to the time and cost constraints involved in survey and questionnaire activities, the amount of collected data is very limited, which restricts the overall findings for understanding user perceptions to a large extent [8].
Nowadays, social media platforms like Twitter, Facebook, LinkedIn, etc. offer a new way of understanding and measuring user perceptions. There has been an increase in the adoption and use of social media platforms by the general public as well as by enterprises, industry owners, government officials, scientists, scholars, etc. [9]. The understanding and extraction of user perceptions from social media data has been widely studied in several domains including social science [10], education [11], politics [12], marketing [13], healthcare [14], finance [15] and disaster management [16]. Hence, using social media as a data source for user perception extraction and analysis can overcome the limitations of traditional survey and questionnaire methods [17]. Such data is also helpful in forecasting future user perceptions related to any event, topic or policy [18].
Xuefan et al. in [19] conducted a review on perception extraction and understanding from social media data. According to the review results, Twitter is the most widely used social media platform for perception extraction, followed by Facebook and other platforms. The most frequent keywords of the studies included in the review are social media, twitter, sentiment analysis, public perception, public engagement, opinion mining, NLP, and perceptions. The major techniques used for perception extraction in the reviewed studies are sentiment analysis and topic modelling. Table 1 shows some of the recent studies in which perception extraction techniques are applied to social media data in several domains such as health, business and tourism.
Table 1.
State-of-the-art studies for perception extraction from social media data

| # | Title | Year | Domain | Social media platform |
|---|---|---|---|---|
| 1 | Social media data analysis to predict mental state of users using machine learning techniques [20] | 2021 | Health | Twitter |
| 2 | The application of artificial intelligence and data integration in COVID-19 studies: a scoping review [21] | 2021 | Health, COVID-19 | Twitter |
| 3 | Exploring temporal suicidal behaviour patterns on social media: Insight from Twitter analytics [22] | 2020 | Health | Twitter |
| 4 | Social media insights into US mental health during the COVID-19 pandemic: longitudinal analysis of twitter data [23] | 2020 | Health | Twitter |
| 5 | Analyzing social media, analyzing the social? A methodological discussion about the demoscopic and predictive potential of social media [24] | 2020 | Politics | Twitter |
| 6 | A case study in belief surveillance, sentiment analysis, and identification of informational targets for e-cigarettes interventions [25] | 2019 | Health | Twitter |
| 7 | Realizing social-media-based analytics for smart agriculture [26] | 2019 | Agriculture | Twitter |
| 8 | Social media analytics: Extracting and visualizing Hilton hotel ratings and reviews from TripAdvisor [27] | 2019 | Tourism | TripAdvisor |
| 9 | Leveraging social media to gain insights into service delivery: a study on Airbnb [28] | 2019 | Tourism | AirBnB Reviews |
| 10 | Using Classification Technique for Customer Relationship Management based on Thai Social Media Data [29] | 2019 | Business | Facebook |
| 11, 12 | A new approach of social media analytics to predict service quality: evidence from the airline industry [30] | 2019 | Business | Twitter |
| 13 | Topic modeling and sentiment analysis of global climate change tweets [31] | 2019 | Climate Change | Twitter |
| 14 | Identifying racist social media comments in Sinhala language using text analytics models with machine learning [32] | 2018 | Social Issues | Facebook |
| 15 | The Twitter Bullishness Index: A Social Media Analytics Indicator for the Stock Market [33] | 2016 | Stock Market | Twitter |
| 16 | Analyzing Twitter to explore perceptions of Asian restaurants [34] | 2016 | Tourism | Twitter |
Fig. 1 summarizes the state-of-the-art studies listed in Table 1 for perception extraction from social media data across several domains. As Fig. 1 shows, the largest number of studies was conducted in 2019, with Twitter as the most frequently selected social media platform. The studies span several domains such as tourism, business and health.
The rest of the paper is organized as follows: Sect. 2 briefly describes the different AI techniques, including machine learning, deep learning and natural language processing, for social media data analysis. Section 3 discusses the state-of-the-art studies available for social media analysis using the techniques explained in Sect. 2. Section 4 presents the existing challenges in the state-of-the-art studies. Section 5 concludes the overall work.
Fig. 1. Summary of perception extraction techniques from social media platforms
2 Background
The fast-growing use of social media platforms and their application areas has brought major changes in the ways people interact with each other [35]. In-depth analysis of social media data is sometimes difficult because of its unpredictable nature: the data is dynamic, wide-ranging and scattered. Recent advances in computational techniques such as Artificial Intelligence (AI) have made in-depth analysis of social media data considerably easier. These techniques help in understanding several patterns on social media platforms, such as social media usage, online behaviour, data/content sharing and the perceptions of different types of people about certain topics [36]. Extracting these patterns can benefit organizations, governments and non-profits in designing their services and policies around a user-centric methodology [37]. There have been many attempts in the literature to extract valuable insights from vast social media data for better decision making. Examples of such insights are analyzing opinions of users towards different products, analyzing election results and understanding users’ behaviour [32–34]. This section further discusses the different types of AI techniques used for analyzing social media data.
AI techniques are all about “making machines intelligent” using a variety of approaches. The field of AI has been around for more than six decades and has faced many ups and downs throughout this period. In its early days, AI research showed a lot of promise, but those promises were not fulfilled due to the unavailability of digital data and computational power. The period of the late 1980s and early 1990s is termed the “AI Winter”, in which not much progress was achieved in solving real-world problems using AI. Later, when new techniques were introduced along with the availability of digital data and large computational power, AI started to gain popularity again. Here, digital data mainly refers to data from the online web and social media platforms. The AI techniques for analyzing social media data (both structured and unstructured) can be divided into three major types based on the use case and the ultimate goal to be achieved. Figure 2 shows the three types of AI techniques for analyzing social media data to understand user behaviors, usage patterns, communications and perceptions.
Fig. 2. Types of AI techniques
2.1 Machine Learning (ML)
Machine learning (ML) is a type of AI technique used to automate solutions for complex problems that are very difficult to solve using a general hand-crafted, rule-based approach. This technique does not require any explicit rules/steps to design the solution; instead, it learns a set of rules/steps from data relevant to the real-world problem that needs to be solved. For example, in handwritten character recognition from images, the data can be a collection of images of numerical characters written by different people. In other words, this technique learns patterns and relations from the data, and the larger the data, the better ML learns [38–40]. Since ML uses all the provided data for learning the rules, it is often more accurate than hand-crafted rules, since no human judgment is involved in defining the rules themselves. As ML is a type of automatic learning from data, the learning is categorized into three different methods, shown in Fig. 3. These learning methods enable ML to learn from the available data. The data is generally of two types [41]:
• Labelled Data: The data available with its relevant answers/labels. For example, a collection of raw images of dogs and cats along with labels that specify which images show dogs and which show cats.
• Unlabeled Data: The data available without the relevant answers/labels. For example, a collection of raw images of dogs and cats without any label provided per image.
It is also worth introducing the concepts of training, validation and test data before discussing the learning methods further:
• Training Data: The data used to train an ML model so that the model can learn the patterns and behaviors in the data.
• Validation Data: The data used during the training process to assess the performance of the ML model at each training step.
• Test Data: The new, unseen data used to measure the performance of the final trained ML model and to evaluate its capability to make decisions.
Figure 3 shows the detailed taxonomy of ML and its learning methods. The taxonomy is further discussed below.
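The partition into training, validation and test data described above can be illustrated with a minimal sketch. The function name `split_dataset`, the 70/15/15 proportions and the toy `(text, label)` pairs are assumptions made for illustration only; they are not taken from the surveyed studies.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle samples and partition them into train/validation/test lists."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a reproducible split
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]                 # used to fit the model
    val = shuffled[n_train:n_train + n_val]    # used to monitor training
    test = shuffled[n_train + n_val:]          # held out for the final evaluation
    return train, val, test

# Hypothetical labelled data: (text, label) pairs
labelled = [(f"post {i}", i % 2) for i in range(20)]
train, val, test = split_dataset(labelled)
print(len(train), len(val), len(test))
```

Shuffling before slicing avoids an ordering bias when the data was collected sequentially, and keeping the test portion untouched until the end is what makes it "unseen" in the sense used above.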
Fig. 3. Taxonomy of ML [figure: Machine Learning branches into Supervised Learning (Classification: SVM, Naive Bayes, Decision Tree, Random Forest, KNN, AdaBoost; Regression: Linear Regression, Logistic Regression), Unsupervised Learning (Clustering: K-Means) and Semi-Supervised Learning (Supervised + Unsupervised Learning)]
2.1.1 Supervised Learning
In the supervised learning method, the ML model is given a collection of data specific to a real-world problem along with answers/labels (i.e., labelled data). The provided data is in the form of a mapping Input(x) → Output(y), where x represents the list of data points and y represents the corresponding answers/labels. The ML model has to learn the relationships/rules that map data points to their corresponding labels. For example, a collection of emails labelled as spam or non-spam is labelled data, and an ML model trained on it that can identify any new email as spam or non-spam is an example of supervised learning. Supervised learning methods are further divided into two types: Classification [42] and Regression [43].
2.1.2 Unsupervised Learning
In the unsupervised learning method, the ML model is given a collection of data specific to a real-world problem without any answers/labels (i.e., unlabeled data). The provided data is in the form of Input(x) only, where x represents the list of data points without any labels. The ML model has to identify and differentiate the data points and distribute them into groups based on what it learns from the data. For example, given a collection of shopping patterns from several customers, the unsupervised model learns to group the customers based on their buying patterns. This type of unsupervised learning is commonly called Clustering [44].
2.1.3 Semi-supervised Learning
The semi-supervised learning model is a combination of both supervised and unsupervised learning. It is motivated by the fact that supervised learning requires a lot of labelled data, and labelling requires a lot of time and human effort. Therefore, the semi-supervised model learns from a small amount of labelled data in a supervised manner and then uses unsupervised learning to label the remaining data points [45]. More detailed information regarding ML, its learning methods and models is given in [46].
2.2 Deep Learning (DL)
Deep learning is a sub-type of ML which mimics the way humans gain certain types of information and knowledge. The major difference between ML and DL is the composition of their models: classical ML models are relatively simple (often linear) in structure, whereas DL models are stacked, hierarchical and more complex [47]. Another advantage of deep learning is the automatic learning of important features from the raw data that contribute towards decision making. The machine learning discussed previously depends more often on human intervention to learn. Deep learning is also referred to as “Deep Neural Networks (DNNs)”. Simple neural networks, also referred to as “Artificial Neural Networks (ANNs)”, are structured after and motivated by the human brain, reflecting the way biological neurons work [48]. ANNs consist of hierarchical node layers containing an input layer, one or more hidden layers and an output layer. DNNs are a special kind of ANN with more than one hidden layer. Figure 4 shows the basic architecture of a simple DNN. Like ML, there are several types of DNNs based on the objective/problem to be solved using deep learning. Figure 5 shows the detailed taxonomy of DL/DNNs, which is further discussed below.
Fig. 4. Basic architecture of a simple DNN¹
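The layered structure of a DNN (an input layer, more than one hidden layer and an output layer) can be sketched as a forward pass in plain Python. This is only an illustrative sketch: the layer sizes, the ReLU and softmax activations and the random, untrained weights are assumptions, and no training step is shown.

```python
import math
import random

def relu(values):
    """Hidden-layer activation: negative inputs are clipped to zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Output activation: turns raw scores into probabilities that sum to 1."""
    top = max(values)                          # subtract max for numerical stability
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """One fully connected layer; weights holds one row per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """Apply ReLU after every hidden layer and softmax after the output layer."""
    *hidden, output = layers
    for weights, biases in hidden:
        x = relu(dense(x, weights, biases))
    weights, biases = output
    return softmax(dense(x, weights, biases))

rng = random.Random(0)
sizes = [4, 8, 8, 3]   # hypothetical: 4 inputs, two hidden layers of 8 nodes, 3 classes
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probs = forward([0.2, -1.0, 0.5, 0.3], layers)
print(probs)
```

With two hidden layers the network satisfies the definition of a DNN given above; dropping one entry from `sizes` would reduce it to a simple ANN with a single hidden layer.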
Fig. 5. Taxonomy of DL [figure: Deep Learning branches into Deep Supervised Learning (DenseNet, CNN, RNN, LSTM, BiLSTM), Deep Unsupervised Learning (AutoEncoders, Deep Boltzmann Machine), Deep Semi-Supervised Learning (Active Learning, Weakly-Supervised Learning) and Deep Transfer Learning (Transformers, BERT)]
1 https://www.ibm.com/cloud/learn/neural-networks#toc-what-are-n-2oQ5Vepe.
2.2.1 Deep Supervised Learning
DNNs for supervised learning [49] work in the same way and for the same purpose as supervised learning in ML. These DNNs need labelled data to learn relationships and extract patterns for mapping the input data to its corresponding output labels. As in ML, these DNNs are further divided into two types: classification and regression. The popular algorithms/models for deep supervised learning are Dense Networks (DenseNets) [50], Convolutional Neural Networks (CNNs) [51], Recurrent Neural Networks (RNNs) [52], Long Short-Term Memory (LSTM) Networks [53], BiLSTM Networks [54], etc.
2.2.2 Deep Unsupervised Learning
DNNs for unsupervised learning [55] refer to learning where there is no labelled data. In other words, they work with unlabelled data following the same logic as unsupervised learning in ML. The most common types of these DNNs are AutoEncoders [56] and Deep Boltzmann Machines [57].
2.2.3 Deep Semi-supervised Learning
DNNs have already demonstrated their performance on a large variety of deep supervised and unsupervised learning tasks (e.g., image classification [58]) once trained on extensively large labelled datasets (e.g., ImageNet [59]). However, this creates the bottleneck of building large datasets when working with DNNs, which requires an extensive amount of time, resources and effort. To avoid this bottleneck, DNNs for semi-supervised learning [60] have recently been introduced. The most common types of these algorithms/models are Active Learning [61] and Weakly-Supervised Learning [62].
2.3 Natural Language Processing (NLP)
The NLP field, also referred to as computational linguistics, is the subfield of AI which enables computers to process and understand natural language (i.e., human language) in the same way as humans do. Natural language is usually in the form of free text. The goal of the NLP field is to read, decode, understand and extract valuable, sensible insights and information from human language. One of the aims of this paper is to explain different methods for analyzing social media data. Therefore, it is worth exploring the field of NLP, as most social media data is in the form of free text. The first step in analyzing unstructured text data is to convert it into a structured form that computers can understand. The common pipeline for converting unstructured text data into structured data is given in Fig. 6.
Fig. 6. From unstructured to structured data using NLP [figure: 1. Unstructured Text Data → 2. Text Preprocessing → 3. Text Parsing → 4. Feature Engineering → 5. Structured Data]
1. Unstructured Text Data
Data which does not have any defined data model or structure, and therefore cannot be processed easily by computers, is called unstructured text data. The most common types of such data are online social media data, online blog data, news website data, electronic document data, etc.
2. Text Preprocessing
Unstructured text data often contains a lot of noise and irrelevant information which does not contribute to the extraction of valuable insights and information. NLP offers well-established techniques that can be used to preprocess/clean unstructured text data. The different text preprocessing techniques are lowercase conversion, stemming/lemmatization, spelling correction, removal of URLs, stopwords and punctuation, HTML tag removal, etc. The details of each technique are explained in [63–65]. Most programming languages, such as Python and Java, have NLP libraries for text preprocessing. The well-known NLP libraries for Python are NLTK², CoreNLP³, Gensim⁴, spaCy⁵ and Pattern⁶. The well-known NLP libraries for Java are OpenNLP⁷, Stanford CoreNLP⁸ and Freeling⁹.
3. Text Parsing
Text parsing is an NLP technique for understanding unstructured text data. Understanding involves two types of techniques: Syntactic Analysis and Semantic Analysis. This section only discusses syntactic analysis, because text parsing is a syntactic analysis technique. The term “syntactic” is derived from the word “syntax”. Every language has its own syntax, which defines the grammatical structure of written text. Syntactic analysis/text parsing techniques are used to understand the grammatical structure of human language based on formal grammar rules and meaningfulness [66]. The computer program which performs text parsing is called a “Text Parser”. The most common types of text parsing in NLP are parts of speech (POS) tagging [67], shallow parsing [68], constituency parsing [69] and dependency parsing [63].
4. Feature Engineering
Feature engineering is a technique used in NLP for understanding information from the raw text. This technique is mainly used to convert raw text into numeric form so that it can be further processed and understood by computers while performing any task [70]. As this is the core step in converting text into numbers, preserving the meaning and scope of the text is very important here. Each numeric feature value is a representation of words and their relationships within the raw text. Figure 7 shows an example of feature engineering for raw data. The main categories of features are:
1) Meta features [71]: features such as the number of words in a text, the number of unique words, the number of characters, the average length of a text and the average length of words fall under this category.
2) Text-based features [72]: the common types of text-based features are Bag-of-words [73], TF-IDF [74], N-grams [75] and CountVectorizer [76].
3) Semantic/contextual features [77]: compared to text-based and meta features, semantic features help to capture the contextual meaning of the text. Word2Vec [77] and Doc2Vec [78] were the first features of this type. In NLP terminology, these semantic features are often termed “Word Embeddings”. More recent features of this type are FastText [79], GloVe [80] and ELMo [81].
2 https://www.nltk.org/.
3 https://github.com/stanfordnlp/python-stanford-corenlp.
4 https://pypi.org/project/gensim/.
5 https://spacy.io/.
6 https://analyticsindiamag.com/hands-on-guide-to-pattern-a-python-tool-for-effective-text-processing-and-data-mining/.
7 https://opennlp.apache.org/.
8 https://stanfordnlp.github.io/CoreNLP/.
9 http://nlp.lsi.upc.edu/freeling/node/1.
Fig. 7. Example of feature engineering¹⁰
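The preprocessing and feature engineering stages of the pipeline in Fig. 6 can be sketched with the Python standard library alone. The stopword list, the example posts and all function names are hypothetical and chosen for illustration; the TF-IDF weighting shown (term frequency times log(N / document frequency)) is only one common variant, and the libraries listed earlier typically apply smoothed forms and richer preprocessing.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for", "this", "at"}

def preprocess(text):
    """Lowercase, then strip URLs, HTML tags, punctuation and stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # URL removal
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tag removal
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # punctuation removal
    return [tok for tok in text.split() if tok not in STOPWORDS]

def meta_features(text):
    """Meta features computed on the raw text."""
    words = text.split()
    return {"n_words": len(words), "n_unique_words": len(set(words)), "n_chars": len(text)}

def bag_of_words(tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def tf_idf(docs_tokens, vocab):
    """TF-IDF rows; vocab must be built from docs_tokens so every term has df >= 1."""
    n_docs = len(docs_tokens)
    df = Counter(term for tokens in docs_tokens for term in set(tokens))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    rows = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1                # guard against empty documents
        rows.append([counts[term] / total * idf[term] for term in vocab])
    return rows

posts = [
    "The hotel was GREAT!!! <b>Loved</b> it https://example.com",
    "Great service at the airport",
    "The flight was delayed and the service was poor",
]
docs_tokens = [preprocess(p) for p in posts]
vocab = sorted({tok for tokens in docs_tokens for tok in tokens})
bow_rows = [bag_of_words(tokens, vocab) for tokens in docs_tokens]
tfidf_rows = tf_idf(docs_tokens, vocab)
print(meta_features(posts[1]))
print(vocab)
print(bow_rows[0])
```

Each row of `bow_rows` or `tfidf_rows` is a fixed-length numeric vector over the same vocabulary, which is the structured form that downstream ML and DL models consume.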
10 https://towardsdatascience.com/how-to-turn-text-into-features-478b57632e99.
5. Structured Data
Once feature engineering is completed, the raw text data (i.e., unstructured data) is in the form of numbers and represents structured data.
Next, we discuss how the different techniques available in NLP are used for analyzing social media text to extract useful insights and information. These extracted insights help to understand user perceptions related to the domain. We also discuss how NLP is combined with ML and DL models to extract insights from social media data. Figure 8 shows the detailed taxonomy of widely used NLP techniques for perception extraction from social media data.
Fig. 8. Taxonomy of NLP techniques for perception extraction [figure: the main branches are Subjectivity Analysis, Sentiment Analysis and Keyword/Keyphrase Extraction, with lexicon-based, ML/DL-based, transfer-learning-based, topic-modelling-based, statistical, graph-based and hybrid approaches as sub-branches]
2.3.1 NLP Techniques for Perception Extraction from Social Media Data
I. Sentiment Analysis Techniques
Sentiment analysis (SA), also referred to as Opinion Mining (OM), is a technique to extract and analyze people’s opinions, attitudes, behaviours, perceptions, etc. towards different topics, products and issues being discussed on social media platforms. It is a powerful technique for businesses, industries, governments and other entities to extract and understand users’ moods and views [82]. General sentiment analysis classifies the data as positive, negative or neutral. A more granular type of sentiment analysis is “emotion analysis”, which aims to identify different emotions like anger, fear, sadness, surprise and joy in social media text [83]. On social media platforms, people are free to express their opinions and perceptions on a wide range of topics. Performing sentiment and emotion analysis on those opinions and feedback helps to understand users’ views and perceptions [84]. Next, we discuss the different approaches available in the literature for performing sentiment analysis on social media data. Figure 9 shows the taxonomy of the techniques available for performing sentiment analysis.
Fig. 9. Taxonomy of sentiment analysis techniques [figure: Sentiment Analysis Techniques branch into Lexicon Based (Dictionary Based, Corpus Based), Machine Learning Based (SVM, NB, DT, RF, etc.), Deep Learning Based (CNN, LSTM, Bi-LSTM, etc.) and Transfer Learning Based (use of pre-trained models)]
Lexicon-based approaches are simple dictionary-based approaches in which words and phrases are already labelled with sentiment scores. The overall sentiment score of a text depends on the collective scores of its individual words and phrases. Christopher et al. in [85] compared six different lexicon-based approaches to sentiment analysis. The authors evaluated all six approaches against a manually labelled Amazon reviews dataset. Although the authors achieved good accuracy in the range of 75%–77% using the Hu & Liu lexicon, the problem with this approach is that it is very limited in scope and only applicable to the domain of product reviews.
To overcome the challenges of the lexicon-based approach, several authors used machine learning based approaches for performing sentiment analysis. To apply a machine learning approach, the text data needs to be converted into numeric data according to the steps provided in the section “Natural Language Processing (NLP)”. Ye et al. in [86] applied supervised machine learning algorithms, namely SVM, NB and an N-gram model, to Yahoo reviews of famous travel destinations. The authors achieved an overall accuracy of 87% with the N-gram model. The authors in [87] performed a systematic literature review of different machine learning models applied to sentiment analysis of online review data. The main limitation of supervised machine learning models is the need for training data that is already labelled with sentiment labels. Another limitation of machine learning is the manual/hand-crafted feature engineering required for text data.
To solve the manual feature engineering problem, many researchers have recently applied deep learning approaches to sentiment analysis of social media text. Deep learning algorithms automatically identify and extract important features from the text. Pasupa et al. in [88] performed sentiment analysis using three deep learning models: CNN, LSTM and BiLSTM. The authors observed an overall accuracy of 81% using the CNN deep learning model. The major limitation of deep learning approaches is again the need for a large, labelled training dataset for performing sentiment analysis in specific domains.
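The lexicon-based approach described at the start of this subsection, in which the score of a text is the collective score of its words, can be sketched as follows. The toy lexicon, its score values and the function names are hypothetical; published lexicons contain thousands of entries, and the sketch deliberately ignores negation, intensifiers and multi-word phrases.

```python
import re

# Hypothetical toy lexicon mapping words to sentiment scores
LEXICON = {
    "good": 1, "great": 2, "love": 2, "excellent": 2,
    "bad": -1, "poor": -1, "terrible": -2, "hate": -2,
}

def sentiment_score(text):
    """Sum the lexicon scores of all words; unknown words contribute 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)

def classify(text):
    """Map the summed score to a positive / negative / neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for post in [
    "The service was great and the staff were good",
    "Terrible food, poor service",
    "The flight departed at noon",
]:
    print(post, "->", sentiment_score(post), classify(post))
```

The sketch also makes the domain limitation visible: any word missing from the lexicon contributes nothing, so a lexicon built for product reviews may score text from another domain as neutral.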
II. Topic Modelling
Topic modelling is a widely used NLP technique for understanding text by analyzing and extracting its topics. The process of learning, identifying and extracting the topics from text is known as “topic modelling”. Extracting topics from unstructured text is beneficial for several purposes, such as organizing text on the web by similar topics. News agencies also use topic modelling to recommend articles to readers. On online social media platforms, this technique helps to extract and understand the trends and topics that users are discussing through their posts, tweets, etc. Boon et al. in [89] applied topic modelling to Twitter data in order to understand what users are saying about the COVID-19 pandemic. Figure 10 shows the word cloud generated by the authors, showing the main topics related to COVID-19 extracted using topic modelling on Twitter data. Several approaches have been proposed by the authors in [90–92] for performing topic modelling on unstructured text data. The most widely used approaches for topic modelling in the literature are: 1) Latent Semantic Analysis (LSA) [90], 2) Non-Negative Matrix Factorization (NNMF) [91], 3) Probabilistic Latent Semantic Analysis (PLSA) [92] and 4) Latent Dirichlet Allocation (LDA) [93].
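Of the approaches listed above, non-negative matrix factorization lends itself to a compact illustration. The sketch below factorizes a small document-term count matrix into document-topic and topic-term matrices using the standard multiplicative update rule for the squared error; it is a plain-Python sketch under stated assumptions, not the method of any cited study. The vocabulary, the counts, the number of topics and the iteration count are all hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared reconstruction error between v and the product w * h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize v (docs x terms) into w (docs x topics) and h (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # Update h element-wise: h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # Update w element-wise: w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

# Hypothetical document-term counts (rows: posts, columns: vocabulary terms)
vocab = ["vaccine", "virus", "mask", "flight", "hotel", "booking"]
counts = [
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 3, 2, 2],
    [0, 0, 0, 2, 3, 1],
]
w, h = nmf(counts, n_topics=2)
for k, row in enumerate(h):
    top = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)[:3]
    print(f"topic {k}:", [vocab[j] for j in top])
```

Each row of `h` weights the vocabulary terms for one topic, and each row of `w` states how strongly a document expresses each topic, which is the kind of output that is then visualized as a word cloud such as the one in Fig. 10.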
Fig. 10. Word cloud showing topics extracted from Twitter data on COVID-19 [89]
Although the traditional approaches show promising results and have been used in a variety of studies for topic modelling on social media data [94–97], they suffer from several limitations:
• A limit on the number of topics that can be detected
• Generation of generic topics
• Difficulty in dealing with topics that have complex generalization-specialization relationships
• Low relevance of the generated topics to real-world topics
• Failure to capture topic-inherent properties
To overcome these shortcomings, new techniques for topic modelling in social media data analysis have recently evolved based on advances in machine and deep learning models. These approaches are classified into two types: 1) supervised topic modelling and 2) unsupervised topic modelling [98]. The studies focusing on these techniques are discussed in Sect. 3.
III. Keyword/Keyphrase Extraction
Keyword/keyphrase extraction is an NLP technique for extracting important words/concepts relevant to the underlying document/text [99]. This technique helps in several ways, such as grouping texts based on similar keywords, identifying important concepts/topics discussed in the texts, and searching and filtering documents based on extracted keywords [100]. Generally, this extraction is a two-step process: 1) generating candidate words from the text and 2) selecting and ranking the candidate words based on their impact on the text [101]. Figure 11 shows the common state-of-the-art (SoTA) approaches for this technique.
A Survey of Artificial Intelligence Techniques for User Perceptions’ Extraction
Fig. 11. Keyword/keyphrase extraction SoTA approaches: statistical approach, ML + DL approach, graph-based approach, and hybrid approach
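The two-step process described above (candidate generation followed by ranking) can be sketched as follows. This is only an illustrative example under stated assumptions: the stopword list, the three-document corpus and the function names `candidates` and `rank_keywords` are hypothetical, and TF-IDF is used as one possible ranking score rather than the method of [101].

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "is", "and", "was", "but", "to", "a", "of", "in", "on"}

def candidates(text):
    """Step 1: generate candidate words (lower-cased, stopwords removed)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def rank_keywords(doc_index, corpus, top_k=3):
    """Step 2: rank the candidates of one document by TF-IDF over the corpus."""
    tokenised = [candidates(d) for d in corpus]
    n_docs = len(corpus)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for toks in tokenised for term in set(toks))
    tf = Counter(tokenised[doc_index])
    total = sum(tf.values())
    scores = {t: (c / total) * math.log(n_docs / df[t]) for t, c in tf.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]

# Hypothetical posts used only to exercise the two steps.
corpus = [
    "The vaccine rollout is slow and the vaccine supply is low",
    "The match was slow but the team scored late",
    "The new phone is slow to charge",
]
print(rank_keywords(0, corpus))
```

A word such as "slow", which occurs in every post, receives an inverse document frequency of zero and therefore drops to the bottom of the ranking, while a word repeated within a single post rises to the top.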
Statistical approach: This approach generates candidate keywords from statistics of the text such as word frequency, probability, and other feature engineering techniques [102] like TF-IDF [103], N-grams [104] and word co-occurrences [105]. One popular algorithm using this approach is YAKE [106].
ML + DL approach: In this approach, an ML/DL classifier is generally trained on documents with labelled keywords/keyphrases, where each extracted candidate is labelled as a relevant keyword or not. A traditional keyword extraction system based on this approach is KEA [107], which uses TF-IDF scores along with an NB classifier to predict whether a candidate phrase is a keyword/keyphrase or not.
Graph-based approach: This approach builds a graph of related keywords/keyphrases from the text/document, connecting terms that co-occur in the text. Well-known algorithms using this approach are TextRank [108] and RAKE [109].
Hybrid approach: This approach generates candidate keywords/keyphrases using a combination of two or more of the above approaches. For example, ExpandRank [110] combines TF-IDF with a graph-based approach.
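To make the graph-based approach more concrete, the sketch below builds an undirected co-occurrence graph over a token sequence and scores its nodes with a PageRank-style iteration in the spirit of TextRank [108]. It is a simplified illustration, not the reference implementation: the window size, damping factor, iteration count, function names and sample sentence are assumptions, and part-of-speech filtering and phrase merging are omitted.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Connect every pair of distinct tokens that occur within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                graph[word].add(tokens[j])
                graph[tokens[j]].add(word)
    return graph

def textrank(graph, damping=0.85, iters=50):
    """Iterate S(v) = (1 - d) + d * sum over neighbours u of S(u) / degree(u)."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iters):
        scores = {
            node: (1 - damping)
            + damping * sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            for node in graph
        }
    return scores

# Hypothetical, already-cleaned token sequence.
tokens = "covid vaccine trial covid vaccine rollout delays vaccine supply".split()
scores = textrank(cooccurrence_graph(tokens))
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(score, 3))
```

Words that co-occur with many different neighbours accumulate score from each of them, which is why well-connected terms surface as keyword candidates.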
3 State-of-the-Art Studies of AI Techniques for Analyzing Social Media Data
This section reviews state-of-the-art studies related to the AI techniques explained in Sect. 2. Specifically, it highlights the models, algorithms and techniques used to analyze data from different social media platforms for different application areas using relevant datasets.
3.1 Machine Learning (ML)
Singh et al. in [111] analyzed Twitter data to understand the behavior of spammers distributing pornographic content on the social media platform, using the RF machine learning algorithm. The authors reported an overall accuracy of 91.96% for predicting pornographic content from Twitter data. Vafeiadis et al. in [112] conducted a comparative study of machine learning methods for understanding customer behavior in churn prediction, using a churn dataset from the UCI ML repository. The authors used RF, NB, DT, LR and SVM models, of which SVM performed best with an overall accuracy of 97%. Table 2 shows additional studies where ML techniques are applied to social media data in various domains for different tasks.
Table 2. SoTA ML techniques for social media data analysis
| # | Title | ML model used | Dataset used | Social media platform used | Application area |
|---|---|---|---|---|---|
| 1 | Crowdsourcing and collaborative learning environments based on SM [113] | Gaussian Naïve Bayes | Twitter, Facebook, LinkedIn | Twitter, Facebook, LinkedIn | Business Intelligence |
| 2 | Data analytic learning with strategic decision making [114] | DT | Twitter hashtag, Meme tracker, and Yelp | Twitter | Business Intelligence |
| 3 | Fake profile detection [115] | MRF | Facebook | Facebook | Crime detection |
| 4 | Cyberbullying Detection based on Semantic-Enhanced Marginalized Denoising Auto-Encoder [116] | BoW, SVM, LDA | Twitter, Myspace | Twitter, Myspace | Crime detection |
| 5 | Identifying Epidemics [117] | SVM, NB, and RF | Weibo | – | Epidemics |
| 6 | Detection of influenza epidemics [118] | Linear Regression, Multiple Regression | Twitter | Twitter | Epidemics |
| 7 | Disaster management using SM [119] | GIS model | Satellite images | – | Event detection |
| 8 | Real time crisis mapping of natural disasters using social media [120] | TRIDEC project | Twitter, Google Earth | Twitter | Image analysis |
| 9 | Generating person-specific representations used to boost face recognition performance [121] | SVM, LDA | PubFig83 | – | Image analysis |
| 10 | Improving information diffusion in SM [122] | Independent cascade (IC) model and the linear threshold (LT) model | Douban, AMiner, DBLP, and LiveJournal | – | Recommenders’ systems |
3.2 Deep Learning (DL)
Guimaraes et al. in [123] performed age group classification (i.e. general, teenager and adult) on the Twitter social media platform for tweets related to the medical domain. The authors applied MLP, DCNN, DT, RF and SVM models, of which DCNN achieved the highest F1-score of 0.93. Zin et al. in [124] applied deep learning for clustering/grouping (i.e. DeepLCRank) of holiday photo images from Flickr and YouTube. Table 3 shows additional studies where DL techniques are applied to social media data in various domains for different tasks.
Table 3. SoTA DL techniques for social media data analysis
| Title | DL model used | Dataset used | Social media platform used | Year | Application area |
|---|---|---|---|---|---|
| Deep Learning for Hate Speech Detection in Tweets [125] | FastText + CNN, LSTM | Racist tweets dataset | Twitter | 2017 | Hate Speech |
| Detecting Offensive Language in Tweets Using Deep Learning [126] | RNN | Hate speech tweets | Twitter | 2018 | Hate Speech |
| Multi-layers Convolutional Neural Network for Twitter Sentiment Ordinal Scale Classification [127] | CNN | SemEval challenge dataset (footnote 11) | Twitter | 2018 | Sentiment Analysis |
| Bloom’s Learning Outcomes’ Automatic Classification Using LSTM and Pretrained Word Embeddings [65] | FastText + LSTM | Course learning outcomes dataset | Twitter | 2021 | Bloom’s Taxonomy |
| Evaluating Polarity Trend Amidst the Coronavirus Crisis in Peoples’ Attitudes toward the Vaccination Drive [128] | FastText + LSTM | Covid-19 tweets | Twitter | 2021 | Sentiment Analysis |
| Deep learning-based personality recognition from text posts of online social networks [129] | CNN, RNN | Facebook posts | Facebook | 2018 | Personality Recognition |
| Personality recognition from Facebook text for Portuguese language [130] | LSTM | Facebook posts | Facebook | 2018 | Personality Recognition |
Footnote 11: http://alt.qcri.org/semeval2016/task4/index.php%3fid%3ddata-and-tools.
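The studies in Sects. 3.1 and 3.2 are compared through accuracy and F1-score. As a reminder of how these figures are obtained, the following sketch computes both from predicted and true labels. The labels and predictions are invented for illustration (the class names merely echo the age groups of [123]), and macro-averaging is an assumption, since the averaging scheme used by the cited studies is not stated here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_label(y_true, y_pred, label):
    """F1 = 2PR / (P + R) for one class treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (an assumed averaging scheme)."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)

# Invented labels and predictions, for illustration only.
y_true = ["general", "general", "teenager", "teenager", "adult", "adult"]
y_pred = ["general", "teenager", "teenager", "teenager", "adult", "general"]

print("accuracy:", round(accuracy(y_true, y_pred), 3))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

Accuracy and F1 can diverge noticeably on imbalanced datasets, so figures reported under different metrics in different studies are not directly comparable.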
3.3 Natural Language Processing (NLP)
Shamantha et al. in [131] performed sentiment analysis on Twitter data using NLP techniques with three ML models: NB, SVM and RF. Of these, the RF model performed best, with an overall accuracy of 80% in predicting sentiment from the Twitter data. Tembhurnikar et al. in [132] performed topic modelling on Twitter data using sentiment analysis and n-gram approaches along with the K-means algorithm to understand important topics related to events like land acquisition bills, swine fight model, etc. Table 4 shows additional studies where NLP techniques are combined with different ML/DL algorithms and applied to social media data in various domains for different tasks, including sentiment analysis, topic modelling, intent detection and keyword/keyphrase extraction.
Table 4. SoTA NLP techniques for social media data analysis
| Reference | Type | Feature extraction | Learning algorithm | Domain | Year | Task |
|---|---|---|---|---|---|---|
| Implementation of sentiment classification of movie reviews by supervised machine learning approaches [133] | Supervised | Unigram | NB, RF | Movies | 2019 | Sentiment Analysis |
| Evaluation of deep learning techniques in sentiment analysis from twitter data [134] | Supervised | Word2Vec, GloVe | CNN, LSTM | General | 2019 | Sentiment Analysis |
| Machine learning based aspect level sentiment analysis for Amazon products [135] | Supervised | – | SVM | Twitter | 2020 | Sentiment Analysis |
| Experimental investigation of automated system for twitter sentiment analysis to predict the public emotions using machine learning algorithms [136] | Supervised | – | SVM, DNN | Twitter | 2020 | Sentiment Analysis |
| Effect of Negation in Sentences on Sentiment Analysis and Polarity Detection [137] | Supervised | TF-IDF | NB, SVM, DNN, RNN | Cellphone | 2021 | Sentiment Analysis |
| Deep representation learning for clustering of health tweets [138] | Unsupervised | Contextual Word Embedding | K-Means | Health | 2018 | Topic Modelling |
| Tweets classification with BERT in the field of disaster management [139] | Supervised | Contextual Word Embedding | BERT | Disaster Management | 2018 | Topic Modelling |
| Short text classification with a convolutional neural networks based method [140] | Supervised | Pre-trained Word Embedding | CNN, SVM | Sentiment Analysis | 2018 | Topic Modelling |
| Real-time event detection from the Twitter data stream using the TwitterNews+ Framework [141] | Unsupervised | Word Embedding | Incremental Clustering | Twitter News | 2019 | Topic Modelling |
| Bi-LSTM-CRF sequence labeling for keyphrase extraction from scholarly documents [142] | Unsupervised | – | BiLSTM-CRF | Scientific documents | 2019 | Keyword/Keyphrase Extraction |
| Exploiting topic-based adversarial neural network for cross-domain keyphrase extraction [143] | Unsupervised | – | BiLSTM | Scientific documents | 2018 | Keyword/Keyphrase Extraction |
| Bidirectional LSTM recurrent neural network for keyphrase extraction [144] | Unsupervised | – | BiLSTM | Scientific documents | 2018 | Keyword/Keyphrase Extraction |
| Semi-supervised learning for neural keyphrase generation [145] | Semi-supervised | – | Seq2Seq | Multiple | 2018 | Keyword/Keyphrase Extraction |
| Learning feature representations for keyphrase extraction [146] | Graph based | – | – | News | 2018 | Keyword/Keyphrase Extraction |
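Several of the sentiment analysis studies in Table 4 pair simple lexical features with a classical classifier, for example unigram features with NB in [133]. The sketch below shows what such a pipeline can look like in its most basic form: a multinomial Naive Bayes classifier with add-one smoothing over whitespace-separated unigrams. The training sentences, labels and function names are hypothetical, and the code is not a reproduction of any cited system.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples):
    """Count documents per class and unigram occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        tokens = text.lower().split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_docs):
        total = sum(word_counts[label].values())
        score = math.log(class_docs[label] / n_docs)  # log prior
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled reviews, for illustration only.
train = [
    ("great movie loved it", "pos"),
    ("loved the acting", "pos"),
    ("boring movie hated it", "neg"),
    ("hated the plot", "neg"),
]
model = train_nb(train)
print(predict_nb(model, "loved it"))    # pos
print(predict_nb(model, "hated plot"))  # neg
```

Replacing the unigram counts with TF-IDF weights or word embeddings, and the classifier with SVM or a neural model, gives the other feature/algorithm combinations listed in the table.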
4 Findings from State-of-the-art Studies
This section discusses the overall findings from the state-of-the-art studies that apply ML/DL and NLP techniques to social media data analysis for perception extraction. Figure 12 summarizes the findings of the state-of-the-art studies for NLP techniques like sentiment analysis, topic modelling, intent detection, etc. applied in several domains like Twitter, health, disaster management and news. We can further observe that feature extraction in NLP has come a long way, from simple features like n-grams and TF-IDF to more complex features like word embeddings, ELMo and BERT, which capture the more complex semantics of raw text. Similarly, over the past several years the learning algorithms used in combination with NLP techniques have moved from simple ML algorithms like SVM, RF and NB to more complex deep learning and transformer algorithms like DNN, CNN, LSTM and BERT.
Fig. 12. Findings from SoTA NLP studies
Figure 13 summarizes the state-of-the-art studies for ML/DL algorithms applied in several domains like business analytics, epidemics, recommendation systems, crime detection, etc. From the figure, we can observe a shift in recent years from ML algorithms towards DL algorithms in domains like business intelligence and hate speech on social media platforms. One possible reason is that DL algorithms are more context-aware and capture richer semantics when processing raw text.
Fig. 13. Findings from SoTA ML/DL studies
5 Conclusion
In this paper, we described in detail several artificial intelligence techniques, including machine learning, deep learning and natural language processing, for the purpose of social media data analysis. Along with describing the techniques, we also conducted a state-of-the-art review of studies where these techniques are applied to social media data analysis. The review identified the domains where the techniques are most widely used: health, disaster management, online social networks, news and scientific documents. The review also revealed a shift in learning algorithms from ML to DL in recent studies in 2021. Overall, the review discusses the different types of algorithms applied in various domains to accomplish various types of tasks.
References
1. McGregor, S.C.: Social media as public opinion: how journalists use social media to represent public opinion. Journalism 20(8), 1070–1086 (2019)
2. Sofalvi, A.J., Airhihenbuwa, C.O.: An analysis of the relationship between news coverage of health topics and public opinion of the most important health problems in the United States. J. Health Edu. 23(5), 296–300 (1992)
3. Krugman, H.E.: The impact of television advertising: learning without involvement. Public Opin. Q. 29(3), 349–356 (1965)
4. Tobin, C., Moodie, A.R., Livingstone, C.: A review of public opinion towards alcohol controls in Australia. BMC Pub. Health 11(1), 1–9 (2011)
5. D’Andrea, E., et al.: Monitoring the public opinion about the vaccination topic from tweets analysis. Expert Syst. Appl. 116, 209–226 (2019)
6. Murphy, J., et al.: Social media in public opinion research: executive summary of the AAPOR task force on emerging technologies in public opinion research. Public Opin. Q. 78(4), 788–794 (2014)
7. Fishkin, J.S.: Beyond polling alone: the quest for an informed public. Crit. Rev. 18(1–3), 157–165 (2006)
8. Bian, J., et al.: Mining Twitter to assess the public perception of the “Internet of Things.” PLoS ONE 11(7), e0158450 (2016)
9. Khan, S., et al.: Antecedents of trust in using social media for e-government services: an empirical study in Pakistan. Technol. Soc. 64, 101400 (2021)
10. Huang, M.-H., Whang, T., Xuchuan, L.: The internet, social capital, and civic engagement in Asia. Soc. Indic. Res. 132(2), 559 (2017)
11. Kang, Y., et al.: The public’s opinions on a new school meals policy for childhood obesity prevention in the US: a social media analytics approach. Int. J. Med. Informatics 103, 83–88 (2017)
12. Anstead, N., O’Loughlin, B.: Social media analysis and public opinion: the 2010 UK general election. J. Comput.-Mediat. Commun. 20(2), 204–220 (2015)
13. Chen, S.-C., Lin, C.-P.: Understanding the effect of social media marketing activities: the mediation of social identification, perceived value, and satisfaction. Technol. Forecast. Soc. Chang. 140, 22–32 (2019)
14. Mollema, L., et al.: Disease detection or public opinion reflection? Content analysis of tweets, other social media, and online newspapers during the measles outbreak in The Netherlands in 2013. J. Med. Internet Res. 17(5), e3863 (2015)
15. Xue, Y., Xu, L., Qiu, B., Wang, L., Zhang, G.: Relationship discovery in public opinion and actual behavior for social media stock data space. EURASIP J. Wirel. Commun. Netw. 2016(1), 1–13 (2016). https://doi.org/10.1186/s13638-016-0684-3
16. Pourebrahim, N., et al.: Understanding communication dynamics on Twitter during natural disasters: a case study of Hurricane Sandy. Int. J. Disaster Risk Reduct. 37, 101176 (2019)
17. Adams-Cohen, N.J.: Policy change and public opinion: measuring shifting political sentiment with social media data. Am. Politics Res. 48(5), 612–621 (2020)
18. Salleh, S.M.: From survey to social media: public opinion and politics in the age of big data. Adv. Sci. Lett. 23(11), 10696–10700 (2017)
19. Dong, X., Lian, Y.: A review of social media-based public opinion analyses: challenges and recommendations. Technol. Soc. 67, 101724 (2021)
20. Lokeshkumar, R., Mishra, O.A., Kalra, S.: Social media data analysis to predict mental state of users using machine learning techniques. J. Educ. Health Promot. 10(1), 301 (2021)
21. Guo, Y., et al.: The application of artificial intelligence and data integration in COVID-19 studies: a scoping review. J. Am. Med. Inf. Assoc. 28, 2050–2067 (2021)
22. Luo, J., et al.: Exploring temporal suicidal behavior patterns on social media: Insight from Twitter analytics. Health Informatics J. 26(2), 738–752 (2020)
23. Valdez, D., et al.: Social media insights into US mental health during the COVID-19 pandemic: longitudinal analysis of twitter data. J. Med. Internet Res. 22(12), e21418 (2020)
24. Santander, P., Alfaro, R., Allende-Cid, H., Elórtegui, C., González, C.: Analyzing social media, analyzing the social? A methodological discussion about the demoscopic and predictive potential of social media. Qual. Quant. 54(3), 903–923 (2020). https://doi.org/10.1007/s11135-020-00965-z
25. Martinez, L.S., Tsou, M.-H., Spitzberg, B.H.: A case study in belief surveillance, sentiment analysis, and identification of informational targets for e-cigarettes interventions. In: Proceedings of the 10th International Conference on Social Media and Society (2019)
26. Saravanan, M., Perepu, S.K.: Realizing social-media-based analytics for smart agriculture. Rev. Socio-Netw. Strat. 13(1), 33–53 (2019)
27. Chang, Y.-C., Ku, C.-H., Chen, C.-H.: Social media analytics: extracting and visualizing Hilton hotel ratings and reviews from TripAdvisor. Int. J. Inf. Manage. 48, 263–279 (2019)
28. von Hoffen, M., Hagge, M., Betzing, J.H., Chasin, F.: Leveraging social media to gain insights into service delivery: a study on Airbnb. IseB 16(2), 247–269 (2017). https://doi.org/10.1007/s10257-017-0358-7
29. Chumwatana, T., Wongkolkitsilp, K.: Using classification technique for customer relationship management based on Thai social media data. In: Proceedings of the 2019 11th International Conference on Computer and Automation Engineering (2019)
30. Tian, X., et al.: A new approach of social media analytics to predict service quality: evidence from the airline industry. J. Enter. Inf. Manage. (2019)
31. Dahal, B., Kumar, S.A.P., Li, Z.: Topic modeling and sentiment analysis of global climate change tweets. Soc. Netw. Anal. Min. 9(1), 1–20 (2019). https://doi.org/10.1007/s13278-019-0568-8
32. Dias, D.S., Welikala, M.D., Dias, N.G.: Identifying racist social media comments in Sinhala language using text analytics models with machine learning. In: 2018 18th International Conference on Advances in ICT for Emerging Regions (ICTer), IEEE (2018)
33. Barrelet, C.J., Kuzulugil, S.S., Bener, A.B.: The Twitter bullishness index: a social media analytics indicator for the stock market. In: Proceedings of the 20th International Database Engineering & Applications Symposium (2016)
34. Park, S.B., Jang, J., Ok, C.M.: Analyzing Twitter to explore perceptions of Asian restaurants. J. Hosp. Tour. Technol. 7, 405–422 (2016)
35. Pushpam, C.A., Jayanthi, J.G.: Overview on data mining in social media. Int. J. Comput. Sci. Eng. 5(11), 147–157 (2017)
36. Camacho, D., Luzón, M.V., Cambria, E.: New trends and applications in social media analytics. Future Gen. Comput. Syst. 114, 318–321 (2021)
37. Balan, S., Rege, J.: Mining for social media: usage patterns of small businesses. Bus. Syst. Res. Int. J. Soc. Adv. Innov. Res. Econ. 8(1), 43–50 (2017)
38. Quinlan, J.R.: C4.5: Programs for Machine Learning. Elsevier (2014)
39. Whelan, E., Islam, A.N., Brooks, S.: Applying the SOBC paradigm to explain how social media overload affects academic performance. Comput. Educ. 143, 103692 (2020)
40. Alpaydin, E.: Introduction to Machine Learning. MIT Press (2020)
41. Rebala, G., Ravi, A., Churiwala, S.: Learning Models: An Introduction to Machine Learning. Springer, Cham, pp. 19–23 (2019)
42. Rebala, G., Ravi, A., Churiwala, S.: Classification: An Introduction to Machine Learning. Springer, Cham, pp. 57–66 (2019)
43. Rebala, G., Ravi, A., Churiwala, S.: Regressions: An Introduction to Machine Learning. Springer, Cham, pp. 25–40 (2019)
44. Rebala, G., Ravi, A., Churiwala, S.: Clustering: An Introduction to Machine Learning. Springer, Cham, pp. 67–76 (2019)
45. Zhu, X., Goldberg, A.B.: Introduction to semi-supervised learning. Syn. Lect. Artif. Intell. Mach. Learn. 3(1), 1–130 (2009)
46. Balaji, T.K., Annavarapu, C.S.R., Bablani, A.: Machine learning algorithms for social media analysis: a survey. Comput. Sci. Rev. 40, 100395 (2021)
47. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
48. Yegnanarayana, B.: Artificial Neural Networks. PHI Learning (2009)
49. Makantasis, K., et al.: Deep supervised learning for hyperspectral data classification through convolutional neural networks. In: 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), IEEE (2015)
50. Mele, A.: A structural model of dense network formation. Econometrica 85(3), 825–850 (2017)
51. Albawi, S., Mohammed, T.A., Al-Zawi, S.: Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET), IEEE (2017)
52. Merrill, W., et al.: A formal hierarchy of RNN architectures. arXiv preprint arXiv:2004.08500 (2020)
53. Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: Thirteenth Annual Conference of the International Speech Communication Association (2012)
54. Xu, G., et al.: Sentiment analysis of comment texts based on BiLSTM. IEEE Access 7, 51522–51532 (2019)
55. Valpola, H.: From neural PCA to deep unsupervised learning. In: Advances in independent component analysis and learning machines, pp. 143–171. Elsevier (2015)
56. Lange, S., Riedmiller, M.: Deep auto-encoder neural networks in reinforcement learning. In: The 2010 International Joint Conference on Neural Networks (IJCNN), IEEE (2010)
57. Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: Artificial intelligence and statistics, PMLR (2009)
58. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Adv. Neural. Inf. Process. Syst. 25, 1097–1105 (2012)
59. Deng, J., et al.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE (2009)
60. Ouali, Y., Hudelot, C., Tami, M.: An overview of deep semi-supervised learning. arXiv preprint arXiv:2006.05278 (2020)
61. Settles, B.: Active Learning Literature Survey. University of Wisconsin-Madison (2009)
62. Ratner, A., et al.: Weak supervision: the new programming paradigm for machine learning. Hazy Research, pp. 5–9. https://dawn.cs.stanford.edu//2017/07/16/weak-supervision/. Accessed 2019
63. Shaikh, S., Doudpotta, S.M.: Aspects based opinion mining for teacher and course evaluation. Sukkur IBA J. Comput. Math. Sci. 3(1), 34–43 (2019)
64. Shaikh, S., et al.: Towards improved classification accuracy on highly imbalanced text dataset using deep neural language models. Appl. Sci. 11(2), 869 (2021)
65. Shaikh, S., Daudpotta, S.M., Imran, A.S.: Bloom’s learning outcomes’ automatic classification using LSTM and pretrained word embeddings. IEEE Access 9, 117887–117909 (2021)
66. Granaas, M.M.: Simple, applied text parsing. Behav. Res. Methods Instrum. Comput. 17(2), 209–216 (1985)
67. Kumawat, D., Jain, V.: POS tagging approaches: a comparison. Int. J. Comput. Appl. 118(6) (2015)
68. Munoz, M., et al.: A learning approach to shallow parsing. arXiv preprint cs/0008022 (2000)
69. Rodríguez, C.G.: Parsing Schemata for Practical Text Analysis. Citeseer (2011)
70. Guyon, I., Elisseeff, A.: An introduction to feature extraction. In: Feature Extraction, pp. 1–25. Springer (2006)
71. Chen, W., et al.: Exploiting meta features for dependency parsing and part-of-speech tagging. Artif. Intell. 230, 173–191 (2016)
72. Blackstock, A., Spitz, M.: Classifying Movie Scripts by Genre With a MEMM Using NLP-Based Features. Citeseer (2008)
73. Zhang, Y., Jin, R., Zhou, Z.-H.: Understanding bag-of-words model: a statistical framework. Int. J. Mach. Learn. Cybern. 1(1–4), 43–52 (2010)
74. Shi, C.-Y., Xu, C.-J., Yang, X.-J.: Study of TFIDF algorithm. J. Comput. Appl. 29(6), 167–170 (2009)
75. Pizarro, J.: Using N-grams to detect Bots on Twitter. In: CLEF (Working Notes) (2019)
76. Kulkarni, A., Shivananda, A.: Converting text to features. In: Natural Language Processing Recipes, pp. 63–106. Springer (2021)
77. Church, K.W.: Word2Vec. Nat. Lang. Eng. 23(1), 155–162 (2017)
78. Lau, J.H., Baldwin, T.: An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368 (2016)
79. Joulin, A., et al.: Fasttext.zip: compressing text classification models. arXiv preprint arXiv:1612.03651 (2016)
80. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
81. Zhu, H., Paschalidis, I.C., Tahmasebi, A.: Clinical concept extraction with contextual word embedding. arXiv preprint arXiv:1810.10566 (2018)
82. Birjali, M., Kasri, M., Beni-Hssane, A.: A comprehensive survey on sentiment analysis: approaches, challenges and trends. Knowledge-Based Syst., 107134 (2021)
83. Nandwani, P., Verma, R.: A review on sentiment analysis and emotion detection from text. Soc. Netw. Anal. Min. 11(1), 1–19 (2021). https://doi.org/10.1007/s13278-021-00776-6
84. Ahmad, Z., et al.: Borrow from rich cousin: transfer learning for emotion detection using cross lingual embedding. Expert Syst. Appl. 139, 112851 (2020)
85. Khoo, C.S., Johnkhan, S.B.: Lexicon-based sentiment analysis: comparative evaluation of six sentiment lexicons. J. Inf. Sci. 44(4), 491–511 (2018)
86. Ye, Q., Zhang, Z., Law, R.: Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Expert Syst. Appl. 36(3), 6527–6535 (2009)
87. Jain, P.K., Pamula, R., Srivastava, G.: A systematic literature review on machine learning applications for consumer sentiment analysis using online reviews. Comput. Sci. Rev. 41, 100413 (2021)
88. Pasupa, K., Ayutthaya, T.S.N.: Thai sentiment analysis with deep learning techniques: a comparative study based on word embedding, POS-tag, and sentic features. Sustain. Cities Soc. 50, 101615 (2019)
89. Boon-Itt, S., Skunkan, Y.: Public perception of the COVID-19 pandemic on Twitter: sentiment analysis and topic modeling study. JMIR Public Health Surveill. 6(4), e21978 (2020)
90. Deerwester, S., et al.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41(6), 391–407 (1990)
91. Paatero, P., Tapper, U.: Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2), 111–126 (1994)
92. Cooper, G.F., Moral, S.: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann (1998)
93. De Finetti, B.: Theory of Probability: A Critical Introductory Treatment, vol. 6, Wiley (2017)
94. Nguyen, D.Q., et al.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Ling. 3, 299–313 (2015)
95. Yin, H., et al.: A unified model for stable and temporal topic detection from social media data. In: 2013 IEEE 29th International Conference on Data Engineering (ICDE), IEEE (2013)
96. Xie, W., et al.: Topicsketch: Real-time bursty topic detection from Twitter. IEEE Trans. Knowl. Data Eng. 28(8), 2216–2229 (2016)
97. Cataldi, M., Di Caro, L., Schifanella, C.: Emerging topic detection on twitter based on temporal and social terms evaluation. In: Proceedings of the Tenth International Workshop on Multimedia Data Mining (2010)
98. Mottaghinia, Z., et al.: A review of approaches for topic detection in Twitter. J. Exp. Theor. Artif. Intell. 33(5), 747–773 (2021)
99. Siddiqi, S., Sharan, A.: Keyword and keyphrase extraction techniques: a literature review. Int. J. Comput. Appl. 109(2) (2015)
100. Lahiri, S., Choudhury, S.R., Caragea, C.: Keyword and keyphrase extraction using centrality measures on collocation networks. arXiv preprint arXiv:1401.6571 (2014)
101. Zhao, D., et al.: Keyword extraction for social media short text. In: 2017 14th Web Information Systems and Applications Conference (WISA), IEEE (2017)
102. Luhn, H.P.: A statistical approach to mechanized encoding and searching of literary information. IBM J. Res. Dev. 1(4), 309–317 (1957)
103. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
104. Cohen, J.D.: Highlights: language-and domain-independent automatic indexing terms for abstracting. J. Am. Soc. Inf. Sci. 46(3), 162–174 (1995)
105. Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. Int. J. Artif. Intell. Tools 13(01), 157–169 (2004)
106. Campos, R., et al.: Yake! collection-independent automatic keyword extractor. In: European Conference on Information Retrieval, Springer (2018)
107. Witten, I.H., et al.: Kea: practical automated keyphrase extraction. In: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, IGI global, pp. 129–152 (2005)
108. Mihalcea, R., Tarau, P.: Textrank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (2004)
109. Mothe, J., Ramiandrisoa, F., Rasolomanana, M.: Automatic keyphrase extraction using graph-based methods. In: Proceedings of the 33rd Annual ACM Symposium on Applied Computing (2018)
110. Wan, X., Xiao, J.: Single document keyphrase extraction using neighborhood knowledge. In: AAAI (2008)
111. Singh, M., Bansal, D., Sofat, S.: Behavioral analysis and classification of spammers distributing pornographic content in social media. Soc. Netw. Anal. Min. 6(1), 1–18 (2016). https://doi.org/10.1007/s13278-016-0350-0
112. Vafeiadis, T., et al.: A comparison of machine learning techniques for customer churn prediction. Simul. Model. Pract. Theory 55, 1–9 (2015)
113. Stantchev, V., Prieto-González, L., Tamm, G.: Cloud computing service for knowledge assessment and studies recommendation in crowdsourcing and collaborative learning environments based on social network analysis. Comput. Hum. Behav. 51, 762–770 (2015)
114. Chen, Y., et al.: Decision learning: Data analytic learning with strategic decision making. IEEE Signal Process. Mag. 33(1), 37–56 (2015)
115. Ramalingam, D., Chinnaiah, V.: Fake profile detection techniques in large-scale online social networks: a comprehensive review. Comput. Electr. Eng. 65, 165–177 (2018)
116. Zhao, R., Mao, K.: Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder. IEEE Trans. Affect. Comput. 8(3), 328–339 (2016)
117. Li, L., et al.: Characterizing the propagation of situational information in social media during covid-19 epidemic: a case study on Weibo. IEEE Trans. Comput. Soc. Syst. 7(2), 556–562 (2020)
118. Culotta, A.: Towards detecting influenza epidemics by analyzing Twitter messages. In: Proceedings of the First Workshop on Social Media Analytics (2010)
119. Xu, Z.-X., Liu, Y., Zhang, J.: A pair of novel 4-connected homochiral coordination polymers based on proline-tetrazole ligand. Inorg. Chem. Commun. 67, 44–46 (2016)
120. Middleton, S.E., Middleton, L., Modafferi, S.: Real-time crisis mapping of natural disasters using social media. IEEE Intell. Syst. 29(2), 9–17 (2013)
121. Chiachia, G., et al.: Learning person-specific representations from faces in the wild. IEEE Trans. Inf. Forensics Secur. 9(12), 2089–2099 (2014)
122. Wang, Z., et al.: Activity maximization by effective information diffusion in social networks. IEEE Trans. Knowl. Data Eng. 29(11), 2374–2387 (2017)
123. Guimaraes, R.G., et al.: Age groups classification in social network using deep learning. IEEE Access 5, 10805–10816 (2017)
124. Zin, T.T., Tin, P., Hama, H.: Deep learning model for integration of clustering with ranking in social networks. In: International Conference on Genetic and Evolutionary Computing, Springer (2016)
125. Badjatiya, P., et al.: Deep learning for hate speech detection in tweets. In: Proceedings of the 26th International Conference on World Wide Web companion (2017)
126. Pitsilis, G.K., Ramampiaro, H., Langseth, H.: Detecting offensive language in tweets using deep learning. arXiv preprint arXiv:1801.04433 (2018)
127. Alali, M., et al.: Multi-layers convolutional neural network for twitter sentiment ordinal scale classification. In: International Conference on Soft Computing and Data Mining, Springer (2018)
128. Batra, R., et al.: Evaluating polarity trend amidst the coronavirus crisis in peoples’ attitudes toward the vaccination drive. Sustainability 13(10), 5344 (2021)
129. Xue, D., et al.: Deep learning-based personality recognition from text posts of online social networks. Appl. Intell. 48(11), 4232–4246 (2018). https://doi.org/10.1007/s10489-018-1212-4
130. da Silva, B.B.C., Paraboni, I.: Personality recognition from Facebook text. In: International Conference on Computational Processing of the Portuguese Language, Springer (2018)
131. Shamantha, R.B., Shetty, S.M., Rai, P.: Sentiment analysis using machine learning classifiers: evaluation of performance. In: 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS), IEEE (2019)
132. Tembhurnikar, S.D., Patil, N.N.: Topic detection using BNgram method and sentiment analysis on Twitter dataset. In: 2015 4th International Conference on Reliability, Infocom Technologies and Optimization (ICRITO) (Trends and Future Directions), IEEE (2015)
133. Untawale, T.M., Choudhari, G.: Implementation of sentiment classification of movie reviews by supervised machine learning approaches. In: 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), IEEE (2019)
134. Goularas, D., Kamis, S.: Evaluation of deep learning techniques in sentiment analysis from Twitter data. In: 2019 International Conference on Deep Learning and Machine Learning in Emerging Applications (Deep-ML), IEEE (2019)
135. Nandal, N., Tanwar, R., Pruthi, J.: Machine learning based aspect level sentiment analysis for Amazon products. Spat. Inf. Res. 28(5), 601–607 (2020). https://doi.org/10.1007/s41324-020-00320-2
136. Sharma, P., Sharma, A.: Experimental investigation of automated system for twitter sentiment analysis to predict the public emotions using machine learning algorithms. In: Materials Today: Proceedings (2020)
A Survey of Artificial Intelligence Techniques for User Perceptions’ Extraction
655
137. Mukherjee, P., et al.: Effect of negation in sentences on sentiment analysis and polarity detection. Procedia Comput. Sci. 185, 370–379 (2021) 138. Gencoglu, O.: Deep representation learning for clustering of health tweets. arXiv preprint arXiv:1901.00439 (2018) 139. Ma, G.: Tweets Classification with BERT in the Field of Disaster Management. Stanford University, Stanford, CA, 15785631 (2019) 140. Hu, Y., et al. Short text classification with a convolutional neural networks based method. In: 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV), IEEE (2018) 141. Hasan, M., Orgun, M.A., Schwitter, R.: Real-time event detection from the Twitter data stream using the TwitterNews+ Framework. Inf. Process. Manage. 56(3), 1146–1165 (2019) 142. Alzaidy, R., Caragea, C., Giles, C.L.: Bi-LSTM-CRF sequence labeling for keyphrase extraction from scholarly documents. In: The World Wide Web Conference (2019) 143. Wang, Y., et al.: Exploiting topic-based adversarial neural network for cross-domain keyphrase extraction. In: 2018 IEEE International Conference on Data Mining (ICDM), IEEE (2018) 144. Basaldella, M., et al. Bidirectional LSTM recurrent neural network for keyphrase extraction. In: Italian Research Conference on Digital Libraries, Springer (2018) 145. Ye, H., Wang, L.: Semi-supervised learning for neural keyphrase generation. arXiv preprint arXiv:1808.06773 (2018) 146. Florescu, C., Jin, W.: Learning feature representations for key phrase extraction. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)
Social Media Self-expression as Form of Coping During the 2020 Pandemic Lockdown

Macrina P. Lazo and Christine Diane Ramos

College of Computer Studies, De La Salle University, Taft Avenue, Metro Manila, Philippines
[email protected]
Abstract. Using the transactional theory of stress and coping, this study examines how the themes of social media self-expression that emerged during the 2020 pandemic lockdown influence the coping process and the coping outcome. An online survey was conducted to determine the themes of social media self-expression used by the respondents during the lockdown, how the respondents appraised the lockdown, and their level of subjective well-being during the lockdown. Among the six themes of social media self-expression included in the study, only two emerged as significantly associated with both appraisal and subjective well-being. Expression of faith and religion emerged as inversely associated with appraisal of the lockdown as harmful and directly associated with subjective well-being. On the other hand, expression of feelings and emotions emerged as directly associated with appraisal of the lockdown as a threat and inversely associated with subjective well-being. Coping expression, appraisal, and coping outcome were found to differ by age and gender. The results suggest that different themes of social media self-expression affect the coping process differently and that demographics may act as a differentiating factor in the coping process. This has implications for using social media postings as a basis for designing intervention programs in situations that are uncertain and stressful.

Keywords: Coping · Social media self-expression · Lockdown · Subjective well-being
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 656–671, 2022. https://doi.org/10.1007/978-3-031-10464-0_44

1 Introduction

On March 16, 2020, the Philippine government placed Luzon island, the largest island in the Philippines, under pandemic lockdown from March 17, 2020 to April 30, 2020. This was further extended until May 31, 2020, in high-risk provinces. During the pandemic lockdown, all households were under strict home quarantine, classes at all levels were suspended, mass gatherings were prohibited, mass public transport was suspended, and only establishments providing basic needs were allowed to open [1]. The pandemic lockdown was an event that had never been experienced before, and the uncertainty it caused was considered stressful [2–4].

As the 2020 pandemic lockdown limited face-to-face interaction, social media became one of the predominant tools for social interaction [3, 5–7]. Initial studies indicate that social media usage was a popular stress-reliever during the pandemic lockdown [8–10]. However, studies investigating the relationship between social media usage and subjective well-being during the lockdown remain sparse. To fill this gap, this study investigates the effect of social media on subjective well-being as mediated by the coping process. Specifically, it aims to understand how different forms of social media coping self-expression during the 2020 pandemic lockdown in the Philippines influenced the appraisal of the lockdown as harmful, a threat, or a challenge, and how the type of appraisal affected subjective well-being.

Using the transactional model of stress and coping (TMSC) of Lazarus and Folkman as a theoretical lens, this study investigates the inter-relationship of coping expressions through social media, appraisal of the pandemic lockdown, and subjective well-being. Personal variables such as age, gender, health status, and employment status were explored as potential explanatory variables to account for differences in the inter-relationship of the factors involved in the coping process.

This study contributes to the literature on coping, social media, and subjective well-being. First, it corroborates the inverse relationship between appraisal of an event as harmful and subjective well-being, and the positive relationship between appraisal of an event as a challenge and subjective well-being. Second, it reveals that different themes of social media self-expression have different impacts on subjective well-being. Expression of faith and religion was found to have a direct correlation with subjective well-being as mediated by lower appraisal of the pandemic lockdown as harmful. On the other hand, expression of emotions was found to have an inverse correlation with subjective well-being as mediated by the appraisal of the pandemic as a threat. Third, it provides useful inputs to social health workers in designing social media-enabled intervention programs during stressful events.
2 Literature Review

2.1 Stress and Coping

The Transactional Model of Stress and Coping (TMSC) of Lazarus and Folkman defines coping as cognitive or behavioral efforts intended to manage demands that are strenuous or beyond the resources of the person and are therefore considered a stressor [4]. Lazarus and Folkman consider coping as serving two functions: management of the source of stress, referred to as problem-focused coping, and management of stressful emotions, referred to as emotion-focused coping. Although these two functions differ in focus, they do not occur in isolation. Studies by Lazarus have shown that the coping process involves both problem-focused and emotion-focused coping, in proportions that vary with the degree of interaction between the person and the environment [12]. Lazarus and Folkman consider coping a process that starts with two levels of appraisal: primary appraisal, which considers the relevance of the event to the person’s well-being (whether it is harmful, a threat, or a challenge), and secondary appraisal, which considers the options and resources that are available for coping. The result of the person’s appraisal determines the coping mechanism that will be implemented, the outcome of which leads to a re-appraisal, which in turn results in new coping efforts [11]. The stress and coping theory of Lazarus and Folkman has been used extensively in studies on coping and applied to various fields since its formulation in the 1960s [13].

One of the major criticisms of the TMSC theory is its focus on the subjective assessment of stress while understating the importance of objective external factors which may minimize or aggravate stress. Lazarus defends his position by arguing that there is empirical proof that subjective assessments are better predictors of coping outcomes than objective measures. He further argues that it is impossible to create a purely objective measure of stress without any influence of the person’s selective attention, interpretation, or perception [14].

The Conservation of Resources (COR) Theory of Hobfoll provides an alternative framework for studying stress and coping. In contrast to the stress and coping model of Lazarus and Folkman, which focuses on the person’s appraisal, COR theory treats resources as the main driver in the assessment of a stressor and the determinant of coping strategies in the face of a stressor. Hobfoll defines resources as factors that are valued in themselves or that are instrumental to the achievement or protection of resources that are considered valuable [15]. According to the COR theory, stress occurs when there is a threat to the availability, stability, or protection of resources. Efforts to conserve these resources lead either to a spiral of loss, when resources are exhausted such that further threat of loss cannot be prevented, or to a spiral of gain, when successful protection of resources leads to renewal of resources [15]. One criticism of the COR theory is that the evaluation of loss can be a product of negative personality traits such as neuroticism and introversion. Although Hobfoll attempted to defend his theory against this criticism, he nevertheless accepted that neuroticism and introversion indeed lower the ability of the person to recover from losses [15].
2.2 Social Media, Coping and Well-Being

Despite the abundance of studies on coping strategies for stress, few studies focus on the use of social media to cope with stress [16–19]. A study published in 2017 on the preferred coping device among undergraduate students showed that media-mediated coping is among the top four coping strategies in terms of perceived effectiveness in reducing stress [16]. Media in that study covers a variety of media-based entertainment, of which social media is only one; the others include surfing the internet, playing video games, listening to music, watching TV, and reading books or magazines. The use of media as a tool for coping is based on its ability to provide a diversion from stressors [16]. It was found to be about as effective in managing stress as exercise, social support, and calming behaviors. A study published in 2020 evaluated the effectiveness of social media usage in reducing stress after exposure to stressors among undergraduate college students. The results showed that the use of social media after exposure to stressors resulted in a significant decrease in the physiological manifestation of stress compared to the control group, who were asked to read after exposure to a stressor [17]. The study concluded that social media usage can be an effective emotion-oriented tool for coping with stress. The limited number of studies on the effectiveness of social media usage as a tool for coping with stress leaves room for further studies on the different aspects of social media usage, especially since the use of social media has become ubiquitous [16, 17, 19–21].
Studies linking social media usage to well-being have shown promising results. There are different schools of thought on the definition and classification of well-being. Martin Seligman defines well-being in terms of five elements, namely positive emotion, positive relationships, meaning, engagement, and achievement [22]. Seligman’s theory of well-being posits that well-being can be defined, measured, and taught. Diener further categorized well-being into subjective well-being, which is measured in terms of positive or negative feelings and overall life satisfaction, and eudemonic well-being, which refers to optimal functioning or flourishing [22]. A two-panel study conducted in Hong Kong in 2016 to understand how social media usage contributes to psychological well-being indicated that self-disclosure through social media is positively correlated with psychological well-being [23]. Self-disclosure refers to activities related to revealing oneself on social media, such as posting feelings and emotions, personal profile, activities, interests, location, and the like [23]. While self-disclosure in social media affects psychological well-being directly, its strongest impact on well-being is mediated by its effect on building social capital. Maintaining and expanding connections and social interactions online were found to increase bridging social capital [24]. On the other hand, positive feedback on profiles and posts coming from contacts who share the same values and beliefs was found to increase bonding social capital. As social capital increases, psychological well-being increases [23]. The authenticity of expression in social media was found to be directly related to well-being regardless of personality type [25]. Likewise, the positive effect of social media usage on connectedness was observed to improve subjective well-being [24].
Social sharing is another area where the potential impact of social media usage on well-being has gained interest among researchers. Social sharing is defined as consisting of activities related to the disclosure of significant and meaningful experiences to others, as prompted by the need to tackle emotions that are brought about by an event or an experience [26]. While studies on social sharing largely refer to face-to-face sharing, the existing work on social sharing through social media suggests that the concept applies there as well. A study on social sharing across different media shows that social sharing through social media (through Facebook in particular) resulted in capitalization even when the interaction was very brief, such as the use of emoticons, a like, or a short phrase, and even when the interaction was with a loose contact [20]. However, the study further showed that social media is used primarily to share mundane activities, while a phone call is still preferred when sharing deep and personal experiences and emotions.
3 Theoretical Framework

This study aims to analyze the impact of social media self-expression during the 2020 pandemic lockdown by using the Transactional Model of Stress and Coping (TMSC) theory of Lazarus and Folkman as the basis for the theoretical framework (refer to Fig. 1).

[Figure: Coping Antecedent (Stressor: 2020 Pandemic Lockdown; Demographics; Health Condition; Employment Status); Mediating Process (Social Media Coping Expression; Appraisal); Coping Outcome (Subjective Well-being)]
Fig. 1. Theoretical framework: social media self-expression as form of coping during the 2020 pandemic lockdown

3.1 Stressor

Lazarus defines a stressor as a situation that is assessed by an individual as strenuous or beyond one’s resources and consequently affecting one’s well-being (Folkman & Lazarus, 1980). Early studies found that COVID-19 caused stress across segments of society due to the life disruptions it brought about [3, 5–7]. Among students, the stress was caused by the cancellation of face-to-face classes, abrupt instructions to leave the campus, and the forced order to stay at home [3]. Early studies among adults have shown that COVID-19 resulted in higher stress levels due to uncertainties brought about by the lockdown, health concerns, economic disruption, and lack of information and supplies [7].

3.2 Coping Antecedents

Coping antecedents refer to environmental, social, and personal variables that influence the appraisal of a situation and the choice of coping strategies [4, 27]. Antecedents to coping include factors that may restrict or broaden access to coping resources, which in turn affect the coping expression and the appraisal of the situation [27, 28]. For this study, personal variables such as age, gender, marital status, health condition, and employment status were analyzed as coping antecedents to establish their influence on the coping process.

3.3 Mediating Process – Appraisal

Individuals or groups do not react to a stressful situation in the same way. Appraisal accounts for the differences in the reaction and assessment of individuals towards the same stressful situation [4, 29]. Personal differences lead individuals to assess the same situation differently: while one may assess a situation as a source of harm or threat, another may assess it as a challenge. The way an individual evaluates the situation mediates subjective well-being [4, 29, 30].

3.4 Mediating Process – Social Media Coping Expression

As the 2020 pandemic lockdown limited face-to-face interaction, social media became one of the predominant tools for social interaction [3, 5–7]. This study looks into social media self-expression as a means of coping during the pandemic lockdown and assesses its relationship to the appraisal of the lockdown as harmful, a threat, or a challenge, and ultimately its impact on subjective well-being. An initial consolidation of social media self-expressions gathered from Tweets related to the pandemic lockdown in the Philippines from March 2020 to May 2020 was used as the basis for identifying themes of social media self-expression. These were categorized into six themes of self-expression, namely: daily activities; self-reflections, memories, and plans; feelings and emotions; faith and religion; support for others; news and information.

3.5 Coping Outcome

The effectiveness of the coping process ultimately impacts subjective well-being [29]. Subjective well-being encompasses multiple facets of life, including but not limited to physical, social, psychological, economic, and emotional aspects. Subjective well-being refers to overall life satisfaction, the presence of positive emotions, and the absence of negative emotions [22]. The five-item World Health Organization Well-being Index (WHO-5) was used as the basis for formulating the questionnaire on coping outcomes. The tool was published in 1998, was found to be clinically valid, and is widely accepted across diverse fields of study [31].
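As a point of reference, the conventional WHO-5 scoring can be sketched as follows. This is an illustration in Python, not the study’s procedure: the text says only that WHO-5 served as the basis for the questionnaire, and the well-being means reported in the result tables suggest that a different response scale was used. In the standard instrument, each of the five items is rated from 0 to 5, the ratings are summed to a raw score of 0 to 25, and the raw score is multiplied by 4 to give a score from 0 to 100. The function name is hypothetical.

```python
def who5_percentage_score(item_ratings):
    """Standard WHO-5 scoring: five items rated 0-5, raw sum (0-25) times 4."""
    if len(item_ratings) != 5:
        raise ValueError("WHO-5 has exactly five items")
    for rating in item_ratings:
        if not 0 <= rating <= 5:
            raise ValueError("each item is rated on a 0-5 scale")
    raw_score = sum(item_ratings)  # ranges from 0 to 25
    return raw_score * 4           # ranges from 0 to 100


# Invented ratings for one hypothetical respondent (not data from the study).
print(who5_percentage_score([3, 2, 4, 1, 3]))
```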
4 Methodology

4.1 Data Gathering

There were two main sources of data for this analysis. The first set of data was gathered from Tweets posted from March 15, 2020 to June 20, 2020, during the Philippine pandemic lockdown, with geocoordinates falling within the Philippines. A total of 7,645 Tweets were analyzed and categorized into themes based on criteria agreed upon among 8 coders. The resulting themes were used as the basis for formulating questions on social media coping expression during the lockdown. These themes were narrowed down to six self-expressions with the following topics: Everyday activities; Reflections, memories, and plans; Feelings and emotions; Faith and religion; Support to others; News and information.

The second source of data was an online survey conducted from March 2021 to April 2021, in which respondents were asked to recall their experience during the 2020 pandemic lockdown and answer the questionnaire accordingly. There were a total of 72 respondents, with 49% from the 18–24 age range, 24% from the 25–34 age range, and 20% from the 45–54 age range. The majority of the respondents were college students (82%), single (82%), and female (63%). Given the relatively small sample size, the Wilcoxon-Mann-Whitney (WMW) test was used to analyze differences in response by segment, since the WMW test works well with small datasets [32, 33].

4.2 Survey Questionnaire

Appraisal. The questionnaire on appraisal was based on the 5-scale Stress Appraisal Measure (SAM) developed by Dr. Paul Wong and Dr. Edward Peacock [34]. Respondents were asked to go back to their experience during the 2020 pandemic lockdown when answering the questionnaire. The terms “experience”/“situation” in the original questionnaire were replaced with “during the 2020 pandemic lockdown”.

Social Media Coping Expressions. The social media coping expressions used in the survey were based on the tweets gathered during the pandemic lockdown and categorized into six themes: Self-expression of everyday activities; Self-expression of reflections, memories, and plans; Self-expression of feelings and emotions; Self-expression of faith and religion; Self-expression of support to others; Self-expression of news and information. Respondents were asked to report how frequently they used these themes in their social media postings during the 2020 pandemic lockdown. The frequency options were: Never; Less-than-half-of-the-time; More-than-half-of-the-time; Most-of-the-time; All-the-time.

Subjective Well Being. Subjective well-being was measured based on the World Health Organization Well-Being Index (WHO-5). The questionnaire consisted of 5 statements, and respondents were asked to report how often they experienced those statements during the pandemic lockdown.

Social Media Usage. Respondents were asked to assess their frequency of usage for each of the four most popular social media platforms in the Philippines (Facebook, Instagram, Twitter, and TikTok), plus an Others option, before and during the pandemic lockdown. The frequency options were: Daily; 4–5x weekly; 1–3x weekly; At least 1x per month; Less than 1x per month; Never.

Access to Social Media. Respondents were asked to report the type of internet connection used during the 2020 pandemic lockdown (Prepaid SIM card; Postpaid SIM card; Pocket Wifi; Free Wifi; Home Internet Plan) and the quality of their internet connection during the same period.

4.3 Data Analysis

Responses to questions employing Likert-type scales were converted to integer values, with zero as the starting value. The open-source R programming language was used to process all data.

Multiple linear regression analysis was used to analyze the relationship among antecedent variables, appraisal, coping expressions, and subjective well-being. The Wilcoxon-Mann-Whitney (WMW) test was used to determine the significance of differences in coping appraisal, expressions, and outcome when cross-tabulated against the personal variables. The WMW test is a non-parametric test that is appropriate for small datasets and does not require the data to be normally distributed [32, 33].
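The two steps described above, coding Likert-type responses as integers starting at zero and comparing two groups with the WMW test, can be sketched as follows. The study processed its data in R; this is a Python stand-in using only the standard library, and it is an illustration rather than the authors’ code. The function names and the sample responses are hypothetical. The p-value uses the normal approximation with a tie correction and no continuity correction, which is an assumption of this sketch.

```python
import math

# Ordered response options from the survey; list position = zero-based code.
FREQUENCY_OPTIONS = [
    "Never",
    "Less-than-half-of-the-time",
    "More-than-half-of-the-time",
    "Most-of-the-time",
    "All-the-time",
]


def likert_to_int(response):
    """Map a response label to its zero-based integer code."""
    return FREQUENCY_OPTIONS.index(response)


def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def wmw_test(x, y):
    """Return (U statistic of x, two-sided p-value by normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of (t^3 - t) over each group of tied values.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every observation is identical
        return u, 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p_value


# Invented responses for two hypothetical groups (not data from the study).
group_a = ["Never", "Never", "Less-than-half-of-the-time", "Never",
           "More-than-half-of-the-time"]
group_b = ["Most-of-the-time", "Less-than-half-of-the-time", "All-the-time",
           "More-than-half-of-the-time"]
u_stat, p = wmw_test([likert_to_int(r) for r in group_a],
                     [likert_to_int(r) for r in group_b])
print(f"U = {u_stat}, p = {p:.3f}")
```

Because Likert codes produce many ties and the groups are small, a continuity-corrected test such as R’s `wilcox.test` (which applies the correction by default) can give slightly different p-values than this approximation.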
5 Result

5.1 Social Media Self-expression Versus Appraisal and Subjective Well-Being

Among the social media coping self-expressions included in the study, self-expression of faith and religion emerged as the only self-expression significantly correlated with the appraisal of the lockdown as harmful, with a p-value of 0.0009. Its regression coefficient of −0.35 indicates an inverse relationship with the appraisal of the lockdown as harmful (Table 1). For the appraisal of the lockdown as a threat, self-expression of feelings and emotions emerged with the strongest significance, at a p-value of 0.0039 and with a positive coefficient of 0.43. Self-expression of reflections, memories, and plans and self-expression of faith and religion were likewise significantly associated with threat appraisal, both with negative coefficients. Appraisal of the lockdown as a challenge did not show a significant correlation with any form of coping self-expression. In terms of the direct relationship of coping self-expression to subjective well-being, self-expression of faith and religion emerged as significant and positively correlated with subjective well-being, while self-expression of feelings and emotions emerged as significant with a negative correlation to subjective well-being.

Table 1. Multiple regression analysis: social media coping expression versus appraisal and subjective well-being

                                Appraisal as harm              Appraisal as threat
                                Regression                     Regression
                                coefficient    P value         coefficient    P value
Intercept                       2.20           0.000           1.97           0.000
Everyday activities             0.12           0.357           0.06           0.613
Reflections, memories, plans    −0.17          0.211           −0.30          0.025 *
Feelings and emotions           0.29           0.051           0.43           0.004 **
Faith and religion              −0.35          0.001 ***       −0.21          0.032 *
Support to others               0.10           0.594           0.11           0.549
News and information            0.00           0.998           −0.06          0.688

                                Appraisal as challenge         Subjective well-being
                                Regression                     Regression
                                coefficient    P value         coefficient    P value
Intercept                       1.93           0.000           1.61           0.000
Everyday activities             0.05           0.448           0.10           0.263
Reflections, memories, plans    −0.11          0.154           0.02           0.806
Feelings and emotions           0.02           0.782           −0.23          0.029 *
Faith and religion              0.02           0.760           0.20           0.004 **
Support to others               0.02           0.838           −0.02          0.851
News and information            0.02           0.828           −0.05          0.651

Appraisal of the pandemic lockdown as harmful showed strong significance as a predictor of subjective well-being, with a negative regression coefficient (Table 2). This implies that increased perception of the pandemic lockdown as harmful reduced the subjective well-being of the respondents. Appraisal of the lockdown as a challenge emerged as a significant positive predictor of subjective well-being: respondents who saw the positive side of the lockdown recorded better subjective well-being. Although Table 1 showed that some self-expressions are correlated with appraisal as a threat, appraisal as a threat was not significantly correlated with subjective well-being.

Table 2. Multiple regression analysis: appraisal versus subjective well-being

                                Subjective well-being
                                Regression
                                coefficient    P value
Intercept                       2.32           0.0000
Appraisal as harm               −0.41          0.0007 ***
Appraisal as threat             −0.14          0.2383
Appraisal as challenge          0.27           0.0447 *
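To make the Table 2 coefficients concrete, the fitted equation can be evaluated for a given set of appraisal scores. The sketch below, written in Python for illustration, plugs in the group mean appraisal scores reported in Table 3 for the two age brackets. This is only a reading aid under stated assumptions: the coefficients are rounded as printed, the regression was fitted on the whole sample rather than per age group, and the function name is hypothetical.

```python
# Coefficients as printed in Table 2 (outcome: subjective well-being).
TABLE2_COEFFICIENTS = {
    "intercept": 2.32,
    "harm": -0.41,
    "threat": -0.14,
    "challenge": 0.27,
}


def predict_wellbeing(harm, threat, challenge, coef=TABLE2_COEFFICIENTS):
    """Evaluate the fitted linear equation for one set of appraisal scores."""
    return (coef["intercept"]
            + coef["harm"] * harm
            + coef["threat"] * threat
            + coef["challenge"] * challenge)


# Mean appraisal scores (harm, threat, challenge) as printed in Table 3.
younger = predict_wellbeing(harm=2.77, threat=2.40, challenge=2.04)
older = predict_wellbeing(harm=1.60, threat=1.60, challenge=1.89)

print(f"18-35 bracket: predicted well-being {younger:.2f}")    # 1.40
print(f"36 & up bracket: predicted well-being {older:.2f}")    # 1.95
```

The predicted values of about 1.40 and 1.95 fall in the same order as the mean subjective well-being scores reported in Table 3 for the two brackets (1.32 and 2.18), which is consistent with harm appraisal carrying most of the difference.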
5.2 Influence of Coping Antecedents on Appraisal, Coping Expression, and Outcome

Age emerged as a differentiating factor in social media coping self-expression. The age group 18–35 showed a significantly lower frequency of self-expression of faith and religion compared to those belonging to the 36 & up age bracket. In terms of appraisal, respondents belonging to the 18–35 age bracket showed a significantly higher appraisal of the pandemic lockdown as harmful and as a threat compared to those belonging to the 36 & up age bracket. Correspondingly, respondents from the 18–35 age bracket exhibited lower subjective well-being than those belonging to the 36 & up age bracket (Table 3).

Self-expression of faith and religion emerged as the only self-expression that showed significant differentiation by gender (Table 4). Female respondents exhibited more frequent social media expressions of faith and religion than male respondents. For all the other themes of social media self-expression, results did not show any significant difference between genders. Likewise, no gender difference was observed in the appraisal of the lockdown as harmful, a threat, or a challenge, and subjective well-being did not differ significantly by gender.

Differences in marital status did not result in a significant difference in the frequency of social media coping self-expression across the six themes included in this study (Table 5). However, the Single group exhibited a markedly higher appraisal of the lockdown as harmful and as a threat, and lower subjective well-being, than the Married group.
Table 3. Test of difference in means: by age
Ho: Difference in mean is equal to zero
Ha: Difference in mean is not equal to zero

                                Mean score
                                18–35 Y.O.    36 & Up Y.O.    WMW statistics    P-value
N                               52            20
Everyday activities             1.77          1.80            515               0.95
Reflections, memories, plans    1.37          1.45            493               0.73
Feelings and emotions           1.56          1.40            44                0.75
Faith and religion              0.77          1.70            330               0.01 **
Support to others               1.81          1.45            595               0.33
News and information            1.92          1.40            627               0.17
Appraisal as harm               2.77          1.60            813               0.00 ***
Appraisal as threat             2.40          1.60            734               0.01 **
Appraisal as challenge          2.04          1.89            585               0.41
Subjective well-being           1.32          2.18            236.50            0.00 **
Table 4. Test of difference in means: by gender
Ho: Difference in mean is equal to zero
Ha: Difference in mean is not equal to zero

                                Mean score
                                Female        Male            WMW statistics    P-value
N                               46            25
Everyday activities             1.89          1.56            648               0.36
Reflections, memories, plans    1.52          1.12            667               0.24
Feelings and emotions           1.65          1.24            656               0.31
Faith and religion              1.30          0.56            735               0.03 **
Support to others               1.76          1.64            604               0.73
News and information            1.76          1.80            563               0.89
Appraisal as harm               2.52          2.26            661               0.30
Appraisal as threat             2.27          1.98            661               0.30
Appraisal as challenge          1.94          2.07            513               0.45
Subjective well-being           1.54          1.59            558               0.84
Table 5. Test of difference in means: by marital status
Ho: Difference in mean is equal to zero
Ha: Difference in mean is not equal to zero

                                Mean score
                                Married       Single          WMW statistics    P-value
N                               13            59
Everyday activities             1.54          1.83            343               0.53
Reflections, memories, plans    1.31          1.41            375               0.90
Feelings and emotions           1.08          1.61            309               0.25
Faith and religion              1.54          0.92            479               0.11
Support to others               1.31          1.80            308               0.26
News and information            1.38          1.86            314               0.30
Appraisal as harm               1.42          2.67            161               0.00 ***
Appraisal as threat             1.35          2.36            186               0.00 ***
Appraisal as challenge          1.77          2.05            314               0.31
Subjective well-being           2.18          1.42            563               0.01 **
Other demographic variables such as employment type and health status were also gathered. However, the distribution of observations across sub-segments left too few observations in several of them for meaningful analysis (Table 6).

Table 6. Distribution of other personal variables

By employment type              Number observations    % Share
Employed-reported to office     9                      13%
Employed-work-from-home         29                     40%
Home Maker                      4                      6%
Student-enrolled                28                     39%
Unemployed                      2                      3%
Total                           72                     100%

By health status                Number observations    % Share
Healthy                         65                     90%
Sick or immuno-compromised      7                      10%
Total                           72                     100%
Social Media Self-expression as Form of Coping
6 Discussion

Among the themes of social media coping self-expression analyzed in this study, faith and religion showed the strongest significant correlation with the appraisal of the lockdown as harmful, and that correlation was negative. Likewise, coping self-expression of faith and religion registered a strong positive association with subjective well-being. This result is consistent with the findings of a systematic review of studies on religious coping from 1986 to 2016, in which reliance on religion to cope with life challenges was found to be highly prevalent (68%) and associated with better mental health [35]. Studies further showed that better mental health was observed among those with a higher level of religious commitment [36] in countries where the practice of religion is accepted, while the reverse was observed in countries where the practice of religion was suppressed [35]. Religious coping refers to the reliance on religious faith and activities to understand and find meaning in the difficulties and advantages of life [35]. This result is not surprising since 90% of the Philippine population are Christians [37]. The strong but inverse association of coping expression of faith and religion with harm and threat, and its positive association with subjective well-being, can be explained by the mechanism of religious coping, wherein faith and religion provide a sense of higher meaning and purpose to the pandemic lockdown.

Social media coping self-expression of feelings and emotions showed the highest association with the appraisal of the lockdown as a threat and registered an inverse association with subjective well-being. Emotion-focused coping is commonly used when the situation is judged to be unalterable and outside the control of the individual [4]. This form of coping includes the processing and expression of emotions. Studies have shown that emotion-focused coping resulted in negative emotion and a higher level of distress [36].
The results of this study show that expressing emotions on social media during the pandemic lockdown increased the perception of the lockdown as a threat and consequently reduced subjective well-being. The increase in distress can be attributed to the tendency of emotion-focused coping to become ruminative and to foster a focus on negative feelings over a long period [36]. Self-expression of feelings and emotions magnified the appraisal of the pandemic lockdown as a threat and consequently reduced subjective well-being.

Respondents who appraised the pandemic lockdown as harmful reported lower subjective well-being, while respondents who appraised the pandemic lockdown as a challenge reported higher subjective well-being. This result is consistent with the literature on coping appraisal and outcome. Appraisal of a situation as harm is linked with distress and anxiety, consequently resulting in lower levels of well-being [38]. On the other hand, appraisal of a situation as a challenge implies that one considers the event an occasion for growth and improvement, which leads to improved subjective well-being [38].

Among the personal factors considered in the study, age and gender emerged as sources of differentiation in coping strategy and subjective well-being. The younger age group (35 years old and below) reported less frequent faith/religious social media coping self-expression than the older age group (36 years old and above). Consistent with the general result of this study, the older age group who reported more frequent social media
self-expression of faith and religion appraised the pandemic lockdown as less harmful and reported higher subjective well-being. In terms of gender, female respondents reported more frequent use of faith/religious coping self-expression than their male counterparts; however, this did not translate into differences in appraisal or subjective well-being.
7 Limitations

This study has the following limitations. First, data gathering was conducted one year after the event itself (the 2020 pandemic lockdown). The survey was conducted from March 2021 to April 2021, and respondents were asked to recall their activities and how they reacted during the lockdown from March 2020 to May 2020. The time that elapsed between the event and the survey may affect the accuracy of recall, and experiences during that one-year period may have colored the objectivity of the respondents' statements. Second, the survey was administered among the contacts of the researcher, which yielded a modest number of observations. Although the data analysis techniques used are appropriate for a small sample size, the conclusions of this study cannot be generalized to the larger Philippine population. Third, all the constructs in the questionnaire were assessed using self-report questions, which may result in self-report bias.
8 Conclusions

This study investigated how different themes of social media self-expression are associated with the coping process during the 2020 pandemic lockdown in the Philippines. In particular, the study looked at how different themes of social media self-expression are correlated with the appraisal of the pandemic lockdown as a harm, a threat, or a challenge, and consequently how the type of appraisal affected subjective well-being. Among the five emerging themes of social media coping self-expression explored in this study, only two were associated with coping appraisal, namely social media self-expression of faith and religion and social media self-expression of feelings and emotions. Respondents who reported a higher frequency of religion/faith-related posts on social media during the 2020 lockdown viewed the pandemic as less harmful and reported higher subjective well-being. Cross-tabulating coping expression with age showed that the older age group (36 and up) tends to post faith and religious themes on social media more frequently and reported a lower appraisal of the pandemic lockdown as harmful and higher subjective well-being. This result suggests that facilitating access to online religious activities and support can foster an optimistic outlook and improve subjective well-being during a lockdown. This can be an area where faith groups work hand-in-hand with the government and private sectors to enable online access, especially for those with limited means, so that they can attend the religious services of their affiliations as one way of improving the mental health of the population during a lockdown.
In this study, social media self-expression of feelings and emotions emerged as directly associated with the assessment of the pandemic lockdown as a threat and inversely associated with subjective well-being. Respondents who posted feelings and emotions with more frequency showed a higher level of assessment of the pandemic as a threat and lower subjective well-being. During times of pandemic where the main focus of policymakers is physical health and economic recovery, mental health is an equally important concern. Social media postings can be analyzed to determine if there is an increasing trend in postings of negative emotions. Analysis of trends in social media postings can also be used by health care workers to determine the mental health of the population and use this as a basis in designing activities to improve the mental health of the public.
References

1. IATF: Iatf 37 (2020)
2. Zacher, H., Rudolph, C.W.: Individual differences and changes in subjective wellbeing during the early stages of the COVID-19 pandemic. Am. Psychol. 76(1), 50–62 (2021). https://doi.org/10.1037/amp0000702
3. Zhen, L., Nan, Y., Pham, B.: College students coping with COVID-19: stress-buffering effects of self-disclosure on social media and parental support. Commun. Res. Reports 38(1), 23–31 (2021). https://doi.org/10.1080/08824096.2020.1870445
4. Folkman, S., Lazarus, R.S.: The relationship between coping and emotion: implications for theory and research. Soc. Sci. Med. 26(3), 309–317 (1988). https://doi.org/10.1016/0277-9536(88)90395-4
5. Eden, A.L., Johnson, B.K., Reinecke, L., Grady, S.M.: Media for coping during COVID-19 social distancing: stress, anxiety, and psychological well-being. Front. Psychol. 11(December), 1–21 (2020). https://doi.org/10.3389/fpsyg.2020.577639
6. Zhang, Z., Zhang, L., Xiao, H., Zheng, J.: Information quality, media richness, and negative coping: a daily research during the COVID-19 pandemic. Pers. Indiv. Dif. 176, 110774 (2021). https://doi.org/10.1016/j.paid.2021.110774
7. Adamson, M.M., et al.: International prevalence and correlates of psychological stress during the global COVID-19 pandemic. Int. J. Environ. Res. Public Health 17(24), 1–16 (2020). https://doi.org/10.3390/ijerph17249248
8. Millar, E.B., et al.: Health anxiety, coping mechanisms and COVID 19: an Indian community sample at week 1 of lockdown. PLoS One 16(4), 1–14 (2021). https://doi.org/10.1371/journal.pone.0250336
9. Garfin, D.R.: Technology as a coping tool during the coronavirus disease 2019 (COVID-19) pandemic: implications and recommendations. Stress Heal. 36(4), 555–559 (2020). https://doi.org/10.1002/smi.2975
10. Gao, G., Sai, L.: Towards a ‘virtual’ world: social isolation and struggles during the COVID-19 pandemic as single women living alone. Gender, Work Organ. 27(5), 754–762 (2020). https://doi.org/10.1111/gwao.12468
11. Folkman, S., Lazarus, R.S.: An analysis of coping in a middle-aged community sample. J. Health Soc. Behav. 21(3), 219–239 (1980)
12. Lazarus, R.S.: Emotions and interpersonal relationships: toward a person-centered conceptualization of emotions and coping. J. Pers. 74(1), 9–46 (2006). https://doi.org/10.1111/j.1467-6494.2005.00368.x
13. Oakland, S., Ostell, A.: Measuring coping: a review and critique. Hum. Relations 49(2), 133–155 (1996). https://doi.org/10.1177/001872679604900201
14. Lazarus, R.S.: Theory-based stress measurement. Psychol. Inquiry 1(1), 3–13 (2016). http://www.jstor.org/stable/1449700
15. Hobfoll, S.E.: The influence of culture, community, and the nested-self in the stress process: advancing conservation of resources theory. Appl. Psychol. 50(3), 337–421 (2001). https://doi.org/10.1111/1464-0597.00062
16. Nabi, R.L., Torres, D.P., Prestin, A.: Guilty pleasure no more: the relative importance of media use for coping with stress. J. Media Psychol. 29(3), 126–136 (2017). https://doi.org/10.1027/1864-1105/a000223
17. Johnshoy, Q., et al.: Social media use following exposure to an acute stressor facilitates recovery from the stress response. Physiol. Behav. 223(April), 113012 (2020). https://doi.org/10.1016/j.physbeh.2020.113012
18. Naslund, J.A., Aschbrenner, K.A., Marsch, L.A., Bartels, S.J.: The future of mental health care: peer-to-peer support and social media. Epidemiol. Psychiatr. Sci. (2016). https://doi.org/10.1017/S2045796015001067
19. Sriwilai, K., Charoensukmongkol, P.: Face it, don’t Facebook it: impacts of social media addiction on mindfulness, coping strategies and the consequence on emotional exhaustion. Stress Heal. (2016). https://doi.org/10.1002/smi.2637
20. Choi, M., Toma, C.L.: Social sharing through interpersonal media: patterns and effects on emotional well-being. Comput. Human Behav. 36, 530–541 (2014). https://doi.org/10.1016/j.chb.2014.04.026
21. Ng, J.C.Y., Shao, I.Y.T., Liu, Y.: This is not what I wanted: the effect of avoidance coping strategy on non-work-related social media use at the workplace. Empl. Relat. (2016). https://doi.org/10.1108/ER-12-2015-0216
22. Howell, A.J., Buro, K.: Measuring and predicting student well-being: further evidence in support of the flourishing scale and the scale of positive and negative experiences. Soc. Indic. Res. 121(3), 903–915 (2014). https://doi.org/10.1007/s11205-014-0663-1
23. Chen, H.T., Li, X.: The contribution of mobile social media to social capital and psychological well-being: Examining the role of communicative use, friending and self-disclosure. Comput. Human Behav. 75, 958–965 (2017). https://doi.org/10.1016/j.chb.2017.06.011
24. Ahn, D., Shin, D.H.: Is the social use of media for seeking connectedness or for avoiding social isolation? Mechanisms underlying media use and subjective well-being. Comput. Human Behav. (2013). https://doi.org/10.1016/j.chb.2012.12.022
25. Bailey, E.R., Matz, S.C., Youyou, W., Iyengar, S.S.: Authentic self-expression on social media is associated with greater subjective well-being. Nat. Commun. 11(1), 1–9 (2020). https://doi.org/10.1038/s41467-020-18539-w
26. Rimé, B.: The social sharing of emotion as an interface between individual and collective processes in the construction of emotional climates. J. Soc. Issues 63(2), 307–322 (2007). https://doi.org/10.1111/j.1540-4560.2007.00510.x
27. Scheck, C.L., Kinicki, A.J.: Identifying the antecedents of coping with an organizational acquisition: a structural assessment. J. Organ. Behav. 21(6), 627–648 (2000). https://doi.org/10.1002/1099-1379(200009)21:6%3c627::AID-JOB43%3e3.0.CO;2-D
28. Meyer, I.H., Schwartz, S., Frost, D.M.: Social patterning of stress and coping: Does disadvantaged social statuses confer more stress and fewer coping resources? Soc. Sci. Med. 67(3), 368–379 (2008). https://doi.org/10.1016/j.socscimed.2008.03.012
29. Biggs, A., Brough, P., Drummond, S.: Lazarus and Folkman’s Psychological Stress and Coping Theory. The Handbook of Stress and Health (2017)
30. Kristofferzon, M.-L., Engström, M., Nilsson, A.: Coping mediates the relationship between sense of coherence and mental quality of life in patients with chronic illness: a cross-sectional study. Qual. Life Res. 27(7), 1855–1863 (2018). https://doi.org/10.1007/s11136-018-1845-0
31. Topp, C.W., Østergaard, S.D., Søndergaard, S., Bech, P.: The WHO-5 well-being index: a systematic review of the literature. Psychother. Psychosom. 84(3), 167–176 (2015). https://doi.org/10.1159/000376585
32. Guo, J.: Optimal sample size planning for the Wilcoxon–Mann–Whitney and van Elteren, 39(10), 2153–2164 (2012)
33. Vidal-Salazar, M.D., Ferrón-Vílchez, V., Cordón-Pozo, E.: Coaching: an effective practice for business competitiveness. Compet. Rev. 22(5), 423–433 (2012). https://doi.org/10.1108/10595421211266302
34. Peacock, E.J., Wong, P.T.P.: The stress appraisal measure (SAM): a multidimensional approach to cognitive appraisal. Stress Med. 6(3), 227–236 (1990). https://doi.org/10.1002/smi.2460060308
35. Pargament, K., Brant, C.: Religion and coping, pp. 111–128 (1998). https://doi.org/10.1016/b978-012417645-4/50011-0
36. Folkman, S., Moskowitz, J.T.: Coping: pitfalls and promise. Annu. Rev. Psychol. 55, 745–774 (2004). https://doi.org/10.1146/annurev.psych.55.090902.141456
37. Philippines Statistical Authority: Philippines statistical yearbook, 2019. Philipp. Stat. Auth., p. 628 (2019)
38. Gardner, D., Fletcher, R.: Demands, appraisal, coping and outcomes: positive and negative aspects of occupational stress in veterinarians. Int. J. Organ. Anal. 17(4), 268–284 (2009). https://doi.org/10.1108/19348830910992095
Building Wikipedia N-grams with Apache Spark

Armin Esmaeilzadeh, Jorge Ramón Fonseca Cacho, Kazem Taghva, Mina Esmail Zadeh Nojoo Kambar, and Mahdi Hajiali

University of Nevada Las Vegas, Las Vegas, NV 89119, USA
{esmaeilz,esmailza}@unlv.nevada.edu, {jorge.FonsecaCacho,kazem.taghva,mahdi.hajiali}@unlv.edu
Abstract. With the rise of natural language processing models over the last two decades, there is a need to construct large pre-processed datasets such as N-grams for language modeling. N-grams, which are defined as contiguous sequences of N words, are useful in numerous language models. In this work, we introduce a distributed data processing approach to build an N-gram corpus, which is many times faster than previous single-machine techniques.

Keywords: Distributed systems · Big data · Apache Spark · N-gram · Natural language processing · NLP · Language models

1 Introduction
The field of Natural Language Processing (NLP) has expanded rapidly over the last two decades with the introduction of new language models, such as word2vec and transformers, that can capture syntactic and semantic relationships in text documents [1]. NLP applications range from sentiment analysis [2] and detecting fake news on social media platforms [3] to more critical applications such as identifying misinformation regarding the COVID-19 pandemic [4]. A common requirement for these models is high-quality text data, so that they capture the context or ideas that occur in the literature or the world more accurately [5]. At the same time, the tools and sources that generate text data have increased significantly, and there is a need to process text data in a more efficient and timely manner [1]. High-quality text datasets can play an important role in machine learning models for medical applications, as described in [6]. These cleaned datasets can also simplify the process of making the models publicly accessible [7].

For any given text document, N-grams are defined as contiguous sequences of N words. N-grams have historically been one of the main features of text documents used in a variety of language modeling applications such as machine translation, speech recognition, and Optical Character Recognition (OCR) [8]. N-grams can also be used to build document embeddings for text classification or topic modeling [9].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): SAI 2022, LNNS 507, pp. 672–684, 2022. https://doi.org/10.1007/978-3-031-10464-0_45
Building N-grams with Spark
Given the demand for high-quality text features in large corpora, Google open-sourced an N-gram collection in 2006 [8]. This collection includes 1 trillion words extracted from public web pages. However, this corpus has not been updated with more recent documents. Moreover, the source code used to generate N-grams on large datasets is not available for other public sources such as [10]. One recent work on generating N-grams from a large corpus is [11].

In this work, we revisit the N-gram generation technique proposed by [11] and focus on reducing the run time by introducing a distributed technique to process large-scale datasets. We provide tools for the entire data transformation process to clean the data and generate N-grams on a cluster of nodes running Apache Spark, and we obtain a run time many times shorter than that of the prior work. In Sect. 2, we discuss the publicly available N-gram collections and their limitations. In Sect. 3, we outline the Apache Spark architecture for data processing. In Sect. 4, we describe the dataset used in our experiments, the configuration of the cluster, the processing steps, and the final run time cost of the N-gram extraction program. Finally, in the conclusion (Sect. 5), we summarize our attempt to improve the run time and space complexity of previous approaches.
2 Related Works
To the best of our knowledge, there are only two publicly available large-scale N-gram data sources. The first was published by Google in 2006 [8]. It contains 1 trillion word tokens extracted from public web pages. The second resource was developed by [10] and is extracted from the Corpus of Contemporary American English, which includes more than 1 billion word tokens from eight genres. However, the data processing pipelines that generate these N-grams from the raw texts are not open-source, and therefore we have no visibility into the techniques or the run time efficiency of these systems. Given the growth of text data in different domains and the need to process text data at a large scale to feed language models, there is a demand for data pipelines that can generate N-grams or other features from new raw text data in a more efficient and scalable manner.

The one study that introduced an approach to extract and generate N-grams and that has an open-source code repository is [11]. In that research, the Wikipedia archive file was created on 02-Nov-2019 and had a size of 74.4 GB in its XML format. The authors used the WikiExtractor tool to extract articles or documents from the XML format. WikiExtractor took 106.7 min to extract 5,963,484 articles on a machine with an i7-6700k processor and 24 GB of RAM. After content extraction, a further sanitizing step, such as removing non-ASCII characters, was performed to prepare the data for N-gram extraction. This step added another 2:41:07 min to process 2,184,219,778 words [11]. After the content extraction, cleaning, and sanitizing steps, the main N-gram generation step is performed. Given a text document, the N-gram process is to
extract words in a sliding window of size N. The first approach proposed by [11] was to use a MySQL relational database to create tables for 1-grams to 5-grams, with a strong consistency scheme that used the 1-grams as the primary key in the 1-gram table and as foreign keys in all the other tables. This approach was discontinued due to the high run time cost of inserting data into the database. The second proposed approach was to read the cleaned data files into memory with Python, extract N-grams, and use the Timsort algorithm to remove duplicates and compute N-gram frequencies. This approach is detailed in the paper and took approximately 35 h to complete [11].

The main limiting factor of these approaches is the use of a single machine to extract and process text data. Even with access to more powerful hardware on one machine, the run time of these approaches is too high. In this study, we introduce a distributed approach that processes text data using the Apache Spark engine on a cluster of 10 worker nodes, which lowers the run time from 35 h to less than 1 h.
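The sliding-window extraction and frequency counting described above can be sketched on a single machine as follows. This is only an illustration: it counts with a hash-based `Counter` rather than the sort-based deduplication used in [11], and the function names and sample documents are hypothetical.

```python
from collections import Counter


def extract_ngrams(words, n):
    """Return every run of n consecutive words as a tuple, in order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(documents, max_n=5):
    """Count 1-grams up to max_n-grams over a list of tokenized documents."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for words in documents:
        for n in range(1, max_n + 1):
            counts[n].update(extract_ngrams(words, n))
    return counts


docs = [["the", "quick", "brown", "fox"], ["the", "quick", "red", "fox"]]
counts = count_ngrams(docs, max_n=2)
print(counts[2][("the", "quick")])  # 2
```

A document shorter than N simply contributes no N-grams, because the window never fits.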
3 Apache Spark
MapReduce [12] is one of the main frameworks for large-scale data processing applications and batch jobs. Given its distributed architecture and its ability to run on commodity hardware, it became a more robust and cheaper alternative to single-machine databases, which were limited by the CPU power of one machine. There have been many efforts over the years to parallelize sequential algorithms based on the MapReduce framework, which could significantly reduce the run time cost [13]. However, as the scale and diversity of data-intensive applications grew even further, it became apparent that the MapReduce framework had limitations due to its design and implementation [14]. The system is designed for jobs based on an acyclic data flow model, and it is not suitable for workloads that reuse the same dataset across multiple jobs or parallel tasks [15]. The main bottleneck of this design is that intermediate datasets are stored on disk so that other tasks can access and transform them, which is costly in both time and space for terabytes of data.

To address some of the limitations of the MapReduce framework, Apache Spark was released by [14] as a distributed in-memory data processing engine. This framework retains the scalability and fault tolerance of MapReduce and implements a new programming paradigm and data abstraction called Resilient Distributed Datasets (RDD), which can keep large-scale working datasets in memory to boost data access performance [14]. Even though there have been other attempts to store datasets in memory for reuse, most of the other programming interfaces that support fault tolerance for in-memory datasets follow the paradigm of fine-grained updates to shared data [14]. In this approach, the datasets have to be replicated across machines and kept synchronized to remain fault-tolerant. This technique, however, is very expensive for large-scale datasets because of the much higher storage cost and the higher network bandwidth needed to copy datasets [14].
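The MapReduce programming model discussed above can be illustrated with a small in-memory sketch of its three phases. It is not the MapReduce or Hadoop API: the function names are hypothetical, and everything runs in one process, whereas the real framework writes the intermediate pairs to disk between phases, which is the bottleneck noted above.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["spark builds n grams", "spark counts n grams"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts["spark"])  # 2
```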
3.1 Resilient Distributed Datasets (RDD)
The RDD data abstraction in Apache Spark is defined as an object holding a collection of read-only, partitioned data that is distributed across the worker nodes of a cluster and kept in memory [15]. One of the main benefits, and probably the core property, of RDDs is that they rely on a Directed Acyclic Graph (DAG) of coarse-grained transformations on datasets instead of fine-grained updates to shared data. In this approach, each RDD stores its direct parent RDDs as well as the transformations that need to be performed on those parents to construct the current RDD [15]. This property makes the entire data pipeline of RDDs fault-tolerant, since any failed RDD can trace back its lineage and rebuild all the necessary dependent RDDs without storing the updated datasets on disk. Table 1 compares RDDs with distributed shared memory.

Table 1. RDD comparison [15].

Aspect | RDDs | Distr. Shared Mem.
Reads | Coarse- or fine-grained | Fine-grained
Writes | Coarse-grained | Fine-grained
Consistency | Trivial (immutable) | Up to app/runtime
Fault recovery | Fine-grained and low-overhead using lineage | Requires checkpoints and program rollback
Straggler mitigation | Possible using backup tasks | Difficult
Work placement | Automatic based on data locality | Up to app (runtimes aim for transparency)
Behavior if not enough RAM | Similar to existing data flow systems | Poor performance (swapping?)
As shown in Table 2, the RDD interface supports the most common data transformations on dataset objects. These operations are divided into Transformations and Actions. Transformation operations such as map and filter are evaluated lazily: they are not executed when they are called, but only once an Action operation is requested on the RDD. Lazy evaluation allows the Spark engine to build a DAG of all Transformations and, once all the requirements are known, to optimize the graph before it is executed. Action operations such as sum or count that are applied to an RDD trigger the actual execution of the DAG, which in turn either returns values to the driver program or stores them on a storage system [14,15].
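The split between lazy Transformations and eager Actions can be mimicked in a few lines of plain Python. The class below is a toy stand-in written for illustration, not Spark's RDD implementation; `LazyDataset`, `square`, and `calls` are hypothetical names, and the sketch ignores partitioning and optimization entirely.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, lineage=()):
        self._source = source      # the original input records
        self._lineage = lineage    # recorded (kind, function) steps

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def _run(self):
        # Replay the recorded lineage over the source, one record at a time.
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        # Action: triggers execution and returns the results.
        return list(self._run())

    def count(self):
        # Action: triggers execution and returns the number of records.
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # record each real invocation
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
print(len(calls))         # 0, nothing has run yet
print(squares.collect())  # [9, 16]
print(len(calls))         # 4, the action replayed the lineage
```

Because each dataset keeps only its source and the recorded steps, any result can be rebuilt by replaying that lineage, which is the idea behind lineage-based fault recovery.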
Table 2. Operations available on RDD in Apache Spark [15].

Transformations:
map(f : T ⇒ U)
filter(f : T ⇒ Bool)
flatMap(f : T ⇒ Seq[U])
sample(fraction : Float)
groupByKey()
reduceByKey(f : (V,V) ⇒ V)
union()
join()
cogroup()
crossProduct()
mapValues(f : V ⇒ W)
sort(c : Comparator[K])
partitionBy(p : Partitioner[K])

Actions:
count()
collect()
reduce(f : (T,T) ⇒ T)
lookup(k : K)
save(path : String)
Aside from the RDD interface for accessing and transforming data, Spark provides a higher-level API on top of the RDD API, known as the SQL API for DataFrames. A DataFrame and its supported methods are very similar to SQL tables and statements. Some of the common operations defined in the Spark SQL library are referenced in [16]. We compare the performance of the RDD and SQL APIs in Sect. 4.

3.2 Apache Spark Architecture
The Apache Spark cluster, as shown in Fig. 1, has a driver node, which controls application execution, and a number of worker nodes. The worker nodes execute the tasks sent to them by the driver program in parallel on RDD partitions. Once the computations are done, the worker nodes send the results back to the driver program. The initial Spark implementation used Apache Mesos [18] as the cluster operating system to share resources [14], but newer versions of Spark support other schedulers such as YARN and Kubernetes [17].
Fig. 1. Spark cluster [14].
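The driver and worker roles can be imitated on one machine with a thread pool. The sketch below is only an analogy for the flow of tasks and results, not Spark code: the partitioning scheme, the function names, and the sample lines are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Task run by a worker: count the words in one partition."""
    return sum(len(line.split()) for line in partition)


lines = ["a b c", "d e", "f", "g h i j"]
partitions = split_into_partitions(lines, num_partitions=2)

# The "driver" sends one task per partition to a pool of workers and
# combines the partial results they return.
with ThreadPoolExecutor(max_workers=2) as workers:
    partial_counts = list(workers.map(count_words, partitions))

total_words = sum(partial_counts)
print(total_words)  # 10
```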
4 Experiment
In this study, we perform multiple experiments with different job parameters, such as the number of Spark worker nodes and the data transformation API, to process an English Wikipedia archive file and create an N-gram corpus. The corpus includes 1-grams to 5-grams extracted directly from the raw Wikipedia XML file. In the following sections, we discuss each step in more detail.

4.1 Data Set
The dataset used in this experiment is the English Wikipedia archive file dated 21-Jan-2021. The compressed version of the file is in Zip format with a size of 18 GB. Uncompressed, it yields an XML file with a size of 78 GB, which contains all the articles and their related metadata. This dataset is roughly 4 GB larger than the 74.4 GB Wikipedia dataset used in [11], since it is a newer archive version that includes the articles published since the previous archive.

4.2 Cluster Setup
We used a Spark cluster of 10 nodes with the hardware configuration shown in Table 3. The storage environment for the cluster is backed by HDFS [20], which runs on the same nodes as the Spark cluster.

4.3 Data Processing
There are three main data processing steps in the Spark application. First, we convert the Wikipedia archive XML file to a binary file format called Parquet [19], which we will discuss in the following section. Then, we perform data cleaning transformations on Parquet files to extract articles and remove unnecessary words or characters. Finally, we perform the N-gram extraction from cleaned data. We will discuss these three steps as follows:
Table 3. Spark cluster setup.

Node | Server | CPU | Total vCore | Total memory (GB) | Total data storage (TB) | Network speed (GB)
Master | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 131.7 | 18.1 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 131.7 | 18.1 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 131.7 | 18.1 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 131.7 | 18.1 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 131.7 | 18.1 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 263.8 | 43.2 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 263.8 | 43.2 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 263.8 | 43.2 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 263.8 | 43.2 | 10
Worker | PowerEdge R730xd | Xeon(R) E5-2650 v4 | 48 | 263.8 | 43.2 | 10
XML to Parquet: Apache Parquet is an open-source, column-oriented binary file format with major advantages over row-oriented formats such as Comma Separated Values (CSV) for analytical workloads on large datasets [19]. Parquet files also include metadata about their content, and the Spark engine can leverage this information to push filters down to the storage files while retrieving data. This property can reduce the amount of data retrieved over the network from storage systems. To read and process XML files, we use the Spark XML library developed by Databricks [21], as shown in Listing 1.1. The root tag of the XML file is called “mediawiki”, and each article is stored under a “page” tag. We provide these as options while reading the file.

    # Assumes an active SparkSession named `spark` with the spark-xml package available.
    dataframe = (
        spark.read.format("com.databricks.spark.xml")
        .option("rootTag", "mediawiki")
        .option("rowTag", "page")
        .load('enwiki-latest-pages-articles.xml')
    )

    # Write the parsed articles out as Parquet files.
    dataframe.write.parquet("/N-gram/parquet/enwiki.parquet")

Listing 1.1. XML to Parquet.
This process transforms the single XML document into multiple Parquet files, which are faster to retrieve and transform in a distributed environment.

Data Cleaning: Once the single XML file is stored as Parquet files, we perform the data cleaning steps to remove HTML tags from the articles. We use regular expressions to remove all the HTML-related tags that are embedded in articles, as shown in Listing 1.2, and then keep only English alphabet characters and digits.
    import re

    def process_data(x):
        # <ref ...> tags, [[...]] links and {{...}} templates, and alphanumeric tokens.
        pattern1 = re.compile(r'<ref(.*?)>(.*?)')
        pattern2 = re.compile(r'\[\[([^\\]+?)\]\]|\{\{([^}}]+?)\}\}')
        pattern3 = re.compile(r'[A-Za-z]+|\d+')

        # Remove the markup, then keep only runs of letters or digits.
        x = re.sub(pattern1, '', x)
        x = re.sub(pattern2, '', x)
        x = re.findall(pattern3, x)

        return x

Listing 1.2. Data Cleaning.
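To make the effect of this cleaning step concrete, the following minimal sketch applies the same kind of patterns to a single made-up string. The names `clean_article`, `REF_TAG`, `WIKI_MARKUP`, and `TOKEN`, as well as the sample text, are illustrative assumptions rather than part of the original pipeline, and the markup is replaced with a space here (an assumption) so that neighbouring words are not merged.

```python
import re

# Hypothetical stand-ins for the patterns of Listing 1.2, written as raw strings.
REF_TAG = re.compile(r"<ref(.*?)>")                                # opening <ref ...> tags
WIKI_MARKUP = re.compile(r"\[\[([^\\]+?)\]\]|\{\{([^}}]+?)\}\}")   # [[links]] and {{templates}}
TOKEN = re.compile(r"[A-Za-z]+|\d+")                               # runs of letters or digits


def clean_article(text):
    """Strip the markup, then return the remaining alphanumeric tokens."""
    text = REF_TAG.sub(" ", text)
    text = WIKI_MARKUP.sub(" ", text)
    return TOKEN.findall(text)


# Made-up sample text, for illustration only.
sample = "Spark<ref name=a> was released in 2014 [[Apache Spark|see]] {{cite web}}."
tokens = clean_article(sample)
print(tokens)
```

Only the plain words and the number survive; the reference tag, the wiki link, and the template are dropped before tokenization.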
In order to work with the datasets, as we discussed in Sect. 2, we can use the RDD or SQL APIs. The RDD is a lower-level API that allows us to work directly on each partition, and on each row of a partition, using transformation operations, which are normally written as lambda functions. We can easily pass the data cleaning function we have defined to these lambda functions to transform every row in an RDD, as shown in Listing 1.3. We clean each RDD row and create a mapping of the form (1, x), in which 1 is the id and x is the list of words in the article.

    map_result = (
        dataframe.select(dataframe.text)
        .rdd.flatMap(lambda x: x)             # unpack each Row into its text value
        .map(lambda x: process_data(x))       # clean and tokenize the article
        .map(lambda x: (1, x))                # pair the word list with an id
    )

Listing 1.3. RDD Data Cleaning.
The SQL API is a higher-level abstraction over the dataset. The main difference from the RDD API is that the SQL API is backed by the Catalyst query optimizer, which can optimize SQL statements and queries. This API also allows us to apply our custom functions to each column of the dataframe to transform its content. As shown in Listing 1.4, we clean the content of the data with the process function, use the split function to create a list of words from the article string, and then assign a monotonically increasing id to each list of words.

    from pyspark.sql import functions

    # Assumption: process_data is wrapped here as a Spark UDF that returns the cleaned text as one string.
    map_result = (
        dataframe.select(
            functions.split(process_data(dataframe.text), ' ').alias("words")
        )
        .withColumn("id", functions.monotonically_increasing_id())
    )

Listing 1.4. SQL Data Cleaning.
Extract N-grams: The N-gram extraction is straightforward. We use the NGram class available in the Spark ML package, which is implemented using the sliding method on Scala’s Seq iterators [22]. We have already transformed the articles into lists of words. We simply pass these lists of words to the transform method of the NGram class and provide the window size, as shown in Listing 1.5.

    from pyspark.ml.feature import NGram

    # gram holds the window size of the N-grams to extract.
    ngram_object = NGram(n=gram, inputCol="words", outputCol="Ngrams")
    ngram_dataframe = ngram_object.transform(map_result)

Listing 1.5. N-gram Class.
In the case of the RDD API, we use the map and reduceByKey operations to count the frequency of each N-gram returned by the NGram class, as shown in Listing 1.6.

    ngram_result = (
        ngram_dataframe.rdd
        .map(lambda x: x['Ngrams'])           # keep only the N-gram list of each row
        .flatMap(lambda x: x)                 # one record per N-gram
        .map(lambda x: (x, 1))
        .reduceByKey(lambda x, y: x + y)      # sum the counts per N-gram
    )

Listing 1.6. RDD Transformation.
In the case of the SQL API, we use the groupby statement and then count the frequency of each N-gram, as shown in Listing 1.7.

    from pyspark.sql import functions

    ngram_result = (
        ngram_dataframe.select(
            functions.explode(functions.col('Ngrams')).alias('word')   # one row per N-gram
        )
        .groupby('word')
        .count()
    )

Listing 1.7. SQL Dataframe Transformation.
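Independently of Spark, the computation performed by the N-gram extraction and the subsequent frequency count can be sketched in plain Python. This is only an illustration of the logic on a toy input; the function names `sliding_ngrams` and `count_ngrams` and the two sample documents are assumptions, and the sketch ignores partitioning and distribution entirely.

```python
from collections import Counter


def sliding_ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_ngrams(documents, n):
    """Count N-gram frequencies across a collection of tokenized documents."""
    counts = Counter()
    for words in documents:
        counts.update(sliding_ngrams(words, n))
    return counts


# Two made-up, already tokenized documents.
docs = [
    ["the", "quick", "brown", "fox"],
    ["the", "quick", "brown", "dog"],
]
trigram_counts = count_ngrams(docs, 3)
print(trigram_counts.most_common(1))
```

A document shorter than the window size simply contributes no N-grams, which matches the behaviour one would expect from a sliding window.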
We write the results as Parquet files, which include every N-gram token and its frequency.

4.4 Run Time Evaluation
The Apache Spark configuration file allows us to define different parameters related to the job execution. Of these parameters, three are the most significant: the number of executors, the number of virtual cores, and the amount of RAM available to each executor in worker nodes [23]. Given the hardware configuration of our cluster, the maximum capacity for a Spark application is to have 50 executors each with 5 vCores and 15 GB of memory.
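As a small sketch of how the executor settings quoted above map onto Spark properties, the snippet below assembles them into a `spark-submit` command line and derives the aggregate resources they request. The property names are standard Spark configuration keys; the script name `ngram_job.py` is hypothetical, and whether the authors set these values in a configuration file or on the command line is not stated in the text.

```python
# Executor settings quoted in the text, expressed as Spark configuration properties.
executor_conf = {
    "spark.executor.instances": 50,
    "spark.executor.cores": 5,
    "spark.executor.memory": "15g",
}

# Build a spark-submit command line; "ngram_job.py" is a hypothetical script name.
submit_args = ["spark-submit"]
for key, value in executor_conf.items():
    submit_args += ["--conf", f"{key}={value}"]
submit_args.append("ngram_job.py")

# Aggregate resources requested across all executors.
total_cores = executor_conf["spark.executor.instances"] * executor_conf["spark.executor.cores"]
memory_per_executor_gb = int(executor_conf["spark.executor.memory"].rstrip("g"))
total_memory_gb = executor_conf["spark.executor.instances"] * memory_per_executor_gb

print(" ".join(submit_args))
print(total_cores, total_memory_gb)
```

At full capacity this requests 250 vCores and 750 GB of executor memory in total.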
First, we transform the XML dataset to Parquet files. This process, which is described in Listing 1.1, took 9 min using the maximum capacity of the cluster. We then benchmark data cleaning and N-gram extraction, starting with 3 executors and increasing to 5, 10, 15, 20, 30, 40, and finally the maximum capacity of 50 executors. We use the 3-gram application for this benchmark and apply the best setting to the other N-gram applications when comparing the RDD and SQL APIs. Figure 2 shows the run time of the 3-gram application for the different numbers of executors. Given that the configuration with 50 executors has the lowest run time, we run the 1- to 5-gram applications using this setting with both the RDD and SQL APIs.
Fig. 2. Spark run time performance for 3-gram.
As we can see in Fig. 3, the RDD API outperforms the SQL API only when processing 1-grams. In all other applications, the SQL API is faster by a large margin. This is mostly due to the SQL optimizer, which can leverage Parquet metadata as well as apply multiple optimization procedures when executing the DAG. The total run time for generating N-gram tokens of size 1 to 5 is 58.7 min using the Spark SQL API, whereas the single-machine solution proposed in [11] takes 35 h in total.
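The two totals quoted above can be turned into a rough speedup figure with a few lines of arithmetic. This is a back-of-the-envelope sketch only: the two systems run on different hardware, so the ratio describes end-to-end wall-clock time rather than a like-for-like comparison. The variable names are illustrative.

```python
single_machine_minutes = 35 * 60   # 35 h reported for the single-machine solution [11]
spark_sql_minutes = 58.7           # total for 1- to 5-grams with the Spark SQL API

speedup = single_machine_minutes / spark_sql_minutes
time_saved_fraction = 1 - spark_sql_minutes / single_machine_minutes

print(f"speedup: {speedup:.1f}x, run time reduced by {time_saved_fraction:.1%}")
```

Under these figures the distributed pipeline is roughly 36 times faster, which corresponds to cutting the total run time by about 97%.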
Fig. 3. Spark RDD and SQL run time performance.
5 Conclusion
Text data is a fundamental requirement for natural language processing models. Neural network-based language models use an extensive amount of text data to extract hidden structures and dependencies in language. As the scale and variety of text data increase, there is a demand to construct pre-processed text datasets with features such as N-grams in large volumes. In this work, we introduced the Apache Spark framework to build the N-gram corpus over an archived version of English Wikipedia. We described the three main stages of the data pipeline: converting XML data to Parquet files, cleaning datasets, and extracting N-grams. The total run time in our work is 58.7 min, compared to the 35-hour run time of the latest study in generating N-grams. In future studies, we will use the Spark framework to develop data pipelines that extract more features from text documents, and to perform model training in this distributed environment in order to improve data processing and training time.

Acknowledgment. This research paper is based on works that are supported by the National Science Foundation Grant No. 1625677.
References

1. Esmaeilzadeh, A., Taghva, K.: Text classification using neural network language model (NNLM) and BERT: an empirical comparison. In: Arai, K. (ed.) IntelliSys 2021. LNNS, vol. 296, pp. 175–189. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-82199-9_12
2. Heidari, M., Jones, J.: Using BERT to extract topic-independent sentiment features for social media bot detection. In: 2020 11th IEEE Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON), pp. 542–547 (2020)
3. Heidari, M., Jones, J., Uzuner, O.: Deep contextualized word embedding for text-based online user profiling to detect social bots on Twitter. In: 2020 International Conference On Data Mining Workshops (ICDMW), pp. 480–487 (2020)
4. Heidari, M., et al.: BERT model for fake news detection based on social bot activities in the COVID-19 pandemic. In: 2021 12th IEEE Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON) (2021)
5. Heidari, M., Jones, J., Uzuner, O.: An empirical study of machine learning algorithms for social media bot detection. In: 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pp. 1–5 (2021)
6. Kambar, M., Nahed, P., Cacho, J., Lee, G., Cummings, J., Taghva, K.: Clinical text classification of Alzheimer’s drugs’ mechanism of action. In: Xin-She Yang Simon Sherratt Nilanjan Dey, p. 513
7. Esmaeilzadeh, A.: A test driven approach to develop web-based machine learning applications. In: Digital Scholarship@UNLV (2017). https://doi.org/10.34917/11889688
8. All Our N-gram are Belong to You. https://ai.googleblog.com/2006/08/all-our-ngram-are-belong-to-you.html. Accessed 10 Jan 2022
9. Hajibabaee, P., et al.: Offensive language detection on social media based on text classification. In: 2022 IEEE 12th Annual Computing And Communication Workshop And Conference (CCWC) (2022)
10. N-grams Data. https://www.ngrams.info/compare.asp. Accessed 10 Jan 2022
11. Cacho, J.R.F., Cisneros, B., Taghva, K.: Building a Wikipedia N-GRAM Corpus. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1251, pp. 277–294. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55187-2_23
12. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
13. Abdolazimi, R., Heidari, M., Esmaeilzadeh, A., Naderi, H.: MapReduce preprocess of big graphs for rapid connected components detection. In: 2022 IEEE 12th Annual Computing And Communication Workshop And Conference (CCWC) (2022)
14. Zaharia, M., et al.: Apache spark: a unified engine for big data processing. Commun. ACM 59, 56–65 (2016)
15. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pp. 15–28 (2012)
16. SQL Reference - Spark 3.2.0 Documentation. https://spark.apache.org/docs/latest/sql-ref.html. Accessed 10 Jan 2022
17. Cluster Mode Overview - Spark 3.2.0 Documentation. http://spark.apache.org/docs/latest/cluster-overview.html. Accessed 10 Jan 2022
18. Hindman, B., et al.: Mesos: a platform for fine-grained resource sharing in the data center. NSDI 11, 22–22 (2011)
19. Apache Parquet. https://parquet.apache.org/documentation/latest/. Accessed 10 Jan 2022
20. Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–10 (2010)
21. MavenRepository - Databricks Spark XML Package. https://mvnrepository.com/artifact/com.databricks/spark-xml_2.10/0.2.0. Accessed 10 Jan 2022
22. GitHub - Apache Spark NGram Source Code. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/NGram.scala. Accessed 10 Jan 2022
23. Configuration - Spark 3.2.0 Documentation. https://spark.apache.org/docs/latest/configuration.html. Accessed 10 Jan 2022
Selecting NLP Classification Techniques to Better Understand Causes of Mass Killings

Abigail Sticha and Paul Brenner

University of Notre Dame, Notre Dame, IN 46556, USA
{asticha,paul.r.brenner}@nd.edu

Abstract. We perform an analysis of SVM, BERT, and Longformer NLP tools as applied to large volumes of unclassified news articles given small volumes of labeled news articles for training. Analysis of the target machine learning tools is performed through a case study of global trigger events; specifically triggers of state-led mass killings. The goal of the case study is to draw relationships from the millions of machine classified articles to identify trends for the prediction and prevention of future mass killing events. In this paper we focus on the classification of one specific trigger, coups, in order to glean insight into the accuracy and complexity of our SVM, BERT, and Longformer models. This study centers on classifying which news articles contain uniquely defined coup events and the temporal placement of those articles. Our performance analysis centers on the comparison of multiple accuracy metrics as applied to specific subsets of the corpus. We also demonstrate that raw accuracy scores are insufficient to fully understand the quality of classification required for specific target use cases.

Keywords: Machine learning · Natural language processing · Neural networks · BERT · Longformer · Support vector machines · Event classification · Accuracy metrics
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 685–700, 2022. https://doi.org/10.1007/978-3-031-10464-0_46

1 Introduction

In an increasingly interconnected world, identifying the complex causes of global crises, and effective responses to them, requires leveraging best-in-class data analytics techniques on rapidly growing, but often poorly structured, data. While the tools available for data science continue to evolve, there remain significant challenges for research teams trying to wrangle insight from the large and complex data accessible to them. The processes of reviewing, labeling, and classifying massive amounts of information take extensive time, money, and human power. Fortunately, natural language processing (NLP) and related evolving machine learning tools can be harnessed to gain insight and classify text data in order to answer these critical questions. Through the application of machine learning models to our case study of global events leading to crises, specifically state-led mass killings, this paper provides three important contributions to this evolving NLP research.

First, we provide a clear, efficient, and accessible machine learning framework that future social scientists may utilize when implementing NLP-focused algorithms to classify large quantities of text documents given a relatively small quantity of labeled training data. Due to the growing availability of both open source and commercial NLP tools, those using machine learning within specific domains face the temptation to use machine learning algorithms as ‘black box’ tools in which data is blindly input and tool parameters are tweaked to obtain the highest accuracy score. This approach can lead to global inefficiencies at best and unexpected misclassifications at worst. Therefore, the clarity of the framework that we outline, which explains each step of the process and helps shed light on the classification by including essential visualizations of the input data, is important to aid the implementation of machine learning in specific domains. Furthermore, the framework is presented in such a way that fellow data scientists can implement different machine learning algorithms, such as support vector machines (SVMs) or bidirectional encoder representations from transformers (BERT). Since different machine learning algorithms suit different optimization problems, it is important that our framework is structured for the implementation of various algorithms.

Second, we perform a comparative analysis of three different machine learning tools that we implemented in the context of our proposed framework. More specifically, we compare multiple accuracy metrics of SVM, BERT, and Longformer (a BERT-based algorithm designed for longer inputs) models in order to glean insights into each of these models. To validate our initial framework and compare each model, we focused on a case study to better understand global crises where event classification solely by human readers is not feasible due to the extremely large corpus size. This is a valuable comparison since there has been minimal analysis comparing traditional machine learning algorithms, such as SVMs, with more advanced state-of-the-art neural network models in cases where it is necessary to classify a large corpus of long text documents given a small quantity of training documents.
Third, we highlight the discrepancies between different accuracy metrics of each individual model to demonstrate that raw accuracy scores are insufficient to fully understand the quality of classification required for specific target use cases. Oftentimes, performance metrics are used as the sole method for comparison of machine learning models. This can be seen in the push within the machine learning community to produce state-of-the-art models that produce the highest accuracy scores on benchmarks. Our comparative analysis reminds researchers that although an algorithm might produce high accuracy scores, the model may not be optimal for the task at hand.

1.1 Triggers of Mass Killings (ToMK)
The overall goal of the target research application is to utilize statistical and NLP tools for a systematic analysis of triggers of state-led mass killings (ToMK). Peace and conflict researchers have identified several large-scale structural conditions that make state-led mass killings more likely, such as political instability, a history of violence against vulnerable groups, radical political ideologies, and autocratic or anocratic (i.e. authoritarian) governments [34]. However, the timing of mass killing onset is less understood. Analysts have identified several plausible trigger-type events that political elites may perceive as threatening their power, including coups, assassinations, protests/riots, armed conflict escalation, cancelled elections, neighboring conflict spillover, and the like, but little systematic analysis beyond specific country case studies has been conducted to examine whether, and if so when, these events actually trigger killings. This project canvasses a large number of news sources to identify and examine the occurrence of nine trigger-type events across all countries from 1989–2017, and analyzes under what conditions, and in what patterns and sequences, certain trigger-type events increase the probability of mass killings (Fig. 1). It seeks to bring greater specificity and understanding about the timing of state-led mass killings, which is of interest to both scholars and peacebuilders.

Fig. 1. Potential triggers and structural conditions for state-led mass killings.

This paper focuses specifically on the coup/attempted coup trigger in order to illustrate the proposed framework and compare the three models. Once the complexity and accuracy of the models as applied to the coup trigger are analyzed, it will be possible to scale these models to the remaining 8 trigger events and subsequently analyze the causal relationships between triggers. The coup data consists of the English-translated text of news articles retrieved via LexisNexis queries based on several search parameters, which include a date filter from 1989–2017, a source filter for our list of 20 sources, and keywords including the word “coup” or related terms. The maximal corpus that we hope to classify is estimated to contain 1.8 million articles across all countries. Fig. 2 compares this estimate of the total articles for the coup trigger with those for other triggers, such as protests and assassinations. The articles classified for this paper are a subset of this corpus (“select countries corpus”) that contains the 69 countries with identified state-led mass killing events between 1989–2017. This “select countries corpus” contains 647,989 articles. In order to train the algorithms, we used a training set (“training corpus”) consisting of 551 articles retrieved in the same manner and labeled by a team of researchers trained to identify articles that qualify as a coup event.
Fig. 2. Article counts for the years between 1989 and 2017. Countries illustrated comprise at least 1% of the total articles pulled down across each trigger.
2 Related Work

2.1 Overview of Machine Learning Tools: SVMs and Neural Networks
There are many traditional machine learning tools that can be used for text event classification. Some examples include SVMs, K-nearest neighbor (KNN) algorithms, Decision Trees, and Naive Bayes. Each of these algorithms is optimal for different types of classification. From these traditional tools, we selected SVMs as the baseline for our project based on background research showing that SVMs often outperform other text machine learning tools due to their “simple structure, complete theory, high adaptability, global optimization, short training time, and good generalization performance” [12, 19, 21, 36].

There are a variety of deep neural network architectures and algorithms, most of which use gradient descent to train layered networks that can learn hard perceptual problems. Some common architectures include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) models. Whereas SVMs are equipped to train on smaller data sets [9, 11, 40], these deep neural network models require large training sets. In the early stages of the ToMK project, we implemented each of these deep neural network structures on our data: a CNN, an RNN, and an LSTM. Even after hyperparameter tuning, these models only reached accuracy scores of 66.4%, 76.3%, and 61.8%, respectively, with validation losses tending to stagnate above 50% or rise dramatically while training each of the neural networks. Since it was clear that these deep neural networks were overfitting and did not perform well on our data, as expected with our small 200–600 document training set, we shifted to pre-trained BERT-based models, which are introduced next and used throughout the rest of the paper.

Newer NLP neural network tools include word embedding tools such as word2vec [23] and transformers [33] such as the Bidirectional Encoder Representations from Transformers, or BERT [8]. BERT, a transformer-based NLP tool released by Google in 2018, was designed to pre-train deep bidirectional representations from unlabeled text through masked language modeling and next sentence prediction tasks. The model can be fine-tuned using labeled text for different downstream NLP tasks, such as classification [13]. Since this is such a powerful and efficient model, many variants of BERT have been created, which can be viewed in the huggingface library [35]. Each variant was created for a different purpose, whether greater efficiency or greater specificity to a certain domain. In this study we focus on (1) BERT-base, a smaller version of the BERT model released by Google, which we refer to as our ‘BERT model’ for simplicity, and (2) Longformer, a BERT-based model that handles longer inputs by combining local sliding-window attention with global attention on selected tokens, so that information from all the tokens of a document can be captured [4]. Due to the techniques used to train BERT, each document can only be represented by up to 512 tokens, which is significantly lower than the average token count of the articles in our data set. This increases the risk of losing valuable information from our articles, so we hoped that leveraging the Longformer model would recover this information.

2.2 Use of SVMs and Neural Networks for Classification in Related Case Studies

In order to understand the effectiveness of SVMs and pre-trained BERT models for classifying text, it is useful to highlight their application in parallel domains. Donaldson et al. present an application of SVMs to classify biology literature to determine whether documents reference protein-protein interactions, and subsequently present the data through the Biomolecular Interaction Network Database (BIND) [10]. Wright et al. [36] use SVMs to identify diabetes-specific electronic health records in order to assist clinicians in making more informed decisions. Similar applications of SVMs to classify clinical notes have been implemented by Sohn and Savova [30] to identify the smoking status of patients, and by Carroll et al. to identify patients with rheumatoid arthritis [7]. Nii et al. implemented SVMs to reduce the resources needed to evaluate nursing-care data and identify well-written nursing-care notes [25, 26]. SVMs have also found application in text classification for the social sciences. For example, in research done by Greevy and Smeaton, SVMs were utilized to classify web pages according to whether or not they are racist [15].

Much of the literature related to pre-trained BERT models consists of publications that report on new BERT-based models that perform well on various benchmark data. Nevertheless, there are several case study implementations of BERT-based models in different domains. For example, Mozafari et al. implemented several pre-trained BERT-based models to detect racism, sexism, hate, or offensive content on Twitter [24]. This study chose to implement a pre-trained BERT-based model due to its relatively small number of training articles and the complexity of hate speech. Another study used pre-trained BERT models to qualitatively classify posts from a German parent counseling site to better understand how successful online counselling works [14]. Both of these studies are text classification problems that focus on classifying text input that is much shorter than the news articles of focus for the ToMK project. There have been a few papers that implement BERT-based variants that are meant to handle longer text. An example of this is a multi-label document classification study that implemented pre-trained BERT-based models to label COVID-related published medical documents [16]. It is worth noting that the above studies used training data sets with 15,817 inputs, 10,000 inputs, and 23,000 inputs, respectively, all of which are significantly larger than the training corpus for our ToMK project.

2.3 Comparison of SVMs and NNs
Several papers have compared SVMs and “trained-from-scratch” neural network models. Zaghloul et al. found that the performance of a basic neural network is statistically comparable to that of SVMs for document classification [39]. Alternatively, Sripriya et al. found SVMs to outperform a basic feed-forward neural network in all applications, and Adhikari et al. encourage researchers to rethink the increased application of complex neural networks for document classification due to their complex architectures, which are “more difficult to train, more sensitive to hyperparameters, and brittle with respect to domains with different data characteristics” [1, 31]. The latter study suggests using basic neural network structures that do not use attention mechanisms, such as simple bidirectional LSTMs, or an SVM model.

Other studies compare SVMs to the recent line of research in deep language representation models. González-Carvajal and Merchán argue that BERT should be used as the default model for NLP-related tasks, as opposed to traditional tf-idf models, due to its dynamic capabilities, ease of implementation, and higher accuracy on the IMDB benchmark data set [13]. Another study argues the opposite by explaining that “little is known about the exact mechanisms that contribute to BERT’s success” and also provides results that highlight BERT’s tendency to be overparameterized for downstream tasks [17, 18]. Domain-specific cases are an additional resource for comparisons between BERT-based models and SVMs. Both the psychosocial online counseling [14] and COVID-19 [16] case studies use SVMs as a baseline when evaluating BERT-based models. The psychosocial case found that a German language BERT-based model outperformed SVMs, and the COVID-19 case found that the BERT-based models BioBERT and Longformer outperformed the SVM model.

In this paper we contribute to the growing literature comparing BERT-based neural network models and SVMs. Our case study is unique in that the ToMK training set consists of a smaller set of inputs with longer text length than in the previously discussed literature. In order to perform our analysis we rely on metrics suggested by Xia et al., who highlight important considerations to track when deciding which pre-trained encoding model to implement. The areas of consideration that this paper specifically points to include the pre-training process of the encoder on the front end, the efficiency of the model as measured by time and memory usage, the quality of the pre-training data, and the interpretability of the resulting predictions provided by the model [38].
Fig. 3. MLT framework for event recognition and article classification.
3 ToMK Classification Framework

The following machine learning framework was applied to our ToMK project but can also be implemented by future social scientists looking to use NLP-focused algorithms to classify large quantities of text documents given a relatively small quantity of labeled training data. An overview of our machine coding framework is presented in Fig. 3. The overall framework is split into two phases: the development phase and the production phase. The development phase involves training, testing, and iteratively tuning the machine learning algorithm. This allows the model to ‘learn’ the patterns in the data that separate a positive instance of a potential trigger from a non-trigger. Once a model is sufficiently optimized, we classify our larger, unlabeled data set in the production phase.

The SVM workflow scripts that we utilized to develop our inference engine were initially modeled on a concise text classification example written by Gungit Bedi [3]. This example utilizes various features contained in the scikit-learn library [28] and the Natural Language Toolkit, or NLTK [5]. The BERT and Longformer models, based on a tutorial provided by Venelin Valkov [32], use PyTorch [27] and pre-trained models from the huggingface library [35]. Our primary functional codes will be released publicly on GitHub. The following sections provide a brief explanation of each step in the framework and compare how each step applies to the different scripts, when applicable. Note that since the BERT and Longformer algorithms are both BERT-based algorithms, they require the same steps aside from the actual pre-trained kernel that is imported.

3.1 Visualize Corpus, Preprocess Text, Encode Labels, and Extract Features

Robust visualization of the data can aid in understanding and illustrating the textual relationships from which the machine learning algorithms will produce insights. Distributions of the word counts for labeled training articles by class are pictured in Fig. 4. This was a particularly important visualization as it highlighted our need to leverage the Longformer variant of BERT in order to account for the high percentage of articles over the 512 token limit. In the following discussion section we introduce several more involved methods of visualizing the corpus, which aided in cleaning the data and evaluating the models.
Fig. 4. Distribution of training article word counts by class for the coup trigger.
Preprocessing, encoding, and feature extraction details are provided for reproducibility. Where the SVM and BERT-based models differ, they are described separately.

The preprocessing step of the framework for the SVM model involves several substeps. First, the script removes blank rows from the corpus. Next, all letters in the raw text are changed to lowercase. The script then tokenizes the words in the corpus by taking the string of text in each entry and splitting it into substrings composed of a single word or symbol. Each token is checked to make sure that all of its characters are alphabetic. Additionally, ‘stopwords’ are removed and lemmatization is performed.

Once the text is preprocessed, the final text is transformed into a numerical vector that can be understood and utilized by the SVM algorithm. The tf-idf vectorizer builds a vocabulary that only considers the top 5000 features based on term frequency across the corpus. We then fit the initialized tf-idf vectorizer. Finally, the vectorizer transforms the articles in both the testing set and the training set into a tf-idf-weighted document-term sparse matrix of size (n articles, m features). Within the matrix, a higher tf-idf value denotes a stronger relationship between a term and the document in which it appears [20].

The BERT-based preprocessing begins similarly, as the data is imported and null values are dropped. Conveniently, the HuggingFace library provides tokenizers for each model, which contain methods that we can call to pre-process the text. Under the hood, these tokenizers perform pre-processing steps similar to the manual steps in the SVM script: the tokenizer lowercases all words and decomposes the input into individual words. Importantly, for BERT-based models a maximum token length must be set, which the tokenizer uses to truncate long inputs to this maximum length. For the BERT model the maximum length allowed is 512 tokens, and for Longformer this length is 4,096.
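As a rough illustration of the tf-idf weighting used on the SVM side, the following plain-Python sketch lowercases the text, keeps alphabetic tokens, restricts the vocabulary to the most frequent terms, and builds one weighted row per document. It is a simplified stand-in rather than the authors' script: stopword removal and lemmatization are omitted, the smoothed idf and L2 normalization follow what we understand to be scikit-learn's default `TfidfVectorizer` behaviour, and the function names and the three toy articles are assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only purely alphabetic tokens."""
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_matrix(documents, max_features=5000):
    """Return (vocabulary, rows) with one L2-normalised tf-idf row per document."""
    tokenized = [tokenize(doc) for doc in documents]

    # Keep the max_features most frequent terms across the whole corpus.
    corpus_counts = Counter(tok for doc in tokenized for tok in doc)
    vocabulary = sorted(term for term, _ in corpus_counts.most_common(max_features))

    # Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1.
    n_docs = len(tokenized)
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocabulary, rows


# Three made-up one-line "articles", for illustration only.
articles = [
    "the army announced a coup",
    "the army held a parade",
    "the election was held",
]
vocab, matrix = tfidf_matrix(articles)
coup_col, the_col = vocab.index("coup"), vocab.index("the")
print(round(matrix[0][coup_col], 3), round(matrix[0][the_col], 3))
```

In the first article, the term "coup" (which appears in only one document) receives a higher weight than "the" (which appears in all three), which is the behaviour the tf-idf weighting is meant to produce.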
Since the BERT model uses the original text data to gain an understanding of long-term dependencies between words in the input, vectorizing with tf-idf is unnecessary. Rather, the tokenizer simply transforms the tokens to their corresponding integer IDs, which are later passed into the BERT model. Several special tokens are added to each input: [SEP], which is used to separate each input; [CLS], which indicates that a classification task is being performed; [PAD], which is added to the end of the inputs to make each entry the same length; and [UNK], which is used for any token that does not correspond to an integer ID. The rest of the tokens are integer IDs given to each word based on the WordPiece embeddings vocabulary. These input IDs are passed to each of the BERT-based models along with an attention mask, a binary tensor that contains 1 for every token that has a meaning and 0 for every padding token.
3.2 Machine Learning Model
During this step of the framework, we implement each of the different NLP algorithms that we are testing: the SVM, BERT, and Longformer models. Each of these models is trained to learn the difference between the ‘positive’ and the ‘negative’ groups, which allows the model to classify unlabeled data. In the training step, the SVM places the document vectors and their corresponding encoded labels in a multidimensional space. The algorithm then identifies the hyperplane that separates the two classes with the maximal separating margin [37]. This means that the SVM chooses the hyperplane that is furthest from the closest document vectors of both the ‘positive’ and the ‘negative’ classes. The BERT model, on the other hand, is more abstract. Both the BERT and Longformer models were pre-trained on masked language modeling and next sentence prediction tasks using 3.3 billion words in total, with 2.5B from Wikipedia and 0.8B from BooksCorpus [8]. The weights learned from this pre-training were downloaded from the HuggingFace library.
Specifically, the BERT and Longformer weights came from the pretrained models named ‘bert-base-uncased’ and ‘allenai/longformer-base-4096’ on the HuggingFace library, respectively. We then added a dropout layer and a final linear layer for classification to each of these models.
3.3 Hyperparameter Tuning
It is often beneficial to tune the hyperparameters of the models. For the SVM, we set values such as the training percentage (80%), C-value (1), and kernel type (linear). For the BERT model, common hyperparameters to tune include the loss function, optimization algorithm, maximum sequence length, batch size, learning rate, and number of epochs. We closely followed the original BERT hyperparameters in our script. More specifically, we implemented sparse categorical cross-entropy as the loss function, adaptive moment estimation (ADAM) as the optimization algorithm, a batch size of 6, a learning rate of 2e−5, and 50 epochs. A maximum token length of 512 was used for the BERT model, as this is the highest possible value for BERT, and a length of 1,250 was used for the Longformer model. We hope to implement hyperparameter grid searches such as those described in [29] to further tune our BERT and Longformer models in the future.
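The effect of the maximum token length on the encoded inputs (special tokens, truncation, padding, and the attention mask from Sect. 3.1) can be illustrated with a toy encoder. This is a sketch under stated assumptions: the vocabulary and its integer IDs are hypothetical, whole words stand in for WordPiece subwords, and the `encode` helper is not a HuggingFace function.

```python
# Toy vocabulary (hypothetical IDs; real WordPiece vocabularies are far larger).
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "army": 4, "seized": 5, "power": 6}

def encode(words, max_length):
    """Map words to IDs, add [CLS]/[SEP], truncate, pad, and build the mask."""
    # Reserve two positions for the [CLS] and [SEP] special tokens.
    body = [VOCAB.get(w, VOCAB["[UNK]"]) for w in words[: max_length - 2]]
    ids = [VOCAB["[CLS]"]] + body + [VOCAB["[SEP]"]]
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_length - len(ids)
    ids += [VOCAB["[PAD]"]] * padding  # pad every entry to the same length
    mask += [0] * padding              # 0 marks a padding token
    return ids, mask

# A short input is padded; a long input is truncated; "coup" is out of vocabulary.
short_ids, short_mask = encode(["army", "seized", "power"], max_length=8)
long_ids, long_mask = encode(["army", "coup", "seized", "power", "army"], max_length=6)
```

With the settings reported above, `max_length` would be 512 for BERT and 1,250 for Longformer; the small values here only keep the example readable.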
Table 1. Accuracy comparison of SVM, BERT, and Longformer models
Metric                                             SVM     BERT    Longformer
Accuracy score on test data                        97.21   97.59   91.67
Precision on test data                             99.3    98.5    89.5
Accuracy score on subset of human validation data  78.93   78.05   77.56
3.4 Production Phase
The steps within the production phase that lead to classification results are the same as those described for the development stage, except for label encoding, since this data is unlabeled. For the ToMK project, we have successfully implemented the production phase on the “select countries corpus” for the coup trigger. We used the trained SVM, BERT, and Longformer kernels to classify the 647,989 unlabeled coup articles. To visualize this data, we created a timeline that plots the positive instances of each trigger as classified by each of the SVM, BERT, and Longformer kernels. The visualization (Fig. 5) highlights our preference for Type II errors over Type I errors (the high false positive rates from BERT and Longformer are problematic) and allows us to quickly identify differences in how the kernels classify articles. The full visualization, containing all 69 countries of the “select countries corpus”, is available in HTML format; it highlights the global scale of this project and allows the user to hover over an article to identify which combination of kernels classified it as a “yes”.
Fig. 5. Timeline of extracted articles describing coup events for a subset of countries in the corpus as classified by the SVM, BERT, and longformer Kernels.
4 Comparative Analysis
4.1 Accuracy of Machine Learning Models
Accuracy Score on Test Data - For basic accuracy scores we compared the predicted labels on the test data for each model to their correct labels. The raw model accuracy is shown in row 1 of Table 1.
Precision on Test Data - In addition to accuracy scores, we produced confusion matrices and classification reports, which output precision, recall, and F1 scores. The most important metric from the classification report for our ToMK study is precision, because of the problem of false positives and the deceptive appearance of constant coup events occurring in each country across the 1989–2017 time period shown in Fig. 5. Maximizing precision minimizes false positive errors. The precision of each model is shown in row 2 of Table 1.
Accuracy Score on Subset of Human Validation Data - A subset of the classification results was validated/coded by the political science researchers. There were 622 articles in this subset, 15 labeled “yes” by the human coders and 607 labeled “no.” We compared these labels to the labels that the SVM, BERT, and Longformer models gave to these same 622 articles. The resulting accuracy percentages are given in row 3 of Table 1.
4.2 Number of Articles Classified as a Coup Event
The total number of “yes” articles, or articles classified as a coup event, was calculated for each model as another way to look at minimizing false positives. The totals were 28,552 for SVM, 84,871 for BERT, and 73,580 for Longformer. Evidently, the SVM outperformed BERT and Longformer in terms of refraining from over-classifying articles as positive coups. After plotting various combinations of coup events as classified by the models, we found that the number of “yes” articles could be further decreased from 28,552 to 20,984 by plotting only the articles on which the SVM, BERT, and Longformer all agree on a positive classification, as opposed to focusing on the SVM-classified coup events (Fig. 6).
4.3 Similarity Percentage Between Models
In addition to statistical accuracies, it is also useful to analyze the similarities between our three models. Specifically, we focused on reporting the overlap of the “yes” labeled articles as shown in Fig. 7.
We found a 93.72% overlap between SVM and BERT, 59.04% between BERT and Longformer, and 77% between SVM and Longformer.
4.4 Resource Restraints
The SVM model showed no time or resource restraints. The BERT-based models, on the other hand, took 15 times longer to train than the SVM and required a GPU for training. Additionally, the batch size for both BERT-based models could not exceed 6 due to memory constraints.
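The comparisons in Sects. 4.1 to 4.3 rest on a few simple quantities, which the following sketch computes on made-up data. All labels and article IDs below are hypothetical, the function names are illustrative, and the overlap is computed as a Jaccard ratio, which is an assumption because the text does not state how the overlap percentages were defined.

```python
def accuracy(y_true, y_pred):
    """Share of articles whose predicted label matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision(y_true, y_pred, positive="yes"):
    """True positives divided by all articles predicted positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)

def overlap(a, b):
    """Jaccard overlap of two sets of article IDs (an assumed definition)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical reference and predicted labels for six articles.
y_true = ["yes", "no", "no", "no", "yes", "no"]
y_pred = ["yes", "yes", "no", "no", "no", "no"]
acc = accuracy(y_true, y_pred)
prec = precision(y_true, y_pred)

# Hypothetical sets of article IDs classified "yes" by each model.
svm_yes = {1, 2, 3}
bert_yes = {1, 2, 3, 4, 5}
longformer_yes = {2, 3, 5, 6}

# Keep only articles on which all three models agree (as in Sect. 4.2).
all_agree = svm_yes & bert_yes & longformer_yes
svm_bert_overlap = overlap(svm_yes, bert_yes)
```

A false positive lowers precision directly, which is why precision rather than accuracy is the quantity of interest when over-classification of “yes” articles is the main concern.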
Fig. 6. Comparison (using a subset of countries) of a timeline of coup events as classified by SVM versus a timeline of coup events as classified by all three models
Fig. 7. Overlap in model predictions of positively-classified coup articles
4.5 Interpretability
We used a dimensionality reduction algorithm, UMAP [22], to reduce each document vector to a 2-dimensional vector and plotted these vectors. In the resulting plot, the ‘positive’ and ‘negative’ articles are roughly clustered together (Fig. 8). The line added to the figure to separate these two clusters is a hypothetical representation of the SVM. This type of tangible representation is not available for the BERT-based models due to both their complexity and the pre-trained aspects of the models. Our team plans to utilize the comprehensive list of diagnostic properties for evaluating existing explainability techniques laid out by Atanasova et al. in order to choose the most adequate explainability techniques for our project [2].
Fig. 8. 2D Projection of the training set documents with an example SVM classification line.
5 Conclusion
Xia et al. encourage publishing limitations in order to better compare and interpret pretrained encoding models [38]. We therefore believe that it is imperative to highlight the limitations of both our models and our modes of comparison. First, although each of the three models reaches high accuracy and precision on test data, the models are missing key information, as shown by the over-classification of “yes” articles. Second, we should (and will) gain a larger sample of post-classification, human-validated true “yes” articles. Third, we plan a deeper analysis of the temporal relation of the classified events. Finally, pre-trained versions of BERT downloaded from the HuggingFace library [35] introduce a ‘black box’ approach, which can cause bias issues on downstream tasks as described by Bolukbasi et al. [6].
This project provides multiple insights for our NLP community. First and foremost, we have provided an enhanced solution and detailed framework for event classification in a unique case where a large text corpus must be classified given a small labeled training corpus with long text for each input. Second, an in-depth analysis of SVMs and two BERT-based models implemented in a domain specific application was carried out. Finally, important insights about the usefulness of certain metrics were discussed. Our goal is to continue refining the open source tools, best practices and publicly available documentation to aid researchers hoping to leverage NLP tools to classify texts to aid global understanding of crises such as mass killing events.
References
1. Adhikari, A., Ram, A., Tang, R., Lin, J.: Rethinking complex neural network architectures for document classification. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4046–4051 (2019)
2. Atanasova, P., Simonsen, J.G., Lioma, C., Augenstein, I.: A diagnostic study of explainability techniques for text classification. arXiv preprint arXiv:2009.13295 (2020)
3. Bedi, G.: Simple guide to text classification (NLP) using SVM and Naive Bayes with python. Medium, July 2019
4. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: the long-document transformer (2020)
5. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., Newton (2009)
6. Bolukbasi, T., Chang, K.W., Zou, J., Saligrama, V., Kalai, A.: Man is to computer programmer as woman is to homemaker? Debiasing word embeddings (2016)
7. Carroll, R.J., Eyler, A.E., Denny, J.C.: Naïve electronic health record phenotype identification for rheumatoid arthritis. In: AMIA Annual Symposium Proceedings, vol. 2011, p. 189. American Medical Informatics Association (2011)
8. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding (2019)
9. Díaz, I., Ranilla, J., Montañes, E., Fernández, J., Combarro, E.F.: Improving performance of text categorization by combining filtering and support vector machines. J. Am. Soc. Inf. Sci. Technol. 55(7), 579–592 (2004)
10. Donaldson, I., et al.: PreBIND and Textomy-mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinform. 4(1), 1–13 (2003)
11. Gao, Y., Sun, S.: An empirical evaluation of linear and nonlinear kernels for text classification using support vector machines. In: 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery, vol. 4, pp. 1502–1505. IEEE (2010)
12. Gayathri, K., Marimuthu, A.: Text document pre-processing with the KNN for classification using the SVM. In: 2013 7th International Conference on Intelligent Systems and Control (ISCO), pp. 453–457. IEEE (2013)
13. González-Carvajal, S., Garrido-Merchán, E.C.: Comparing BERT against traditional machine learning text classification. arXiv preprint arXiv:2005.13012 (2020)
14. Grandeit, P., Haberkern, C., Lang, M., Albrecht, J., Lehmann, R.: Using BERT for qualitative content analysis in psychosocial online counseling. In: Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science, pp. 11–23 (2020)
15. Greevy, E., Smeaton, A.F.: Classifying racist texts using a support vector machine. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 468–469 (2004)
16. Gutierrez, B.J., Zeng, J., Zhang, D., Zhang, P., Su, Y.: Document classification for COVID-19 literature. arXiv preprint arXiv:2006.13816 (2020)
17. Hao, Y., Dong, L., Wei, F., Xu, K.: Visualizing and understanding the effectiveness of BERT. arXiv preprint arXiv:1908.05620 (2019)
18. Kovaleva, O., Romanov, A., Rogers, A., Rumshisky, A.: Revealing the dark secrets of BERT. arXiv preprint arXiv:1908.08593 (2019)
19. Kwok, J.T.Y.: Automated text categorization using support vector machine. In: Proceedings of the International Conference on Neural Information Processing (ICONIP). Citeseer (1998)
20. Lilleberg, J., Zhu, Y., Zhang, Y.: Support vector machines and Word2vec for text classification with semantic features. In: 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), pp. 136–140. IEEE (2015)
21. Liu, Z., Lv, X., Liu, K., Shi, S.: Study on SVM compared with the other text classification methods. In: 2010 Second International Workshop on Education Technology and Computer Science, vol. 1, pp. 219–222. IEEE (2010)
22. McInnes, L., Healy, J., Melville, J.: UMAP: uniform manifold approximation and projection for dimension reduction (2020)
23. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality (2013)
24. Mozafari, M., Farahbakhsh, R., Crespi, N.: A BERT-based transfer learning approach for hate speech detection in online social media. In: Cherifi, H., Gaito, S., Mendes, J.F., Moro, E., Rocha, L.M. (eds.) COMPLEX NETWORKS 2019. SCI, vol. 881, pp. 928–940. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-36687-2_77
25. Nii, M., Ando, S., Takahashi, Y., Uchinuno, A., Sakashita, R.: Nursing-care freestyle text classification using support vector machines. In: 2007 IEEE International Conference on Granular Computing (GRC 2007), p. 665. IEEE (2007)
26. Nii, M., Ando, S., Takahashi, Y., Uchinuno, A., Sakashita, R.: Feature extraction from nursing-care texts for classification. In: 2008 World Automation Congress, pp. 1–6. IEEE (2008)
27. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32, pp. 8024–8035. Curran Associates Inc. (2019)
28. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(Oct), 2825–2830 (2011)
29. Quijano, A.J., Nguyen, S., Ordonez, J.: Grid search hyperparameter benchmarking of BERT, ALBERT, and LongFormer on DuoRC. arXiv preprint arXiv:2101.06326 (2021)
30. Sohn, S., Savova, G.K.: Mayo clinic smoking status classification system: extensions and improvements. In: AMIA Annual Symposium Proceedings, vol. 2009, p. 619. American Medical Informatics Association (2009)
31. Sripriya, J., Samundeeswari, E.S.: Comparison of neural networks and support vector machines using PCA and ICA for feature reduction. Int. J. Comput. Appl. 40(16), 31–36 (2012)
32. Valkov, V.: Text classification - sentiment analysis with BERT using Hugging Face, PyTorch and python tutorial. YouTube, April 2020
33. Vaswani, A., et al.: Attention is all you need (2017)
34. Verdeja, E.: Predicting genocide and mass atrocities. Genocide Stud. Prev. Int. J. 9(3), 5 (2016)
35. Wolf, T., et al.: Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, October 2020, pp. 38–45. Association for Computational Linguistics (2020)
36. Wright, A., McCoy, A.B., Henkin, S., Kale, A., Sittig, D.F.: Use of a support vector machine for categorizing free-text notes: assessment of accuracy across two institutions. J. Am. Med. Inform. Assoc. 20(5), 887–890 (2013)
37. Noble, W.S.: What is a support vector machine. Nat. Biotechnol. 25, 1565–1567 (2006)
38. Xia, P., Wu, S., Van Durme, B.: Which *BERT? A survey organizing contextualized encoders. arXiv preprint arXiv:2010.00854 (2020)
39. Zaghloul, W., Lee, S.M., Trimi, S.: Text classification: neural networks vs support vector machines. Ind. Manag. Data Syst. 109, 708–717 (2009)
40. Zhang, W., Yoshida, T., Tang, X.: Text classification based on multi-word with support vector machine. Knowl.-Based Syst. 21(8), 879–886 (2008)
Sentiment Analysis on Citizenship Amendment Act of India 2019 Using Twitter Data
Shreya Vaghasia and Kalpdrum Passi
Laurentian University, Sudbury, ON, Canada
{svaghasia,kpassi}@laurentian.ca
Abstract. Social media is widely used to follow the latest news and events occurring worldwide, and people’s reactions to them take the form of raw natural-language data in many languages and contexts. These written views contain irregular content such as sensitive information, slang, and non-standard words. Opinion mining on such data can inform strategic decisions in future markets. Natural Language Processing (NLP) and data mining techniques are used for sentiment analysis of this structured and unbalanced data. This study focuses on Twitter data on the Citizenship Amendment Act (CAA) of India, 2019, to detect the sentiment of views from people all over the world using machine learning techniques. Many people shared their opinions and views about the new CAA rules during that period. After the data is cleaned and analyzed using NLP techniques, VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment polarity is calculated. The dataset is normalized for use by machine learning algorithms and prepared using natural language techniques such as word tokenization, stemming and lemmatization, and Part-of-Speech (POS) tagging. All input variables are converted into vectors by Term Frequency-Inverse Document Frequency (TF-IDF). The method is implemented in the Python programming language. Evaluation parameters such as accuracy, precision, recall, and F1-score were obtained for Naïve Bayes, SVM (Support Vector Machine), K-Nearest Neighbor, Neural Network, Logistic Regression, Random Forest, and LSTM (Long Short-Term Memory) based RNN (Recurrent Neural Network). Finally, the results are compared. A one-way Analysis of Variance (ANOVA) test was performed on the mean values of the performance metrics across all the methods.
Keywords: Natural Language Processing (NLP) · Twitter · Sentiment analysis · Citizenship amendment act · Naïve Bayes · SVM · Random Forest · KNN · Python · Machine learning · Deep learning
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 701–717, 2022. https://doi.org/10.1007/978-3-031-10464-0_47
1 Introduction
With the development of connectivity and connected devices, there has been a drastic rise in the usage of social media. Nowadays, social media is used not only to share one’s status but also to publish one’s thoughts, beliefs, and opinions. Billions of users use platforms like Twitter and Facebook to share their opinion on a particular subject
or matter. It could be a judgment, a political bill, a social crime, or even a fun fact. In short, these social media platforms have developed a framework that allows opinions or thoughts to be showcased on a larger platform. Before the evolution of social media, opinions and thoughts on a particular subject had limited exposure, but this limitation has gradually vanished with the development of social media platforms. The World Wide Web (WWW) was once just a repository of static information in which one could simply search for stored information. That is no longer the case, as the WWW has become more dynamic and interactive. Now users can voice their opinions, make recommendations, or share their thoughts. This dynamism is not limited to a particular area or region; rather, it has expanded across the globe. Content in which users express opinions or thoughts on social media is referred to as user-generated content: content created by the users of the platform rather than by a brand or an organization. Social media platforms are based on social networking sites, which provide web-based services that let users construct a profile on a well-defined and structured platform. This profile can be hidden, private, or public. These platforms enable users to develop connections with other users. As a result, users can view the connections they have made with other users and vice versa. Connections are the underlying basis of any social media platform, so an opinion or sentiment posted on a platform is likely to affect, first, the users who are connected to the user who posted it. However, these connections are not standardized, and their properties may differ from platform to platform.
A lot depends on the security settings a user has applied to his or her profile. Today, a celebrity can share a comment on social media and develop an opinion that may be endorsed, rejected, or debated by millions of followers who are part of the same network, and the result may be the creation of a trend. This is how social media converts a sentiment into an opinion and then into a trend. The trend can be positive or negative, and it is often subject to further debate and argument. Moreover, owing to these trends, social media platforms have become a source of news as well. Most social media platforms exhibit the same kind of characteristics, even though they may be functionally different. For this research, we focus on Twitter. According to data shared by the agency Omnicore, more than 300 million users use Twitter [1], and 42% of these users visit the platform regularly. The top countries by Twitter use are Japan, followed by India and Brazil. Data shared by Pew Research [2] showed that 313 million users actively used Twitter across the globe. Even though Twitter is a free messaging platform, it has a limit of 280 characters per message. Within these 280 characters, one can share an opinion briefly and crisply, as close to the point as possible. This is one of the reasons Twitter is widely used by influencers: with its to-the-point messaging, it is easy to develop an opinion, and a resulting trend, quite rapidly. Many people initially thought that the 280-character limit would be a limitation that would affect the growth of Twitter as a social media platform. Instead, it lets one share an opinion without any story or lag, just being to the point. As a result, Twitter has often been used as a platform for sharing a quick opinion. For example, an opinion on a political
development, or even an opinion on some conflicts. It is also used for sharing updates and any kind of breaking news. Many companies have successfully used it for sharing product updates, for example, the release of a new patch. Moreover, companies also assess their presence through user opinions on Twitter.
1.1 Motivation
This research focuses on sentiment analysis through analysis of Twitter data. Since Twitter has widespread acceptance and is one of the most crucial platforms used by influencers, it was assumed to be the most appropriate platform for this research. As discussed in the last section, social media platforms including Facebook and Twitter have huge acceptance amongst users and have become part of their daily life. One can express an opinion with ease, with just a mobile phone in hand, and can become an influencer in a few minutes. The rapid rise in the number of social media users vouches for the popularity of these platforms. The generated Twitter data is capable of providing powerful insights, such as the sentiment of a section of a given population. Stakeholders can gain an important understanding of the target population. Here, a stakeholder could be anyone who has an interest in the target population. For example, a government may want to examine how the population of a particular part of the country has accepted some newly passed regulation. This aids better decision making, as decisions will be based on facts and data. For this research, the data was acquired from a data warehouse site that hosted data harvested from Twitter. The basis of data acquisition was topic modeling. In topic modeling [3], the tweets are scanned for certain phrases and data. For this research, the phrase was “CAA” (Citizenship Amendment Act, 2019, India). Data cleaning steps were then performed, in which useless words were removed and filtered.
The data remaining after noise removal is then analyzed, and it is determined whether each tweet reflects a negative, positive, or neutral sentiment. Overall, a consensus can be developed based on sentiment analysis. More details about the methodology are given in the research methodology section.
1.2 Research Objectives and Approach
The research has certain pre-determined objectives, as listed below.
• Use text mining-based algorithms to analyze Twitter data that is already extracted and available in a specified location.
• Use a rule-based lexicon approach that draws on word sentiments known from corpora such as SentiWordNet and WordNet.
• Visualize the data based on underlying sentiments, thus classifying and segregating the positive tweets and negative tweets, and develop a consensus based on the visualized data.
• Automate the data analysis with a Python program to the extent possible and limit manual analysis, which limits errors during the analysis.
To extract the underlying sentiments, the extracted tweets were analyzed with the help of a developed algorithm, implemented in Python. Python was selected for this research because of its known benefits for data science. Moreover, Python is easy to implement, without any known complexities. The text analysis uses a lexicon-based approach and a machine learning approach.
1.3 Methodology
In this research, two approaches are used, namely lexicon-based and machine learning. The text data was preprocessed using filtering methods from NLTK (Natural Language Toolkit) to remove noise from the dataset. The sentiments of words were then scored using VADER. Finally, the models are trained on the dataset. Figure 1 shows a flow diagram for the methodology.
[Figure: flow diagram with the stages Data Set, Data Preprocessing, Linguistic processing using NLP, Sentiment Analysis, Applying Machine Learning and Deep Learning Algorithm, and Result and Analysis]
Fig. 1. Flow diagram for Twitter analysis
2 Data and Preprocessing
2.1 Dataset and Variables
The Twitter data used for sentiment analysis concerns the “Citizenship Amendment Act 2019, India” with the hashtag “#CAA”. The #CAA data contains all tweets from 1 Nov 2019 to 29 Jan 2020 (90 days). It consists of 104,873 user tweets from around the globe on the promotion of the Citizenship Amendment Act rule. The tweets on the new Citizenship Amendment Act rules had a very good response from all over the world. The dataset available for the analysis contains approximately 150,000 (1.5 lakh) tweets for the #CAA hashtag.
2.2 Data Preprocessing
Data preprocessing is the core step in any text mining analysis. Without data preprocessing, one cannot reliably compute the polarity of text data and may end up with incorrect results. The raw data extracted from Twitter contains noise in the form of repeated words, slang, URLs, images, extra whitespace, and unusual words. All this noise in a tweet makes it difficult to perform sentiment analysis. Therefore, it is necessary to remove unstructured data from the original tweet in order to understand its true meaning. NLP was used to extract the appropriate text from the data, converting the content into more meaningful data that can be used for sentiment analysis. Applying the preprocessing steps and removing unnecessary data reduces the size of the file, which also improves the performance of the model in finding the sentiment polarity. Figure 2 shows an overview of the data preprocessing steps.
[Figure 2: Raw Tweet Data → Elimination of URLs, RT character, @username, #Hashtags → Filtering of repeated characters in word → Remove white space, stopwords, special characters and symbols → Cleaned Tweet Data]
Fig. 2. Structure of data preprocessing
The algorithm implemented in Python for cleaning the data and removing irrelevant content from each user tweet is described below.
The algorithm first removes URLs, then cleans special symbols such as @, # and RT, plus other symbols like (: \ ; {} – [] + () ! ? @ # < > % *,). Next, repeated characters are removed. Finally, stopwords, whitespace, and punctuation are removed and spelling errors are corrected.
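The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `clean_tweet`, the regular expressions, and the tiny `STOPWORDS` set are assumptions made for the example, and the spelling-correction step is omitted.

```python
import re

# Hypothetical, deliberately tiny stopword set used only for this sketch;
# the paper relies on the English stopword list shipped with NLTK.
STOPWORDS = {"a", "an", "and", "be", "is", "of", "or", "the", "to"}


def clean_tweet(text: str) -> str:
    """Apply the cleaning steps in the order described in the text."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\bRT\b", " ", text)                  # remove the retweet marker
    text = re.sub(r"[@#]\w+", " ", text)                 # remove @username and #hashtag
    text = re.sub(r"[^A-Za-z\s]", " ", text)             # remove symbols, digits, punctuation
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)           # squeeze runs of 3+ repeated characters to 2
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return " ".join(tokens)                              # also collapses extra whitespace


print(clean_tweet("RT @user: The law is sooooo good!!! https://t.co/abc #CAA"))
```

Squeezing repeated characters to two rather than one is a design choice of this sketch: it keeps legitimate double letters (as in "good") intact.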
3 Sentiment Analysis

People are social beings living in different geographical locations across the globe. With globalization and the advent of social media platforms such as Instagram, Twitter and Facebook, the user base has expanded immensely. Twitter alone has over 300 million monthly active users [4]. People use these channels to openly express their thoughts and emotions, in widely spoken languages such as English, French, Hindi, Spanish, and Japanese. Communication through a language may be described as “linguistic communication” [5]. Natural Language Processing (NLP) is the processing by computers of a natural language expressed in written or verbal form [6]. NLP draws on linguistics, computer science and artificial intelligence to understand and process natural human languages [7]. It has been an active area of research, and sentiment analysis is one of its most widespread applications. Machine learning methods categorize linguistic data by sentiment polarity after representing the data in vector form [8]. NLP also allows programmers to work with real-world entities through computer programs and to process natural languages. Therefore, Natural Language Processing, or Computational Linguistics [9], can be described as the processing of human languages using artificial languages. It comprises a number of steps and procedures for manipulating and analyzing natural human languages. The sentiment polarity of a sentence or a document can be computed using lexical dictionaries such as WordNet and SentiWordNet, and treebanks. Stopword removal, word tokenization, word stemming and lemmatization, word sense disambiguation, part-of-speech (POS) tagging, parsing, named entity recognition, and information extraction are some of the techniques used in computing the sentiment polarity of natural language text.
3.1 NLTK (Natural Language Toolkit)

NLTK provides a platform for constructing Python programs that process natural language data. It provides convenient interfaces to over 50 corpora and lexical resources such as WordNet, along with text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning. It gives access to methods and packages for common NLP tasks, and also provides the pre-processed and original versions of standard corpora used in NLP literature and courses [9].

Removing Stopwords: Stopwords are the most frequent, commonly used words in a natural language. They add little to the meaning of a document and therefore contribute little to the sentiment polarity of a sentence. For English, 179 stopwords are available in the NLTK library, and they can be removed using Python libraries. Examples of English stopwords are {of, and, or, the, an, be, etc.}.

Word Tokenization: Word tokenization is the procedure of breaking a statement down into words. Here, the processed tweets are broken down into words separated by commas. Tokenization is an important step in processing natural language. A sentence contains many extra words, such as conjunctions and punctuation, that are needed to form a correct statement but are useless for sentiment analysis, while other words carry a feeling of their own. Word tokenization segregates the words that have their own meaning and an associated sentiment or expression, so that the sentiment polarity can be calculated from them.

Word Stemming: Stemming is a process of normalizing the form of a word [10]. It extracts the root of a word [11] by chopping off its suffix, for example “ed”, “ing”, “ion”, or “ions”. Stemming is performed by removing the suffix character by character until the root of the word is obtained. The NLTK library in Python provides a method for stemming words.

Word Lemmatization: Lemmatization overcomes a problem of stemming: because stemming simply removes the suffix, the meaning of the word is sometimes changed. Lemmatization is the process of transforming a word to its root or dictionary form. The word that carries the root meaning is called a lemma and can be used directly in WordNet. Word lemmatization is used to create the lemmas that are input to the WordNet dictionary to find the sense number of the word and thereby its polarity.
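To make the four steps concrete, the following is a small, dependency-free sketch. It is not NLTK itself and not the authors' code: the stopword set, the suffix list and the lemma dictionary are toy assumptions chosen for illustration, whereas the paper uses NLTK's stopword corpus, stemmer and WordNet-based lemmatizer.

```python
import re

# Toy resources, assumed for illustration only.
STOPWORDS = {"a", "an", "and", "are", "be", "is", "of", "or", "the"}
SUFFIXES = ("ions", "ion", "ing", "ed")           # longest suffix is tried first
LEMMAS = {"better": "good", "studies": "study"}   # dictionary-form lookup


def tokenize(text):
    """Break a statement into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Drop frequent words that carry little sentiment."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(word):
    """Chop a known suffix off the word, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Map a word to its dictionary form when the toy dictionary knows it."""
    return LEMMAS.get(word, word)


tokens = remove_stopwords(tokenize("The protests are growing"))
print(tokens, [stem(t) for t in tokens])
print(stem("studies"), lemmatize("studies"))
```

The last line illustrates the difference the text describes: the suffix stripper leaves "studies" unchanged because no listed suffix matches, while the dictionary lookup returns the lemma "study".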
3.2 Semantic Lexicons

A sentiment lexicon is basically a list of lexical features, for example words, that are labelled according to their semantic orientation as positive or negative [12]. One can manually create and validate a list of opinion-bearing features; this is among the more reliable ways of generating a sentiment lexicon, but it is time-consuming. For this reason, researchers still use predefined and modelled sentiment lexicons. Lexicons are an important part of sentiment analysis procedures, so polarity-based and valence-based semantic lexicons are discussed below.

3.2.1 Semantic Orientation (Polarity-Based) Lexicons

A computer program named LIWC (Linguistic Inquiry and Word Count) was constructed to analyze the emotional, cognitive, and structural components of text. LIWC has its own dictionary, which contains 76 categories and around 4500 words. It is regarded as a reliable and validated tool and is used to calculate the sentiment polarity of texts. However, LIWC does not account for the intensity of a word. As an example, “She is awesome” conveys more positive sentiment than “She is okay”, yet the LIWC score is the same for both sentences. This difference can matter a great deal in social media sentiment analysis, which demands a more precise look at each word in the statement.

3.2.2 Semantic Intensity (Valence-Based) Lexicons

Sentiment intensity is needed for a more in-depth study of a word. Intensity helps in assessing a word on a graded scale rather than only resolving the binary polarity of a text's positive or negative emotion. Valence-based lexicons provide the intensity of a word. It is sometimes important to learn how the sentiment intensity for a product has changed over a period. SentiWordNet is an extension of WordNet that gives the valence strength of the text. The intensity of a text is expressed as a numerical score between 0 and 1. The scores are calculated with complicated semi-supervised algorithms. An interface to SentiWordNet is provided by the NLTK library in Python.

3.2.3 Valence Aware Dictionary and sEntiment Reasoner (VADER)

VADER is a sentiment analysis tool particularly attuned to social media messages that exploits the benefits of rule-based modeling and lexicon-based characteristics. The VADER sentiment analyzer gives a sentiment score related to the word intensity obtained from the lexical features of a word. For each tweet, VADER returns positive, negative, neutral and compound scores. The compound value is a useful metric for measuring the sentiment in a given tweet. In the proposed method, the threshold values used to categorize tweets as positive, negative, or neutral are as follows:

1. Positive Sentiment: Compound value > 0.0, assign score = 1
2. Negative Sentiment: Compound value < 0.0, assign score = -1
3. Neutral Sentiment: Compound value = 0.0, assign score = 0
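The three thresholds above translate directly into a small labelling function. The sketch below assumes the compound value has already been computed (for instance with the `polarity_scores` method of NLTK's `SentimentIntensityAnalyzer`, which returns a dictionary containing a `compound` entry); the function name and the example values are illustrative assumptions, not taken from the paper.

```python
def label_from_compound(compound: float) -> int:
    """Map a VADER compound value to the score used in the text: 1, -1 or 0."""
    if compound > 0.0:
        return 1    # positive sentiment
    if compound < 0.0:
        return -1   # negative sentiment
    return 0        # neutral sentiment


# Hypothetical compound values, one per tweet.
example_compounds = [0.6249, -0.4215, 0.0]
print([label_from_compound(c) for c in example_compounds])
```

Note that with a threshold of exactly 0.0, only tweets with no scored sentiment-bearing terms (or perfectly cancelling ones) end up in the neutral class.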
4 Methods

After data cleaning and NLP-based data preprocessing, both machine learning and deep learning techniques were used to classify the unlabeled data by sentiment polarity. VADER converted the sentiments into labeled data, with positive, negative, and neutral scores for the terms in the tweets. The classifiers were trained on the labeled data to predict the sentiment of unlabeled data. To translate the strings into numbers, Term Frequency-Inverse Document Frequency (TF-IDF) was used.

4.1 Machine Learning Models

This section gives a brief introduction to the algorithms used: Naive Bayes, Support Vector Machine (SVM), Logistic Regression, Random Forest, K-Nearest Neighbor (KNN), and Neural Networks. Figure 3 shows the classification procedure for training and testing of sentiment scores.
Fig. 3. Classification procedure for predicting sentiment scores
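As an illustration of how TF-IDF turns tokenized tweets into numeric vectors, here is a minimal sketch of the textbook formulation (term frequency times the logarithm of the inverse document frequency). The paper does not state which TF-IDF variant or library was used, so the formula, the function name `tf_idf` and the two toy documents are assumptions; library implementations typically add smoothing and normalization, so their values will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    weight(t, d) = (count of t in d / length of d) * log(N / df(t))
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append(
            [(counts[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Two hypothetical, already-cleaned tweets.
vocab, vectors = tf_idf([["caa", "good"], ["caa", "bad"]])
print(vocab)
print(vectors)
```

A term that occurs in every document (here "caa") receives weight zero, which is the intended effect: it does not help distinguish one tweet from another.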
Naïve Bayes: Naive Bayes uses Bayes' theorem to predict the tag of a text. The probability of every tag is computed for a given text, and the tag with the greatest probability is output. Bayes' theorem expresses the probability of an event based on prior knowledge of conditions that may be relevant to it, that is, based on the probability of another event that has already occurred. Bayes' theorem is given as:

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \]

where P(A|B) (the posterior probability) is the conditional probability of event A given the observed data sample B. P(A) is the prior probability of A, i.e., the initial probability. P(B) is the probability that the sample data is observed. P(B|A) is the likelihood, i.e., the probability of observing the sample B given that the hypothesis A holds.

Logistic Regression: Linear regression is the process of fitting a linear equation to observed data, showing the relationship between two variables: one is the dependent variable and the other is an explanatory variable. Logistic regression determines the output when one or more independent variables are used, and the output value is a number between 0 and 1. Logistic regression applies the sigmoid function to the output of the linear model; that linear output is the logit, i.e., the log of odds, from which the probability is calculated. The sigmoid function is an “S”-shaped curve, also known as the sigmoid curve, and is given as:

\[ f(x) = \frac{L}{1 + e^{-k(x - x_0)}} \]

where x_0 denotes the x value of the sigmoid's midpoint, L is the curve's maximum value, and k is the logistic growth rate. The sigmoid function gives the probability of belonging to class 1 or class 0. The odds ratio is computed by dividing the probability that an event will occur by the probability that it will not occur. Taking the logarithm of the odds ratio gives the log of odds. The equation below expresses the odds ratio in terms of the probability P given by the logistic (sigmoid) function:

\[ \text{OddsRatio} = \frac{P}{1 - P} \]
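The three formulas above (Bayes' theorem, the logistic function and the odds) can be checked numerically with a few lines of Python. This is only a sketch for intuition: the function names and the example probabilities are assumptions made here, and the default parameters L = 1, k = 1, x0 = 0 correspond to the standard sigmoid.

```python
import math


def bayes_posterior(likelihood, prior, evidence):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


def logistic(x, L=1.0, k=1.0, x0=0.0):
    """f(x) = L / (1 + exp(-k * (x - x0)))."""
    return L / (1.0 + math.exp(-k * (x - x0)))


def odds(p):
    """Probability of the event divided by the probability of its complement."""
    return p / (1.0 - p)


def log_odds(p):
    """Logit: the natural logarithm of the odds."""
    return math.log(odds(p))


# Hypothetical numbers: P(B|A) = 0.9, P(A) = 0.5, P(B) = 0.6.
print(bayes_posterior(0.9, 0.5, 0.6))
# The logit undoes the standard sigmoid: log_odds(logistic(x)) returns x.
print(logistic(0.0), log_odds(logistic(2.0)))
```

The final line shows why the logit is described as the linear part of logistic regression: applying the log of odds to a sigmoid output recovers the original linear score.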
Support Vector Machine (SVM): SVM is a supervised, distance-based learning model that is extensively used for classification and regression. The main aim of SVM is to find an optimal separating hyperplane that correctly classifies the data points and separates the points of the two classes as far as possible, i.e., the two classes have maximum distance from the separating hyperplane. SVM does not perform well for text categorization as compared to Naïve Bayes and Maximum Entropy [13]. Different kernel functions are used in different implementations of SVM, including linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid. In this research, the linear kernel and the RBF kernel were used.

Random Forest: Random Forest is a supervised learning technique. It is an ensemble algorithm, i.e., it combines more than one model for classifying objects. It builds decision trees on subsets of the original dataset and decides the final class of the input data by aggregating the votes from the different decision trees. The tree predictors are set up in such a way that each tree depends on the values of a random vector, sampled with the same distribution for all of the trees in the forest.

K-Nearest Neighbor: KNN is a lazy learner, as it builds no explicit model and uses the training data directly at prediction time. It identifies the K nearest neighbors of a sample based on a distance function. Distance functions that can be used include Euclidean distance, Manhattan distance, and Minkowski distance; these three are used for continuous variables. Hamming distance is used for categorical variables.

Neural Networks: A neural network is a collection of layers that tries to identify core associations in a set of data using a mechanism that mimics how the human brain works. Its basic units are neurons, of which it has many. Each node is a perceptron, which connects to nodes in the hidden layer, which in turn feed into an activation function that may be nonlinear. Perceptrons are arranged as interconnected layers in a multi-layered perceptron. The input layer consists of input nodes whose values are passed, with weights, to the hidden layer nodes, and from there to the output layer through the activation function. The weights are fine-tuned in the hidden layers to obtain the optimal output with the least error. Hidden layers simulate a nonlinear function of the input data that predicts the outputs optimally.

4.2 Deep Learning Methods

Long Short-Term Memory based Recurrent Neural Network: An RNN learns information from the immediately preceding step; the data moves in a loop through the network. Figure 4 shows a portion of a neural network, A, which receives an input x_t and outputs a value h_t.
Fig. 4. Recurrent neural network loops
A loop passes the information from one step of the network to the next. An RNN can be viewed as the same network repeated multiple times, each copy passing a message to the next step. One special kind of RNN is the Long Short-Term Memory (LSTM) network, which is capable of learning long-term dependencies. LSTMs are explicitly designed to avoid the long-term dependency problem: they remember information for a long time.
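The loop described above can be sketched with a single scalar hidden state. This is a plain recurrent step, not an LSTM (it has no gates), and it is not the network trained in the paper: the tanh activation, the weights `w_x`, `w_h`, the bias `b` and the input sequence are all assumptions chosen for illustration.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0, h0=0.0):
    """Run a one-unit recurrent network over a sequence.

    At each step t: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b),
    so the previous output is fed back in together with the new input.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# A single impulse followed by zeros: the later states are non-zero only
# because the loop carries information forward from the first step.
print(rnn_forward([1.0, 0.0, 0.0]))
```

In this sketch the carried-over value shrinks at every step, which gives some intuition for why a plain RNN struggles with long-term dependencies and why the gated LSTM cell was introduced.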
Fig. 5. The repeating module in an LSTM
In Fig. 5, vectors are transported along each line from the output of one node to the input of the next node. Each internal circle represents an operation such as vector addition or multiplication, and the yellow boxes represent layers of the neural network. The merging of lines denotes concatenation of the inputs, while the forking of a line denotes copying the content to different locations.
4.3 Evaluation Parameters

The performance of the classifiers was assessed using the following parameters: accuracy, precision, recall, F1-score and Area Under the Receiver Operating Characteristic (ROC) Curve (AUC). They are defined below, where TP, TN, FP and FN denote the numbers of true positive, true negative, false positive and false negative predictions.

Accuracy: It is the ratio of the true positive and true negative predictions (correct predictions) to the total observations.

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1} \]

Precision: It is the ratio of the true positive predictions to the total positive predictions.

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]

Recall: It is the ratio of the true positive predictions to the total actual positives.

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]

F1 Score: It is the harmonic mean of precision and recall.

\[ \text{F1score} = 2 \cdot \frac{\text{recall} \cdot \text{precision}}{\text{recall} + \text{precision}} \tag{4} \]

AUC-ROC (Area Under the ROC Curve): The ROC curve is used for visual comparison of classification models. It is a plot of the true positive rate against the false positive rate. AUC is the area under the ROC curve and is a measure of how well the model separates the classes. A perfect classifier has an area of 1.0.
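Equations (1) to (4) and the AUC can be computed as in the sketch below. It assumes a binary problem for simplicity, whereas the paper has three sentiment classes and does not say how the per-class values were averaged; the confusion-matrix counts and the scores are hypothetical. The AUC function uses the standard pairwise interpretation: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes every denominator is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * recall * precision / (recall + precision)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and scores, for illustration only.
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Because F1 is a harmonic mean, it always lies between precision and recall, which is a quick sanity check when reading reported scores.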
5 Results

The sentiment label attributes in the output data set were created from the overall data set produced by the derived algorithm, which carries labels with a positive, negative or neutral score. Figure 6 shows that, of all the tweets, 42.8% were neutral, 35.9% were positive and 21.3% were negative. To predict the test data, the classifiers were trained on the output data obtained by adding sentiment scores and class labels. The model selection step splits the input data, which may be in the form of lists, arrays or data frames, into random training and test sets. The model is trained on the training data and then applied to the test data to determine the label of each unknown data sample. The model's performance is evaluated using the test dataset. Three test sizes were used: 10%, 20%, and 30% of the total data served as test data, while the remaining 90%, 80%, and 70%, respectively, served as training data. A total of 50,000 random tweets were used for classification.
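The three train/test ratios can be illustrated with a short, self-contained sketch. The paper does not show its splitting code, so the function `split_train_test`, the fixed seed and the stand-in data below are assumptions; the sketch only demonstrates the idea of a random split at a given test fraction.

```python
import random


def split_train_test(samples, test_size, seed=0):
    """Shuffle the samples and split off a fraction `test_size` as test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the labeled tweets: 1,000 integer IDs instead of 50,000 tweets.
data = list(range(1000))
for test_size in (0.1, 0.2, 0.3):
    train, test = split_train_test(data, test_size)
    print(test_size, len(train), len(test))
```

Fixing the seed makes the three splits repeatable, so differences in the reported scores reflect the split ratio rather than a different random shuffle.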
Fig. 6. Overall sentiment score.
Fig. 7. Comparison of accuracies on all methods
Seven different machine learning and deep learning methods were trained on the sentiment-labelled dataset, and the results show good performance. In Fig. 7, the highest accuracy, about 90%, was obtained with LSTM based RNN, and the lowest accuracy, 46%, was obtained with KNN. Figure 8 compares the precision, recall and F1-score obtained for all the methods. Precision, recall and F1-score are similar for Logistic Regression. The recall score is high for KNN and low for Naïve Bayes. Moreover, Random Forest shows a low F1-score and Logistic Regression a high F1-score. LSTM based RNN has the highest precision score, followed by Logistic Regression and SVM with the RBF kernel, and the lowest score is shown by Naïve Bayes. LSTM also shows similar scores across the three measures, which appear to be among the highest of all the methods.
Fig. 8. Comparison of precision, Recall and F1 score of all methods.
The analysis of all the methods using the ROC curve is shown in Fig. 9. KNN has the lowest score, 56%, and LSTM based RNN has the highest, 97%, followed by Random Forest and SVM with the same score of 91%. Naïve Bayes and SVM with the RBF kernel have a ROC score of 80%.
Fig. 9. Comparison for AUC scores
Table 1 shows a comparison of all the methods with their mean values. The accuracy, precision, recall, and F-score values achieved using KNN, Logistic Regression, Naive Bayes, SVM, Random Forest, Neural Network, and LSTM based RNN classifiers are shown.

Table 1. Comparison of all methods

| Metric | KNN | Random Forest | Logistic Regression | Naïve Bayes | SVM Linear Kernel | SVM RBF Kernel | Neural Network | LSTM |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.459 | 0.804 | 0.77 | 0.679 | 0.80 | 0.769 | 0.81 | 0.90 |
| Precision | 0.656 | 0.754 | 0.767 | 0.657 | 0.797 | 0.76 | 0.806 | 0.924 |
| Recall | 0.904 | 0.83 | 0.767 | 0.65 | 0.784 | 0.764 | 0.806 | 0.889 |
| F1-score | 0.757 | 0.784 | 0.767 | 0.65 | 0.787 | 0.767 | 0.806 | 0.957 |
| AUC | 0.561 | 0.916 | 0.908 | 0.837 | 0.905 | 0.818 | 0.916 | 0.976 |
Fig. 10. Comparison for all methods

From Fig. 10, it can be observed that LSTM performed the best in terms of accuracy (90%), precision (0.924), F1-score (0.957) and AUC score (0.976) compared to all other methods, and it is comparable to the best in recall. The second highest performance in terms of accuracy (80%), precision (0.81), and F1-score (0.81) is observed for Neural Network. Among the machine learning classifiers, SVM and Random Forest give the best accuracy, 80%; SVM gives higher precision than Random Forest, while Random Forest gives higher recall, F1-score and AUC scores. Logistic Regression and SVM with the RBF kernel show the next best accuracy, 77% and 76%, respectively. The precision, recall and F1-score are similar for both methods. But the AUC score for
Logistic Regression is higher, at 0.920, compared with 0.834 for SVM. Second to last is the Naïve Bayes classifier, which has 67% accuracy and the lowest, mutually similar, precision, recall and F1-score (0.65), while its AUC score is 0.83. The last classifier, which is not robust, is KNN: although it has the highest recall score among all the methods implemented, it is not competitive in accuracy and AUC.

5.1 Statistical Analysis

A one-way ANOVA test was performed to select the best classifier. A statistical significance test can be used to determine whether the differences in the performance measurements found across all methods are significant. The ANOVA (one-way Analysis of Variance) test was performed on the mean values of the output parameters over all three split ratios, for all the methods. The null hypothesis (H0) states that all models perform equally well. H0 is accepted if the mean values of the output parameters for the various models do not differ to a statistically significant degree. If a statistically significant performance difference (p < 0.05) is found, the alternate hypothesis (H1) is accepted and H0 is rejected. Table 2 shows the mean values and the p-value for all the measures.

Table 2. ANOVA test results on performance metrics

| Metric | KNN | Random Forest | Logistic Regression | Naïve Bayes | SVM Linear Kernel | SVM RBF Kernel | Neural Network | LSTM | P-values |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.459 | 0.804 | 0.77 | 0.679 | 0.80 | 0.769 | 0.802 | 0.889 | $8.38 \times 10^{-19}$ |
| Precision | 0.656 | 0.754 | 0.767 | 0.657 | 0.797 | 0.76 | 0.806 | 0.924 | $3.04 \times 10^{-6}$ |
| Recall | 0.904 | 0.83 | 0.767 | 0.65 | 0.784 | 0.764 | 0.806 | 0.889 | $6.05 \times 10^{-6}$ |
| F1-score | 0.757 | 0.784 | 0.767 | 0.65 | 0.787 | 0.767 | 0.806 | 0.957 | $1.06 \times 10^{-6}$ |
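To show what the one-way ANOVA test computes, the sketch below derives the F statistic by hand: the ratio of the variance between group means to the variance within groups. The three groups are hypothetical numbers, not the paper's measurements, and the function name is an assumption. Turning F into a p-value additionally requires the F distribution, which a statistics library provides (for example, `scipy.stats.f_oneway` returns both the statistic and the p-value).

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of sample groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of the samples around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical scores of three classifiers over three train/test splits.
print(one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]))
```

A large F value means the group means differ by much more than the spread inside each group, which is what leads to a small p-value and rejection of H0.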
6 Conclusions and Future Work

The objective of this research was to use different machine learning techniques to perform sentiment analysis on a linguistic data set. A method was proposed for sentiment analysis of the tweets related to CAA. Natural Language Processing techniques were used in pre-processing to filter and remove meaningless data from the unstructured data. A series of pre-processing steps is required to convert unstructured data into structured data. The raw tweets were screened and preprocessed to produce cleaner data. After needless data had been filtered from the tweets, the derived data was used for additional processing tasks. The Natural Language Toolkit (NLTK) was used for word tokenization, stemming and lemmatization. Using the VADER sentiment analyzer, the polarity of the Twitter data on CAA was calculated to build a model for analyzing the linguistic dataset.
The polarity of the tweets showed that most of the tweets were neutral, followed by positive tweets, while negative tweets were the least frequent. Sentiment scores were used to create a labeled dataset, on which different classifiers were trained to predict the sentiment of the tweets in the test data. The Random Forest classifier achieved an accuracy of 80.99%, Logistic Regression 77.68%, Naïve Bayes 68.33%, SVM with linear kernel 80.50%, SVM with RBF kernel 77.50%, and KNN 46.39%. The Neural Network gave an accuracy of 82%, and LSTM gave the highest accuracy, 90%. Future work can address real-time changes in the sentiment polarity assigned to Twitter data. To do so, a website or software could be prepared through which all the data pre-processing steps are performed using different chosen techniques. Currently, the model's accuracy has been checked using 5000 words and only 50,000 random tweets, but more words and tweets could be used to analyze a bigger dataset and evaluate the models. Other social media platforms can also be selected for analysis, and a dataset can be created through Twitter's API using hashtags.
References

1. Twitter statistics Blog. https://www.oberlo.ca/blog/twitter-statistics/
2. Actively Twitter users. https://blog.hootsuite.com/twitter-demographics/
3. Neri, F., et al.: Sentiment analysis on social media. In: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012). IEEE Computer Society (2012)
4. Source: “Company | About.” Twitter. Twitter, 19 June 2020. Web. 04 Dec. 2019
5. Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL on Interactive Presentation Sessions. Association for Computational Linguistics (2006)
6. Understanding the classification report through SKlearn. https://muthu.co/understanding-theclassification-report-in-sklearn/
7. Chowdhury, G.G.: Natural language processing. Ann. Rev. Inf. Sci. Technol. 37(1), 51–89 (2003)
8. Olsson, F.: A literature survey of active machine learning in the context of natural language processing (2009)
9. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python. O’Reilly Media, Inc., Newton (2009)
10. Toman, M., Tesar, R., Jezek, K.: Influence of word normalization on text classification. Proc. InSciT 4, 354–358 (2006)
11. Younis, E.M.: Sentiment analysis and text mining for social media microblogs using open-source tools: an empirical study. Int. J. Comput. Applicat. 112(5) (2015)
12. Liu, B.: Sentiment analysis and subjectivity. In: Indurkhya, N., Damerau, F. (eds.) Handbook of Natural Language Processing (2nd edn.). Chapman & Hall, Boca Raton (2010)
13. Raza, M., Saqib, N., Basit, S., Javed, F., et al.: Early detection of controversial Urdu speeches from social media. Data Sci. Pattern Recognit. 1(2), 26–42 (2017)
Sentiment Analysis on Depression Detection: A Review

Norma Mohamad Nor1, Noorihan Abdul Rahman1(B), Mohd Ridzwan Yaakub2, and Zuriani Ahmad Zukarnain1

1 Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA Kelantan, Bukit Ilmu, Machang, Kelantan, Malaysia
[email protected]
2 Faculty of Information and Science Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia
Abstract. Depression has become a public health issue. Its high prevalence worsens all aspects of life irrespective of age and gender, affects psychological functioning, and results in loss of productivity. Early detection is crucial for extending individuals' lifespans and for more effective mental health interventions. Social networks, where people share personal matters and feelings, have enabled the automatic identification of specific mental conditions, particularly depression. This review explores the application of sentiment analysis in psychology for detecting depressed users from datasets originating from social media. Sentiment analysis involves five research tasks, but this study investigates sentiment analysis that focuses on emotion detection in text data. This paper surveys existing work on the most common machine learning classification approaches for classifying linguistic, behavioral, and emotional features, and presents a comparative study of the different approaches.

Keywords: Depression · Emotion detection · Machine learning approach · Sentiment analysis · Social media
1 Introduction Depression is threatening people’s health nowadays. Depression is a mood state characterized by sadness, feeling of worthlessness or excessive guilt, empty mood, and followed by cognitive changes that are severe enough to interfere with the individual function, working, and social life [1]. The World Health Organization estimated that ten percent of the world population is affected by depression, increasing prevalence among young people aged 15 to 24 [2]. In Malaysia, depression is the most prevalent mental disorder, affecting 2.3 million individuals regardless of sociodemographic or geographic aspects, but this health issue remains undetected and untreated [3]. It has been estimated that the prevalence of depression in Malaysia is between eight and twelve percent [4], and it is still seen as fragmented and unclear [3].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 718–726, 2022. https://doi.org/10.1007/978-3-031-10464-0_48
People with depression are usually diagnosed by health practitioners face-to-face during patient visits, by reference to clinical depression criteria, but the diagnosis may not be made meticulously [5]. Researchers found that health records lack support for documenting and tracking data on behavioral health conditions [6]. Furthermore, the provision for diagnosing, supporting, and managing depression is still insufficient [6], as a depression diagnosis is more accurate if made over more than one visit. The current trend in depression detection is to consider social media data. Several studies have shown that social media has been used successfully to deal with people's mental health. For example, people's behavior can be revealed by mining their opinions in text data, resulting in improved healthcare [7]. Analyzing social media data with machine learning techniques can identify patterns and make predictions based on the retrieved data [8], as text mining provides new non-trivial knowledge such as prediction values, hidden patterns, and dependencies [9]. Hence, by mining users' social media posts, we may obtain valuable information about the users' behavior, which could help detect and predict depression. The use of social networks in daily life to share feelings, thoughts, and emotions provides unprecedented opportunities to solve problems in a wide variety of fields through text data exploration [10]. In a traditional setting, depression screening and detection are based on questionnaires and face-to-face interviews. However, many practitioners, especially psychologists, psychiatrists, and even counselors, are now turning to online network services to deal with depression detection and to investigate individual emotions and behaviors during online interactions [11].
Social media are important data sources for sentiment analysis that can explore depression that involves emotions, behavior, and cognitive aspects. For example, psychologists may observe the online behaviors of depressed users and discover potential features that could be used to distinguish the depression levels of an individual.
2 Sentiment Analysis: An Overview

Sentiment analysis is a branch of affective computing that mines people's ideas, sentiments, and feelings based on observations of their behaviors, which can be captured through writings, facial expressions, voice, music, and gestures. It is the process of automatically extracting information about an entity and recognizing the subjectivity expressed about it. Sentiment analysis is a text mining technique used to analyze the sentiment of text, usually involving opinion mining and emotion mining. Opinion mining is concerned with the expression of opinions, i.e., whether the text generated by users is positive, negative, or neutral. In contrast, emotion mining is concerned with the feelings expressed in the text. Sentiment classification may be performed at three levels: feature, sentence, and document. Feature-level classification aims to categorize the sentiment with respect to particular entities. Sentence-level classification is intended to categorize whether a sentence conveys a positive, negative, or neutral view. Document-level classification treats the entire document as a single unit and tries to classify its opinion as positive or negative. As shown in Fig. 1, there are three methods for performing sentiment analysis classification, namely keyword-based, machine learning and hybrid techniques.
Fig. 1. Taxonomy of sentiment analysis techniques
The keyword-based approach relies on words associated with given sentiments, obtained through either a dictionary-based or a corpus-based approach. The dictionary-based technique gathers seed sentiment words manually, based on synonyms and antonyms, whereas the corpus-based technique employs statistical or semantic approaches and a seed list of opinion words to locate other opinion words in a large corpus. The machine learning technique, on the other hand, comprises two methods: supervised and unsupervised learning. Supervised approaches use labeled datasets to build a model, which is then used to classify unlabeled input data. In unsupervised methods, where unlabeled datasets are used for training, clustering techniques are needed to label the data. Finally, the hybrid strategy combines a keyword-based method with a machine learning technique to increase accuracy.
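The keyword-based approach described above can be illustrated with a minimal sketch. The seed lexicons, the negation list, and the function name `keyword_polarity` are hypothetical and chosen only for illustration; real systems rely on curated dictionaries, and none of the reviewed studies is claimed to use this exact procedure.

```python
# Hypothetical seed lexicons; a real system would use a curated dictionary.
POSITIVE = {"happy", "good", "great", "love", "hopeful"}
NEGATIVE = {"sad", "bad", "hopeless", "tired", "alone"}
NEGATIONS = {"not", "never", "no"}


def keyword_polarity(text: str) -> str:
    """Label text as positive, negative, or neutral by counting lexicon hits."""
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:\"'")
        if word in NEGATIONS:
            negate = True  # flips only the immediately following token
            continue
        hit = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1, or 0
        if hit:
            score += -hit if negate else hit
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


labels = [keyword_polarity(t) for t in (
    "What a great day",
    "I am not happy",
    "I feel sad and alone",
    "The bus arrived",
)]
```

A supervised classifier would replace the fixed lexicon with weights learned from labeled posts, and a hybrid method could feed lexicon scores such as `score` into that classifier as additional features.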
3 Application of Sentiment Analysis

Sentiment analysis is widely used in many areas such as business, government, and healthcare. Applications such as analyzing the weather, economy, products, politics, medical facilities, and disease outbreaks have been built on extensive user-generated content from social media. In depression detection, a deep learning model performs sentiment analysis on statements that carry positive, negative, or neutral polarity. Shetty et al. [12] applied a long short-term memory (LSTM) model to detect whether a person's tweet indicates any signs of depression. Three different classifiers, namely Logistic Regression, Linear Support Vector, and Ensembles, were used to cross-check the binary class prediction.
Sentiment Analysis on Depression Detection: A Review
Trotzek et al. [13] used a Convolutional Neural Network (CNN) on different word embeddings to classify signs of depression in written texts, and also classified users based on user-level metadata features such as word and grammar usage, readability, emotions, and sentiment. In contrast, Yates et al. [14] used a CNN to identify users with depression in online forums by processing user postings and merging them to create a vector representation of the user's activity for classification. They developed a shared architecture built on a CNN, a merging layer, model-specific loss functions, and an output layer to increase performance. Wang et al. [15], in turn, trained a CNN to find connections between mental disorders and dietary supplement usage using Twitter data. They applied sentiment analysis using Linguistic Inquiry and Word Count (LIWC) to a Twitter dataset containing mental health hashtags.

Machine learning techniques have been widely used to analyze social network data for user feelings. For example, Islam et al. [16] investigated moods and attitudes expressed while communicating online through Facebook. Based on several psycholinguistic features, the depression analysis was performed using four primary classifiers: Support Vector Machine (SVM), Decision Tree, Ensembles, and k-Nearest Neighbour (kNN). Tadesse et al. [17] used Natural Language Processing and machine learning to investigate the language of Reddit users and find factors that reveal depressive attitudes among online users. They achieved better performance when proper feature selection was made and several features were combined. Cacheda et al. [18] presented two alternative machine learning approaches, singleton and dual, that employ textual, semantic, and writing features derived from user behavior to predict depressive conditions. They classified user behaviors based on three features of their writings: textual dissemination, time gap, and time span. Thorstad and Wolff [19] explored whether language could predict the development of various mental health conditions. For example, a depressed user might have crying spells, irritability, loss of interest in usual activities, conflict with friends or family, low self-esteem, a tendency to reflect or obsess over failures, and self-blame. Thus, they looked at user language and semantic similarity to see whether they could forecast the future occurrence of various mental illnesses.

Fatima et al. [20] adopted a hierarchical approach to predict postpartum depression (PPD). They extracted linguistic features from users' textual posts on social media using LIWC, which calculates the percentage of words that match its built-in dictionaries, and then categorized the posts as general, depressive, or PPD-representative using multiple machine learning techniques. Their prediction is based on two layers: the depression content classification (DCC) layer and the postpartum depression content classification (PPD-CC) layer. Peng et al. [11] employed a multi-kernel SVM-based model to recognize depressed people in social media. They extracted users' micro-blogging text, user profiles, and user habits from the Sina Weibo platform, and created an emotional dictionary consisting of a text dictionary and an emoticon dictionary to extract word frequency statistics from text features. Ricard et al. [21] proposed a predictive model using linear regression to detect depression from linguistic features in community-generated content, including multiple sentiment scores such as emoji sentiment and metadata variables. In addition, Hassan et al. [5] analyzed various grammatical nuances, cultural variations, and emotions behind social data. They adopted a voting model combining three different classifiers and feature selection techniques to find the depression level from Twitter and a newsgroup platform by observing and extracting emotions from the texts.
4 Discussion

Table 1 displays the data extracted from the reviewed papers. Language structure and patterns, emotional state, behavioral patterns, and many other characteristics distinguish depressed people from non-depressed people. Social media has the potential to reveal an individual's level of depression [6, 17]. Most studies have shown that depression can be identified using the linguistic features of text postings. These features consider the structure and meaning of words and sentences using linguistic styles categorized by a text analysis tool, Linguistic Inquiry and Word Count (LIWC) [11, 13–17, 20]. LIWC is a standard text analysis tool that reveals individual feelings, emotions, and personalities based on written text. It is made up of pre-defined word categories related to psychometric characteristics such as positive and negative emotions, as well as affective dimensions such as anger, disgust, fear, joy, and sadness. Furthermore, various feature extraction methods have been reported. Techniques commonly used in natural language processing and text mining, such as Bag-of-Words (BoW) [13, 14, 18, 21], n-grams [5, 12, 17], TF-IDF [11, 12, 14, 19], and Part-of-Speech tagging (PoS tagging) [5], were frequent. The use of PoS tagging for depression is particularly interesting, as some studies have already identified peculiarities in the language used by depressed people, such as greater use of first-person pronouns [5] and past tense verbs [12]. Some studies also perform classification based on a score given by a formula over frequency words (such as "always", "never", and "generally"), depressive symptoms, pronouns, and negations [5].
Dictionary-based approaches designed for computational use were also employed, such as LIWC, with categories related to PoS and the analysis of personal feelings and interests, and Affective Norms for English Words (ANEW) [5], for the valence and affective arousal of words. Analyzing emotions in the texts can be useful given the emotional impact that depression causes. Some works also used self-created dictionaries, such as the dictionaries of [14] for frequency terms and symptoms. Topic modeling with the Latent Dirichlet Allocation technique [17] was not as common as other techniques. In addition, researchers have applied different techniques for depression classification, such as Convolutional Neural Network (CNN) [12–15], Logistic Regression [12, 17, 19, 20], Support Vector Machine (SVM) [5, 11, 15–17, 20], Naïve Bayes [5, 11, 12], Multilayer Perceptron (MLP) [15, 17, 20], and others, which returned different classification performance. These techniques are mainly used for sentiment analysis, which determines whether the data retrieved from social media contains positive, negative, or neutral sentiments. Using various machine learning algorithms, it is also possible to combine sentiment analysis with deep learning to classify the sentiment.
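Two of the feature extraction techniques named above, n-grams and TF-IDF, can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the whitespace tokenizer, the unsmoothed weighting tf × log(N/df), and the helper names `ngrams` and `tfidf` are choices made here, not the pipeline of any reviewed study.

```python
import math
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max from whitespace tokens."""
    tokens = text.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs, n_max=2):
    """Map each document to {term: tf * log(N / df)} over its n-grams."""
    term_lists = [ngrams(d, n_max) for d in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(docs)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


docs = ["i feel sad", "i feel fine"]  # toy posts, invented for illustration
vectors = tfidf(docs)
```

Terms shared by every document (here "i", "feel", and "i feel") receive a weight of zero, while terms specific to one post, such as "sad" or "feel sad", receive positive weight. Vectors of this kind could then be passed to any of the classifiers listed above.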
Table 1. Summaries of reviewed articles

| Ref. | Title | Dataset | Techniques | Features | Evaluation metrics |
|---|---|---|---|---|---|
| [12] | Predicting depression using deep learning and ensemble algorithms on raw twitter data | Kaggle datasets on Twitter tweets | CNN, Naïve Bayes, Logistic regression, linear support vector, multinomial Naïve Bayes, Bernoulli | Count vectorizer, TF-IDF, n-grams | Binary cross-entropy with accuracy metric selection |
| [13] | Utilizing Neural Networks and Linguistic Metadata for Early Detection of Depression Indications in Text Sequences | eRisk 2017 Dataset | CNN | LIWC, CBoW | ERDE metric |
| [14] | Depression and Self-Harm Risk Assessment in Online Forums | Reddit Self-reported Depression Diagnosis (RSDD) dataset | CNN | LIWC, BoW, TF-IDF, Emotion lexicons | Categorical Cross-Entropy, Class Metric, MSE |
| [15] | Detecting associations between dietary supplement intake and sentiments within mental disorder tweets | Crawled using Twitter Streaming API with 25 hashtags | CNN, SVM, Random Forest, MLP | LIWC, Tweet2vec | 10-fold cross-validation |
| [16] | Depression detection from social network data using machine learning techniques | Facebook users' comments for depressive behavioral exploration | SVM, Decision Tree, Ensembles, kNN | Psycho-linguistic features, LIWC | Matrices parameters (precision, recall and F-measure) |
| [17] | Detection of Depression-Related Posts in Reddit Social Media Forum | Collected from Reddit | Logistic Regression, SVM, Random Forest, Ada Boost, MLP | LIWC, LDA, n-grams | Confusion matrix, Accuracy |
| [18] | Early Detection of Depression: Social Network Analysis and Random Forest Techniques | Collected from Reddit | Random Forest | BoW, LSA | ERDE metric |
| [19] | Predicting future mental illness from social media: A big-data approach | Collected from Reddit | Logistic Regression | TF-IDF, t-SNE | Hold-out on prediction accuracy |
| [20] | Prediction of postpartum depression using machine learning techniques from social media text | Collected from Reddit | SVM, Logistic Regression, MLP | LIWC | Hold-out on prediction accuracy |
| [11] | Multi-kernel SVM based depression recognition using social media data | Collected from Sina Weibo | SVM, Naïve Bayes, Decision Tree, KNN, libD3C classifiers | TF-IDF, Interaction frequency, LIWC | Cross-validation, Confusion matrix |
| [21] | Exploring the Utility of Community-Generated Social Media Content for Detecting Depression: An Analytical Study on Instagram | Collected from Instagram | Linear Regression | BoW | Hold-out on prediction accuracy |
| [5] | Sentiment Analysis of Social Networking Sites (SNS) Data using Machine Learning Approach for the Measurement of Depression | Collected from Twitter and Newsgroup | SVM, Naïve Bayes, Maximum Entropy | n-grams, POS, Negation, Sentiment Analyzer | Accuracy |

** TF-IDF – Term Frequency-Inverse Document Frequency; LIWC – Linguistic Inquiry and Word Count; CBoW – Continuous Bag of Words; LDA – Latent Dirichlet Allocation; BoW – Bag of Words; LSA – Latent Semantic Analysis; POS – Parts of Speech
5 Conclusion

Several researchers have looked into the use of sentiment analysis to detect depression in social media users. This paper reviews recent studies to identify the key features of sentiment analysis used to perform the classification. Information from social media has recently been widely used in sentiment analysis. Social media platforms such as Twitter, Facebook, and Reddit offer excellent potential for learning about people's behaviors, emotions, and social interactions. The studies reviewed in this paper applied different techniques to detect depression from social media data. This paper considers the sentiment analysis approach to depression detection based on linguistic analysis of the text, behavioral data, and social data. As a result, the depression classification approaches vary. In the future, we would like to conduct an in-depth review of the datasets, classifications, features, and techniques that are primarily extracted from text data, using a predefined protocol.

Acknowledgment. The work reported herein was fully supported by the Fundamental Research Grant Scheme (FRGS) under reference number FRGS/1/2018/SS09/UiTM/02/2. In addition, the authors would like to thank the Ministry of Higher Education (MOHE), Malaysia, and Universiti Teknologi MARA (UiTM), Malaysia, for supporting the research.
References

1. Truschel, J.: Depression definition and DSM-5 diagnostic criteria. Psycom (2019). https://www.psycom.net/depression-definition-dsm-5-diagnostic-criteria/
2. World Health Organization: Mental disorders (2019). https://www.who.int/news-room/fact-sheets/detail/mental-disorders. Accessed 02 Feb 2020
3. Mukhtar, F., Oei, T.P.S.: A review on assessment and treatment for depression in Malaysia. Depress. Res. Treat. 2011, 1–8 (2011). https://doi.org/10.1155/2011/123642
4. Chan, S.L., Hutagalung, F.D., Lau, P.L.: A review of depression and its research studies in Malaysia. Int. J. Educ. 2(4), 40–55 (2017). www.ijepc.com
5. Hassan, A., Hussain, J., Hussain, M., Sadiq, M., Lee, S.: Sentiment analysis of social networking sites (SNS) data using machine learning approach for the measurement of depression. In: International Conference Information Communication Technology Convergence ICT Convergence Technology Lead. Fourth Ind. Revolution, ICTC 2017, vol. 2017-Decem, pp. 138–140 (2017). https://doi.org/10.1109/ICTC.2017.8190959
6. Aldarwish, M.M., Ahmad, H.F.: Predicting depression levels using social media posts. In: Proceedings 2017 IEEE 13th International Symposium Autonomous Decentralized Systems, ISADS 2017, pp. 277–280 (2017). https://doi.org/10.1109/ISADS.2017.41
7. Ghani, N.A., Hamid, S., Hashem, I.A.T., Ahmed, E.: Social media big data analytics: a survey. Comput. Human Behav. (2018). https://doi.org/10.1016/j.chb.2018.08.039
8. Samways, B., Teresinha, M., Steiner, A., Trojan, A., Henrique, R., Lima, P.: Data mining and machine learning techniques applied to public health problems: a bibliometric analysis from 2009 to 2018. Comput. Ind. Eng. 138, 106120 (2019). https://doi.org/10.1016/j.cie.2019.106120
9. Niaksu, O., Skinulyte, J., Duhaze, H.G.: Systematic literature review of data mining applications in healthcare. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8182, pp. 313–324 (2014)
10. Chen, L., Ho, S.S., Lwin, M.O.: A meta-analysis of factors predicting cyberbullying perpetration and victimization: from the social cognitive and media effects approach. New Media Soc. 1–20 (2016). https://doi.org/10.1177/1461444816634037
11. Peng, Z., Hu, Q., Dang, J.: Multi-kernel SVM based depression recognition using social media data. Int. J. Mach. Learn. Cybern. 10(1), 43–57 (2017). https://doi.org/10.1007/s13042-017-0697-1
12. Shetty, N.P., Muniyal, B., Anand, A., Kumar, S., Prabhu, S.: Predicting depression using deep learning and ensemble algorithms on raw twitter data. Int. J. Electr. Comput. Eng. 10(4), 3751–3756 (2020). https://doi.org/10.11591/ijece.v10i4.pp3751-3756
13. Trotzek, M., Koitka, S., Friedrich, C.M.: Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences. IEEE Trans. Knowl. Data Eng. 32(3), 588–601 (2020)
14. Yates, A., Cohan, A., Goharian, N.: Depression and self-harm risk assessment in online forums. In: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing Proceedings, pp. 2968–2978 (2017). https://doi.org/10.18653/v1/d17-1322
15. Wang, Y., Zhao, Y., Zhang, J., Bian, J., Zhang, R.: Detecting associations between dietary supplement intake and sentiments within mental disorder tweets. Health Inf. J. (2019). https://doi.org/10.1177/1460458219867231
16. Islam, M.R., Kabir, M.A., Ahmed, A., Kamal, A.R.M., Wang, H., Ulhaq, A.: Depression detection from social network data using machine learning techniques. Health Inf. Sci. Syst. 6(1), 1–12 (2018). https://doi.org/10.1007/s13755-018-0046-0
17. Tadesse, M.M., Lin, H., Xu, B., Yang, L.: Detection of depression-related posts in Reddit social media forum. IEEE Access 7, 44883–44893 (2019). https://doi.org/10.1109/ACCESS.2019.2909180
18. Cacheda, F., Fernandez, D., Novoa, F.J., Carneiro, V.: Early detection of depression: social network analysis and random forest techniques. J. Med. Internet Res. 21(6) (2019). https://doi.org/10.2196/12554
19. Thorstad, R., Wolff, P.: Predicting future mental illness from social media: a big-data approach. Behav. Res. Methods (2019)
20. Fatima, I., Abbasi, B.U.D., Khan, S., Al-Saeed, M., Ahmad, H.F., Mumtaz, R.: Prediction of postpartum depression using machine learning techniques from social media text. Expert Syst. 36(4), 1–13 (2019). https://doi.org/10.1111/exsy.12409
21. Ricard, B.J., Marsch, L.A., Crosier, B., Hassanpour, S.: Exploring the utility of community-generated social media content for detecting depression: an analytical study on Instagram. J. Med. Internet Res. 20(12) (2018). https://doi.org/10.2196/11817
Supervised Negative Binomial Classifier for Probabilistic Record Linkage

Harish Kashyap and Kiran Byadarhaly

Mysuru Consulting Group, Mysore, India
[email protected]

Abstract. Motivated by the need for linking records across various databases, we propose a novel graphical model based classifier that uses a mixture of Poisson distributions with latent variables. The idea is to derive insight into each pair of hypothesis records that match by inferring its underlying latent rate of error using Bayesian modeling techniques. The novel approach of using Gamma priors for learning the latent variables along with supervised labels is unique. The naive assumption is made deliberately as to the independence of the fields to propose a generalized theory for this class of problems and not to undermine the hierarchical dependencies that could be present in different scenarios. This classifier is able to work with sparse and streaming data. The application to record linkage is able to meet challenges of sparsity, data streams and varying nature of the datasets.

Keywords: Record linkage · Poisson · Gamma · Bayesian · Probabilistic

1 Introduction
Data quality is one of the most important aspects of data management. Incorrect and sloppy datasets often result in erroneous data analytic results, leading to imprecise business decisions. Poor data across businesses and the government cost the U.S. economy $3.1 trillion a year, according to a report by InsightSquared in 2012 [4]. In health care domains, keeping track of patients' health information is vital, and these datasets reside in multiple data sources. All these records are critical to diagnose a disease or prescribe medicine for it, and inaccurate or incorrect data may threaten patient safety [5]. Massive amounts of disparate data sources have to be integrated and matched to support data analyses that can be highly beneficial to businesses, governments, and academia. Record linkage is a process that aims to merge records from different sources that refer to the same entity, a task that only gets harder if the sources do not share a unique identifier. The area of record linkage poses many challenges, such as on-going linkage, storing and handling dynamic data, handling different linkage scenarios, and accommodating ever-increasing and diverse datasets [1]. All of these issues make the record linkage problem very challenging and critical. Efficient algorithms are essential to address this problem [3].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 727–738, 2022. https://doi.org/10.1007/978-3-031-10464-0_49

Traditionally, record linkage consists of two main steps: blocking and matching. In the blocking step, records that potentially match are grouped into the same block. Subsequently, in the matching step, records that have been blocked together are examined to identify those that match. Matching is implemented using either a distance function, which compares the respective field values of a record pair against specified distance thresholds, or a rule-based approach, e.g., "if the surnames and the zip codes match, then classify the record pair as matching" [2]. Deterministic record linkage methods require a single identifier that distinguishes truly linked records (records of the same individual) in the datasets, which may not always be available. To solve this problem, researchers have developed probabilistic record linkage, a set of methods that assumes no single match between variables can identify two records accurately and instead calculates the probability that two records belong to the same client by using multiple pieces of identifying information [8]. One of the most popular probabilistic record linkage methods is the Fellegi-Sunter method, which estimates normalized frequency values of the different features in each record and uses weighted combinations of them [6]. This poses a problem, as the underlying string error rates vary in a non-linear fashion and these weighting schemes may therefore not capture them efficiently. In addition, these linear weighting techniques are mostly ad hoc and not based on a strong pattern recognition theory. Hence, automated linkage processes are one way of ensuring consistency of results and scalability of service. We propose a robust solution that models the probability of the matching records as a Poisson distribution that is learned using a probabilistic graphical model based classifier.
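The traditional two-step pipeline of blocking followed by distance-based matching, as described above, might be sketched as follows. The record fields (`id`, `name`, `zip`), the choice of zip code as the blocking key, and the threshold `max_name_dist` are hypothetical and serve only to make the example concrete.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def block_records(records, key):
    """Blocking step: group records that share the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


def match_pairs(records, key, max_name_dist=2):
    """Matching step: compare only records that fall in the same block."""
    matches = []
    for block in block_records(records, key).values():
        for r1, r2 in combinations(block, 2):
            if edit_distance(r1["name"], r2["name"]) <= max_name_dist:
                matches.append((r1["id"], r2["id"]))
    return matches


# Toy records, invented for illustration.
records = [
    {"id": 1, "name": "jon smith", "zip": "57001"},
    {"id": 2, "name": "john smith", "zip": "57001"},
    {"id": 3, "name": "mary jones", "zip": "57001"},
    {"id": 4, "name": "jon smith", "zip": "90210"},
]
pairs = match_pairs(records, key=lambda r: r["zip"])
```

Record 4 is never compared with record 1 because the two fall into different blocks, which illustrates both the cost savings of blocking and its risk of missing matches when the blocking field itself contains errors.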
Individual features between matching records have a rate parameter that is unique to them. For example, the errors in name pairs and the errors in address pairs will likely have different rate parameters. Each rate parameter is modelled with a Gamma distribution, the conjugate prior of the Poisson distribution. The Negative Binomial distribution arises when the mixing distribution of the Poisson rate is a Gamma distribution.

1.1 Contributions
The following summarizes the contributions of the research described in the paper:

– A probabilistic model for record linkage that utilizes the probability distributions of the underlying string error rates in matching and mismatching records.
– Model training is fast owing to the small number of parameters that need to be learned.
– The model allows the error to vary over time and produces better parameter estimates as the training data grows, which is especially useful when streaming data or only sparse data are available for certain features.
– Results show higher accuracy and F1 score compared to supervised learning as well as unsupervised Fellegi-Sunter based record linkage methods.
– The Negative Binomial distribution could be useful in various areas beyond record linkage where the distribution of the features is Poisson; one example of such an application is RNA sequencing [7].

Probabilistic record linkage is explained in more detail in the next section. The theoretical details of the Negative Binomial classifier are explained in Sect. 3, followed by the results in Sect. 4.
2 Probabilistic Record Linkage
The Fellegi-Sunter method for probabilistic record linkage calculates linkage weights, which are estimated by observing all the agreements and disagreements of the data values of the variables that match. This weight corresponds to the probability that the two records refer to the same entity. Given two records $(R_a, R_b)$ with a set of $n$ common features

\[ R_a \rightarrow [F_{a1}, F_{a2}, \ldots, F_{an}] \tag{1} \]

\[ R_b \rightarrow [F_{b1}, F_{b2}, \ldots, F_{bn}] \tag{2} \]

the comparisons between the two records can be obtained by applying a distance function, such as edit distance, to each pair of matching variables and can be accumulated in a comparison vector

\[ \alpha_c = \{\alpha_{c1}, \alpha_{c2}, \ldots, \alpha_{cn}\} \tag{3} \]

The binary comparison vector is calculated as

\[ \alpha_{ck} = \begin{cases} 1, & \text{if } F_{ak} = F_{bk} \\ 0, & \text{otherwise} \end{cases} \]

The basic idea in the Fellegi-Sunter method is to model the comparison vector as arising from two distributions: one that corresponds to true matching pairs and one that corresponds to true non-matching pairs. For any observed comparison vector $\alpha_c$ in $\Lambda$, the space of all comparisons, the conditional probability of observing $\alpha_c$ given that the pair is a match is $m(\alpha_c) = P(\alpha_c \mid (R_a, R_b) \in M)$, and the conditional probability of observing $\alpha_c$ given that the pair is a non-match is $u(\alpha_c) = P(\alpha_c \mid (R_a, R_b) \in U)$. Here $M$ and $U$ are the set of matches and the set of non-matches, respectively. The weight for each record pair is given as $p_{ab} = \frac{m(\alpha_c)}{u(\alpha_c)}$ [10,11]. Once the probabilities are estimated, the decision rule for the Fellegi-Sunter method is: if the weight of a record pair satisfies $p_{ab} > T_\lambda$, then it is a match, and if $p_{ab} < T_\tau$, then it is a non-match. If $p_{ab}$ lies between $T_\lambda$ and $T_\tau$, the pair is deemed a "possible match".
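The weight and decision rule above can be sketched as follows. The sketch adds one assumption that the text does not spell out, namely that field agreements are conditionally independent, so that $m(\alpha_c)$ and $u(\alpha_c)$ factor into per-field probabilities `m_probs` and `u_probs`. All numeric values and function names are hypothetical.

```python
def comparison_vector(rec_a, rec_b):
    """Binary agreement vector: 1 where the two field values are equal."""
    return [1 if fa == fb else 0 for fa, fb in zip(rec_a, rec_b)]


def likelihood_ratio(alpha, m_probs, u_probs):
    """p_ab = m(alpha) / u(alpha), assuming independent field agreements.

    m_probs[k]: P(field k agrees | pair is a match)      (hypothetical values)
    u_probs[k]: P(field k agrees | pair is a non-match)  (hypothetical values)
    """
    ratio = 1.0
    for agree, m_k, u_k in zip(alpha, m_probs, u_probs):
        ratio *= (m_k / u_k) if agree else ((1 - m_k) / (1 - u_k))
    return ratio


def classify(ratio, t_upper, t_lower):
    """Three-way decision rule with upper and lower thresholds."""
    if ratio > t_upper:
        return "match"
    if ratio < t_lower:
        return "non-match"
    return "possible match"


m_probs, u_probs = [0.9, 0.8], [0.1, 0.2]
alpha = comparison_vector(("smith", "57001"), ("smith", "57002"))
decision = classify(likelihood_ratio(alpha, m_probs, u_probs), t_upper=10, t_lower=0.1)
```

Here the surnames agree and the zip codes disagree, so the ratio is roughly 9 × 0.25 = 2.25, which falls between the two thresholds and yields a "possible match".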
2.1 Error Distribution
The edit distances of matching pairs of strings within the datasets are relatively small compared to the lengths of the strings. On the other hand, in the case of a non-match, the edit distances of pairs of records are quite large compared to the lengths of the records. Given a pair of strings, the number of ways in which they could match is much smaller than the number of ways in which they could mismatch, as any other random string is a mismatch. The underlying binomial distribution that represents the match/mismatch between strings can be approximated by a Poisson distribution, because the probability p of a match between any two given strings is very small while the number n of different string pairs that can exist is very large. Figures 1 and 2 show the edit distances of names and addresses in matching records in the dataset [16]. These Poisson-distributed errors are subject to uncertainty about the underlying error rate $\theta_i$ for the feature $F_i$, which can be considered a latent variable.
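The Poisson approximation to the binomial invoked above can be checked numerically. In the short sketch below, the values of `n` and `p` are hypothetical; they are chosen only to represent a large number of trials with a small success probability.

```python
import math


def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(K = k) for K ~ Poisson(rate)."""
    return rate**k * math.exp(-rate) / math.factorial(k)


n, p = 1000, 0.002  # hypothetical: many trials, small probability
max_gap = max(
    abs(binomial_pmf(k, n, p) - poisson_pmf(k, n * p)) for k in range(11)
)
```

With these values the two probability mass functions agree closely at every count from 0 to 10, and the gap shrinks further as `n` grows with `n * p` held fixed.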
Fig. 1. Error distribution of the name variable for matched records.
Figures 3 and 4 show the distribution of errors over the name and address columns for mismatched records in the dataset [16].

2.2 Limitations
The Fellegi-Sunter method for probabilistic record linkage still relies on likelihood ratios and weights that are ad hoc and are not probability measures. In addition, the Fellegi-Sunter method is very dependent on the accuracy of the estimates of the match and mismatch probabilities. Various factors, such as misspecified model assumptions, lack of information, and inappropriate choices in earlier steps of the record linkage process, can cause a loss of accuracy in the estimates. The estimation of these parameters cannot be accurate when one of the categories (especially matches) is too rare. Generally speaking, the number of matches should be large enough, more than 5% of the overall set. This is also a classical problem faced by supervised learning algorithms for record linkage: the imbalance of classes in a labeled dataset can cause low predictive accuracy for the smaller class [15].

In 2014, Toan Ong showed some good results in extending existing record linkage methods to handle missing field values in an efficient and effective manner [14]. An unsupervised Bayesian approach to graphical record linkage has also been proposed, which overcomes many obstacles encountered by previous methods. This unsupervised algorithm works well for unlabeled data. However, when large amounts of labels are available, it is necessary to leverage them, which motivates a supervised probabilistic linkage algorithm. The proposed supervised negative binomial classifier is explained in more detail in the next section.

Fig. 2. Error distribution of the address variable for matched records.

Fig. 3. Error distribution of the name variable for mismatched records.

Fig. 4. Error distribution of the address variable for mismatched records.
3 Supervised Negative Binomial Classifier
The probability distribution of the error $X_i$ of a matching feature $F_i$ between a pair of records is found to be Poisson and is written as

\[ p(X_i = x \mid \theta) = \frac{\theta^{x} e^{-\theta}}{x!} \tag{4} \]

The Poisson error rate $\theta$ is assumed to follow a Gamma distribution and is written as

\[ p(\theta) = \mathrm{Gamma}(\alpha, \beta) \tag{5} \]

The Gamma distribution is a natural fit as a conjugate prior to the Poisson distribution [9,12]. The method of moments is used to estimate the shape parameter $\alpha$ and the rate parameter $\beta$ of the Gamma distribution from the ground truth. The first and the second moments of the error $X_i$ can be expressed in terms of the shape and the rate parameters as

\[ \mu_1 = \left( \frac{\alpha}{\beta} \right) \tag{6} \]

\[ \mu_2 = \left( \frac{\alpha}{\beta} + \frac{\alpha}{\beta^{2}} \right) \tag{7} \]

Solving the above equations leads to

\[ \alpha = \left( \frac{\mu_1^{2}}{\mu_2 - \mu_1^{2}} \right) \tag{8} \]

\[ \beta = \left( \frac{\alpha}{\mu_1} \right) \tag{9} \]

The first and the second moments are calculated as

\[ \mu_1 = E(X_i) \tag{10} \]

\[ \mu_2 = E(X_i^{2}) = \mu_1^{2} + \sigma^{2}(X_i) \tag{11} \]

$E(X_i)$ and $\sigma^{2}(X_i)$ are the expectation and the variance of $X_i$, respectively. The posterior probability of $X_i$ can be calculated as

\[ p(X_i) = \frac{p(X_i \mid \theta_i)\, p(\theta_i)}{p(\theta_i \mid X_i)} \tag{12} \]

which can be expressed as

\[ p(X_i) = \frac{\mathrm{Poisson}(X_i \mid \theta_i)\, \mathrm{Gamma}(\theta_i \mid \alpha, \beta)}{\mathrm{Gamma}(\theta_i \mid X_i)} \tag{13} \]

Substituting the pdfs of the Poisson and Gamma distributions into the above equation leads to

\[ p(X_i) = \binom{\alpha + X_i - 1}{X_i} \left( \frac{\beta^{\alpha}}{(1 + \beta)^{\alpha + X_i}} \right) \frac{\theta^{X_i + \alpha - 1} e^{-\theta(\beta + 1)}}{\theta^{X_i + \alpha - 1} e^{-\theta(\beta + 1)}} \tag{14} \]

\[ p(X_i) = \binom{\alpha + X_i - 1}{X_i} \left( \frac{\beta}{\beta + 1} \right)^{\alpha} \left( \frac{1}{\beta + 1} \right)^{X_i} \tag{15} \]
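As a rough numerical illustration, the moment estimates of Eqs. (8)–(11), taken as stated, and the closed form of Eq. (15) can be coded directly. The function names and the toy error counts are hypothetical; the binomial coefficient is evaluated through log-gamma functions so that a non-integer $\alpha$ is handled.

```python
import math


def fit_gamma_moments(errors):
    """Estimate (alpha, beta) from observed errors via Eqs. (8)-(11)."""
    n = len(errors)
    mu1 = sum(errors) / n                 # first moment, Eq. (10)
    mu2 = sum(x * x for x in errors) / n  # second moment, Eq. (11)
    variance = mu2 - mu1**2
    if variance <= 0:
        raise ValueError("errors must have positive variance")
    alpha = mu1**2 / variance             # Eq. (8)
    beta = alpha / mu1                    # Eq. (9)
    return alpha, beta


def neg_binomial_pmf(x, alpha, beta):
    """Eq. (15): C(alpha+x-1, x) * (beta/(beta+1))^alpha * (1/(beta+1))^x."""
    log_coeff = math.lgamma(alpha + x) - math.lgamma(alpha) - math.lgamma(x + 1)
    log_p = log_coeff + alpha * math.log(beta / (beta + 1)) - x * math.log(beta + 1)
    return math.exp(log_p)


# Toy edit-distance counts for one feature of matched pairs (invented).
errors = [0, 1, 1, 2, 0, 3, 1, 0]
alpha, beta = fit_gamma_moments(errors)
probs = [neg_binomial_pmf(x, alpha, beta) for x in range(4)]
```

For this toy sample the mean and variance are both 1, so the sketch yields `alpha = beta = 1` and the pmf reduces to $(1/2)^{x+1}$, which gives a quick sanity check of Eq. (15).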
This has a known form, the Negative Binomial distribution, which is the posterior predictive distribution of the Gamma-Poisson pair.

3.1 Matching of Records
To compare the record pair $(R_a, R_b)$, we look only at the $n$ fields $(1, 2, 3, \ldots, n)$ that are common between the records. The record pair could potentially contain features that are not shared, such as a zip code present in one record and not in the other. Assuming the matching features are independent, an assumption that is made for generality of the theory and is by no means the only possible formulation, the probability of match for the record pair, which is a joint probability across all of the matching features, can be written as

\[ p[(X_1), (X_2), \ldots, (X_n)] \tag{16} \]

Writing this in terms of the Negative Binomial distribution,

\[
\begin{aligned}
H_{ab} &= \binom{\alpha_1 + X_1 - 1}{X_1} \binom{\alpha_2 + X_2 - 1}{X_2} \cdots \binom{\alpha_n + X_n - 1}{X_n} \\
&\quad \times \left( \frac{\beta_1}{\beta_1 + 1} \right)^{\alpha_1} \left( \frac{\beta_2}{\beta_2 + 1} \right)^{\alpha_2} \cdots \left( \frac{\beta_n}{\beta_n + 1} \right)^{\alpha_n} \\
&\quad \times \left( \frac{1}{\beta_1 + 1} \right)^{X_1} \left( \frac{1}{\beta_2 + 1} \right)^{X_2} \cdots \left( \frac{1}{\beta_n + 1} \right)^{X_n} \\
&= \prod_{i=1}^{n} \mathrm{Neg.Bin}(\alpha_i, \beta_i, x_{i,j})
\end{aligned}
\tag{17}
\]
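The product in Eq. (17) might be evaluated as in the sketch below. Working in the log domain is a choice made here to avoid numerical underflow when many features are multiplied; comparing log H against a log threshold is equivalent to comparing H against the threshold itself. The per-feature parameters, the error counts, and the threshold are all hypothetical.

```python
import math


def log_neg_binomial(x, alpha, beta):
    """Log of the Negative Binomial pmf of Eq. (15)."""
    log_coeff = math.lgamma(alpha + x) - math.lgamma(alpha) - math.lgamma(x + 1)
    return log_coeff + alpha * math.log(beta / (beta + 1)) - x * math.log(beta + 1)


def log_match_score(errors, params):
    """log H_ab of Eq. (17): sum of per-feature log pmfs (independence assumed)."""
    return sum(log_neg_binomial(x, a, b) for x, (a, b) in zip(errors, params))


def is_match(errors, params, log_threshold):
    """Declare a match when log H_ab exceeds the log of the chosen threshold."""
    return log_match_score(errors, params) > log_threshold


# Hypothetical (alpha, beta) for two features: name, address.
params = [(2.0, 2.0), (3.0, 1.5)]
log_threshold = math.log(1e-4)  # hypothetical; would be tuned on validation data

close_pair = [1, 2]   # small edit distances on both fields
far_pair = [9, 14]    # large edit distances on both fields
close_is_match = is_match(close_pair, params, log_threshold)
far_is_match = is_match(far_pair, params, log_threshold)
```

Under these assumed parameters the pair with small edit distances scores well above the threshold and the pair with large edit distances falls well below it.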
Given a dataset that contains a mix of matching and non-matching records, the matching records are used to estimate the parameters of the Poisson error rate for each feature. Only the common features between the matching records are used. Once learned, the negative binomial classifier can be used to estimate whether a given pair of records match by calculating the posterior probability. Given a new pair of records $(R_p, R_q)$, the pair is a match if $H_{pq} > \theta$. The threshold $\theta$ (a decision threshold, not to be confused with the Poisson rate above) is chosen using the training/validation data. The parameters of the trained Poisson-Gamma model can be continuously updated with each new data point in a streaming fashion. Given two new data points (two pairs of records), one of which, with error $k_1$, is a match and so follows a Poisson distribution, and the other, with error $k_2$, is a mismatch and follows an arbitrary probability distribution, the updated Poisson-Gamma parameters are given by

\[ \alpha_i = \alpha_i + k_1 \tag{18} \]

\[ \beta_i = \beta_i + 1 \tag{19} \]
Also, it may not be necessary to update the parameters at every step; depending on the application, they can instead be updated in chunks. In that case, a chunk of new data is added to the stream and the underlying α and β parameters are recomputed, which provides better estimates for newly arriving test data points. In the big data era, the velocity of data updates is often high, quickly making previous linkage results obsolete. A learning framework that can incrementally and efficiently update linkage results when data updates arrive is essential [13]. The advantage of this method is that it allows the parameters to be updated in an online streaming fashion instead of repeatedly training over all of the large datasets, as in the case of the standard probabilistic record linkage algorithms. This means that older datasets can be discarded: only the new data needs to be trained on to update the parameters.
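The streaming and chunked updates described above can be sketched as follows. This is a minimal illustration that assumes the chunked recomputation is the standard Gamma-Poisson conjugate update (α gains the summed edit distances, β gains the number of matching pairs); the function names and the numbers are hypothetical.

```python
def update_single(params, distances):
    """Eqs. (18)-(19): fold one matching pair into every field's (alpha, beta)."""
    return [(alpha + k, beta + 1) for (alpha, beta), k in zip(params, distances)]

def update_chunk(params, chunk):
    """Fold a chunk of matching pairs in one step.

    alpha gains the summed distances of the field, beta gains the pair count.
    """
    new_params = []
    for i, (alpha, beta) in enumerate(params):
        total = sum(distances[i] for distances in chunk)
        new_params.append((alpha + total, beta + len(chunk)))
    return new_params

# Hypothetical current parameters for two fields and three new matching pairs.
params = [(2, 1), (1, 3)]
chunk = [[0, 1], [2, 0], [1, 1]]

stepwise = params
for distances in chunk:
    stepwise = update_single(stepwise, distances)

# Updating pair by pair and updating in one chunk give the same parameters.
print(stepwise == update_chunk(params, chunk))
```

Because the two routes agree, the choice between per-pair and chunked updates affects only how often the parameters are refreshed, not the values they reach.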
4 Experiments
The proposed algorithm is tested on a record linkage problem using the Restaurant dataset, which consists of tables of names and addresses provided by the restaurant review companies Zagat and Fodors. The dataset was obtained from the RIDDLE data repository [16]. There are a total of 191 records in the dataset, of which around 60% (112 records) are matches and the rest (79 records) are non-matches. The ground truth consists of false pairs of data along with matching pairs. The dataset was split into training (70%), validation (10%) and testing (20%) data. The parameters of the Negative Binomial classifier for each pair of features are learned from the matching records. For aggressive error biasing, a Gaussian distribution can be used to fit the errors of the non-matching pairs, but this is not implemented in the version of the research described in this paper. The matches were randomized, with no criterion on which to filter the feature set. This helps absorb random errors that could happen during the filtering operation, as
errors can happen at any string position. The training data is used to learn the Negative Binomial classifier and the validation data to pick an optimum threshold. The ROC curve, with its AUC, and the precision-recall curve on the validation data are shown in Figs. 5 and 6.
Fig. 5. The AUC on the validation data.
Fig. 6. Precision - recall curve on the validation data.
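The threshold selection on the validation data could look like the sketch below. The paper does not state which criterion defines the optimum threshold, so maximizing the F1 score is an assumption made here for illustration; the scores and labels are hypothetical.

```python
def f1_at(threshold, scores, labels):
    """F1 score when pairs with score > threshold are declared matches."""
    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pick_threshold(scores, labels):
    """Return the candidate threshold with the best validation F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(t, scores, labels))

# Hypothetical validation scores H_pq and ground-truth labels (1 = match).
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
theta = pick_threshold(scores, labels)
print(theta)  # 0.1 for this toy data
```

Using the observed scores as candidate thresholds keeps the search exact for the validation set; a different criterion (for example a fixed precision target) would change only the `key` function.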
4.1 Performance on the Test Data
The performance of the trained Negative Binomial classifier is compared with two other methods: an unsupervised Fellegi-Sunter method and a supervised Support Vector Machine (SVM) classifier. The performance of the three methods is analyzed on the following metrics: F1 score, precision, recall and accuracy. The results are tabulated in Table 1.
Table 1. Performance metrics on the restaurant dataset

Algorithm                      F1      Precision  Recall  Accuracy
Negative Binomial Classifier   91.6%   99.5%      84.6%   93.1%
Fellegi-Sunter Method          62%     99.2%      45%     44.8%
SVM Classifier                 89%     99.5%      80.6%   92.1%
Both the Negative Binomial classifier and the supervised SVM classifier clearly outperform the unsupervised Fellegi-Sunter probabilistic record linkage method. This is expected, as the Fellegi-Sunter method is an unsupervised technique that relies on accurate probability estimates and suffers from sparsity in matching records. However, even with the very small number of matching records used to train the model, the proposed technique also outperforms the SVM classifier. The Negative Binomial classifier has a higher accuracy than both the Fellegi-Sunter method and the SVM classifier. While precision is high for all three methods, the Negative Binomial classifier has higher recall and a higher F1 score than the SVM classifier. The high F1 score suggests that the proposed method not only finds matching records precisely (high precision) but is also robust, as it recovers a significant number of the matches (high recall). The model can be generalized because it learns the underlying probability distribution of the features. For example, the parameters learned for the names or addresses in one dataset can be applied to names and addresses in a different, related dataset. For best results, the parameters should be learned on a large dataset and then used to evaluate data from a smaller dataset.
5 Conclusion and Future Work
This paper presents a classifier that is based on the principles of Bayesian inference, is robust and can work with small datasets. The scientific value of the model is that it exploits the intrinsic probability distribution of the edit distances between features in matching records to build the classifier. It also provides a framework for researchers to explore other distributions, as long as the distributions of the underlying latent variables are conjugate priors. The classifier performed better than both the unsupervised Fellegi-Sunter method and the supervised Support Vector Machine classifier on the record linkage task. In addition, the Bayesian updating of the parameters of the Negative Binomial classifier in a streaming fashion, through a simple procedure, adds efficiency to the learning algorithm. The Negative Binomial classifier can also be applied to other areas, such as RNA sequencing, and to datasets with decay processes other than Poisson.
One of the limitations of the model is the independence assumption on the features, which does not always hold in real-world situations. This can be mitigated to an extent by learning mixture distributions. Researchers in record linkage can explore variants of the theory that include hierarchical arrangements of features, in which dependencies such as those between zip codes and street names can be set to further refine the model. The probability model itself can also be made hierarchical by assuming that the parameters of the gamma distribution (α, β) follow some other probability distribution, which would make them hyper-priors to the original Poisson error distribution. We intend to continue to build out the theory by testing other probability distributions for the mismatched records, in which case any new pair of records could be classified as either a match or a mismatch based on its likelihood scores under the two distributions. In addition, the theory will be tested on other, larger datasets with different types of features, and the transfer learning ability of the model will also be tested.
References
1. Boyd, J.H., Randall, S.M., Ferrante, A.M., Bauer, J.K., Brown, A.P., Semmens, J.B.: Technical challenges of providing record linkage services for research. BMC Med. Inform. Decis. Mak. 14(1), 23 (2014)
2. Karapiperis, D., Gkoulalas-Divanis, A., Verykios, V.S.: Summarization algorithms for record linkage. In: EDBT (2018)
3. Mamun, A.A., Aseltine, R., Rajasekharan, S.: Efficient record linkage algorithms using complete linkage clustering. PLoS ONE 11(4), e0154446 (2016)
4. Ilyas, I.F., Chu, X.: Trends in cleaning relational data: consistency and deduplication. Found. Trends Databases 5(4), 281–393 (2015)
5. Kerr, K., Norris, T., Stockdale, R.: Data quality information and decision making: a healthcare case study. In: Proceedings of the 18th Australasian Conference on Information Systems Doctoral Consortium, pp. 5–7 (2007)
6. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)
7. Dong, K., Zhao, H., Tong, T., Wan, X.: NBLDA: negative binomial linear discriminant analysis for RNA-Seq data. BMC Bioinform. 17(1), 369 (2016)
8. Gu, L., Baxter, R., Vickers, D., Rainsford, C.: Record linkage: current practice and future directions. Technical report, CSIRO Mathematical and Information Sciences (2003)
9. Gelman, A., Carlin, J., Stern, H., Rubin, D.: Bayesian Data Analysis, 2nd edn. Chapman & Hall Texts in Statistical Science (2003)
10. McVeigh, B.S., Murray, J.S.: Practical Bayesian inference for record linkage. Technical report, Carnegie Mellon University (2017)
11. Sharp, S.: Deterministic and probabilistic record linkage. Alternative Sources Branch, National Records of Scotland
12. Minka, T.P.: Estimating a gamma distribution. Technical report, Microsoft Research, Cambridge, UK (2002)
13. Gruenheid, A., Dong, X.L., Srivastava, D.: Incremental record linkage. Proc. VLDB Endow. 7(9), 697–708 (2014)
14. Ong, T.C., Mannino, M.V., Schilling, L.M., Kahn, M.G.: Improving record linkage performance in the presence of missing linkage data. J. Biomed. Inform. 52, 43–54 (2014)
15. Hurwitz, A.M.: Record linkage sharing using labeled comparison vectors and a machine learning domain classification trainer. US Patent, US9576248B2
16. Tejada, S.: Restaurant - a collection of restaurant records from the Fodor’s and Zagat’s restaurant guides that contains 112 duplicates. Includes both segmented and unsegmented versions. https://www.cs.utexas.edu/users/ml/riddle/data.html
A Recipe for Low-Resource NMT
Eryk Wdowiak
Arba Sicula, Brooklyn, NY, USA
[email protected]
http://www.arbasicula.org
Abstract. Incorporating theoretical information into the dataset, tokenization and subword splitting improves translation quality in low-resource settings. Previous research has shown that one can train a reasonably good translation model by using small subword vocabularies and high dropout parameters, and that backtranslation and multilingual translation further improve translation quality. But just as a textbook helps a student learn a language, it also helps a machine learn a language. Theoretical information allows us to make more efficient use of a given dataset and train a better model.

Keywords: Neural machine translation · Subword-splitting · Low-resource languages · Sicilian language

1 Introduction
Last year, several researchers began an important discussion about large language models [2]. This paper shows what one can accomplish with small language models. Just as Transformers [16] scale upwards, they also scale downwards, providing meaningful models that serve people in their preferred language.

Our innovation is to use existing methods more efficiently. Instead of scaling upwards to achieve performance gains, we achieved good translation quality by incorporating theoretical information into the dataset, the tokenization and the subword splitting. Given our experience, this paper proposes modeling the language to make more efficient use of a given dataset and to offer the promise of language models to all the world’s people.

Our goal was to create a neural machine translator for the Sicilian language. Sicilian provides a good case study in low-resource machine translation for several reasons. First, the language has been continuously recorded since the Sicilian School of Poets joined the imperial court of Frederick II in the 13th century. And in our times, Arba Sicula has spent the past 43 years translating Sicilian literature into English (among its numerous activities to promote the Sicilian language). In the course of its work with the many dialects of Sicilian, the organization established a “Standard Sicilian,” a single form of the language.

To train our translator, we had to make better use of limited amounts of parallel text than previous researchers had. Just a few years ago, [10] calculated learning curves for English-to-Spanish translation. At 377,000 words, their neural machine translation model only achieved a BLEU score of 1.6. More recently, [13] improved upon their results by using subword-splitting [12] to train a neural German-to-English model that scored 16.6 on a 100,000 word dataset. And we improved upon their results by incorporating theoretical information into our modeling strategy.

With just 16,945 translated sentence pairs containing 266,514 Sicilian words and 269,153 English words, our Tradutturi Sicilianu achieved a BLEU score of 25.1 on English-to-Sicilian translation and 29.1 on Sicilian-to-English. Then we augmented our dataset with backtranslation [11] and multilingual translation [9], which further increased our BLEU scores to 35.0 on English-to-Sicilian and to 36.8 on Sicilian-to-English. That’s a good result for a small amount of parallel text. It shows what one can accomplish by using theory to model the language.

The next section describes our data sources (Sect. 2). The section on subword splitting (Sect. 3) explains our method of biasing subwords towards theoretical stems and desinences. Then our “recipe” (Sect. 4) describes our method of training a translator on little parallel text. Finally, the section on multilingual translation (Sect. 5) explains how incorporating a trilingual “bridge” [6] of textbook exercises into our dataset further improves translation quality. And the last section concludes (Sect. 6).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 739–746, 2022. https://doi.org/10.1007/978-3-031-10464-0_50
2 Data Sources
Our first ingredient is high-quality parallel text. Standard Sicilian provided the consistency necessary to create a high-quality corpus of Sicilian-English parallel text. With that good start, we avoided the dialect challenges faced by [8], who note that variations in pronunciation coupled with the lack of a written standard cause extreme inconsistency in spelling. Consistent spelling increases word frequencies, enabling us to train a neural machine translation model on a small corpus of parallel text.

To seed this project, Arthur Dieli kindly provided 34 translations of Giuseppe Pitrè’s Sicilian Folk Tales and lots of encouragement. And Arba Sicula, which has been translating Sicilian literature into English since 1979, contributed its bilingual journal of Sicilian history, language, literature, art, folklore and cuisine.

Most of our data comes from Arba Sicula articles. Some parallel text comes from Dr. Dieli’s translations of Pitrè’s Folk Tales. And some comes from translations of the homework exercises in the Mparamu lu sicilianu [5] and Introduction to Sicilian Grammar [3] textbooks. Although it only makes up a small portion of the dataset, adding the textbook examples yielded large improvements in translation quality on a test set drawn only from Arba Sicula articles. Just as a grammar book helps a human learn in a systematic way, it also helps a machine learn in a systematic way. “Language models are few-shot learners” [4]. The textbook exercises provided the few examples of each grammatical element necessary to train a good model.
3 Subword Splitting
According to a recent case study of best practices for low-resource neural machine translation [13], neural models can achieve better translation quality than phrase-based statistical machine translation. In their best practices, the authors suggest using a smaller neural network with fewer layers, smaller batch sizes and a larger dropout parameter. Importantly, their largest improvements in translation quality (as measured by BLEU score) came from the application of a byte-pair encoding [12] that reduced the vocabulary from 14,000 words to 2000 words.

Our experience suggests that biasing the subword distribution toward theoretical stems and desinences further improves translation quality. For example, the English present tense only has two forms – speak and speaks – while the Sicilian present tense has six – parru, parri, parra, parramu, parrati and parranu. But upon splitting them into subwords, parr+ matches speak+, while the Sicilian verb endings (+u, +i, +a, +amu, +ati and +anu) match the English pronouns. So subword splitting should allow us to represent many different word forms with a much smaller vocabulary and should allow the translator to learn rare words and unknown words. For example, even if “jo manciu” (“I eat”) does not appear at all in the dataset, but forms like “jo parru” (“I speak”) and “iddu mancia” (“he eats”) do appear, then subword splitting should allow the translator to learn “jo manciu” (“I eat”).

In practice, achieving that effect required us to bias the learned subword vocabulary towards the stems and desinences one finds in a textbook. Specifically, we added a unique list of words from the Dieli Dictionary and the inflections of verbs, nouns and adjectives from Chiù dâ Palora to the Sicilian data. Because each word was only added once, none of them affected the distribution of whole words. But once the words were split, they greatly affected the distribution of subwords, filling it with stems and suffixes.
So the subword vocabulary that the machine learns is similar to the theoretical stems and desinences of a textbook. And the translation model learns to translate in a more theoretic manner, making it more generalizable to unseen data. Within a given dataset, theoretical splitting increased our BLEU scores from 20.3 to 22.4 on English-to-Sicilian and from 21.4 to 24.1 on Sicilian-to-English.
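The compositional effect described in this section can be illustrated with a small Python sketch. It is not the byte-pair encoding used for the translator: it is a simplified stem-and-ending lookup, and the split points (for example “manci” + “a”) are assumptions made for the example.

```python
def segment(word, stems, endings):
    """Split a word into stem+ and +ending if both pieces are known subwords."""
    # Try longer stems first so that the longest matching stem wins.
    for stem in sorted(stems, key=len, reverse=True):
        ending = word[len(stem):]
        if word.startswith(stem) and ending in endings:
            return [stem + "+", "+" + ending]
    return [word]  # fall back to the whole word

# Word forms assumed to appear in the training data, with hypothetical splits.
seen = {"parru": ("parr", "u"), "parra": ("parr", "a"), "mancia": ("manci", "a")}
stems = {stem for stem, _ in seen.values()}
endings = {ending for _, ending in seen.values()}

# "manciu" never appears as a whole word, but both of its pieces were seen.
print(segment("manciu", stems, endings))  # ['manci+', '+u']
```

The unseen form is covered because the stem was seen in “mancia” and the ending in “parru”, which is the generalization that a vocabulary rich in stems and desinences is meant to encourage.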
4 A Recipe for Low-Resource NMT
Even though we only have a little parallel text, we can still develop a reasonably good neural machine translator. We just have to train a smaller model for the smaller dataset. As shown in Table 1, we trained models of three different sizes, all of which were smaller than the defaults provided by the Sockeye toolkit [7]. And just as we incorporated theoretical information into our dataset, we also incorporated theory into our modeling strategy. In this section, we incorporate insights from statistical theory because in a low-resource context, we must be careful to avoid over-fitting.
Table 1. Model sizes

                  Defaults   Our models   Larger   Many-to-many
Layers            6          3            4        4
Embedding size    512        256          384      512
Model size        512        256          384      512
Attention heads   8          4            6        8
Feed forward      2048       1024         1536     2048
Training a large model on a small dataset is comparable to estimating a regression model with a large number of parameters on a dataset with few observations: it leaves you with too few degrees of freedom. The model thus becomes over-fit and does not make good predictions. Reducing the vocabulary with subword-splitting, training a smaller network and setting high dropout parameters all reduce over-fitting. And self-attentional neural networks also reduce over-fitting because (compared to recurrent and convolutional networks) they are less complex. They directly model the relationships between words in a pair of sentences.

This combination of splitting, dropout and self-attention is an implementation of the best practices discussed above [13], but using the Transformer model [16] from the Sockeye toolkit [7]. It achieved a BLEU score of 25.1 on English-to-Sicilian translation and 29.1 on Sicilian-to-English with only 16,945 lines of parallel training data containing 266,514 Sicilian words and 269,153 English words.

In their best practices study, the authors found that reducing the vocabulary to 2000 subwords yielded the largest improvements in translation quality. But their most successful training also occurred when they set high dropout parameters [13]. During training, dropout randomly shuts off a percentage of units (by setting them to zero), which effectively prevents the units from adapting to each other. Each unit therefore becomes more independent of the others because the model is trained as if it had a smaller number of units, thus reducing over-fitting [14].

Subword-splitting and high dropout parameters helped us achieve better than expected results with a small dataset. And the Transformer model pushed our BLEU scores into the double digits. Compared to recurrent neural networks, the self-attention layers in the Transformer model more easily learn the dependencies between words in a sequence because the self-attention layers are less complex.
Recurrent networks read words sequentially and employ a gating mechanism to identify relationships between separated words in a sequence. By contrast, self-attention examines the links between all the words in the paired sequences and directly models those relationships. It’s a simpler approach.
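The dropout mechanism described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration of the common “inverted dropout” variant, in which surviving units are rescaled during training; it is an assumption for the example and not a description of how the Sockeye toolkit implements dropout.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)      # fixed seed so the sketch is repeatable
hidden = [1.0] * 10         # hypothetical layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)              # each entry is either 0.0 or 2.0
```

With a rate of 0.5, roughly half of the units are silenced on each training step, so no unit can rely on a particular neighbour being present.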
Table 2. Datasets and results

                                        Word count (in tokens)           BLEU score
Dataset           Subwords    Lines     Sicilian   English    Italian    En-Sc    Sc-En
20                2,000       7,721     121,136    121,892    –          11.4     12.9
21                2,000       8,660     146,370    146,437    –          12.9     13.3
23                3,000       12,095    171,278    175,174    –          19.6     19.5
24                3,000       13,060    178,714    183,736    –          19.6     21.5
25                3,000       13,392    185,540    190,538    –          21.1     21.2
27                3,000       13,839    190,072    195,372    –          22.4     24.1
28                3,000       14,494    196,911    202,652    –          22.5     25.2
29                3,000       16,591    258,730    261,474    –          24.6     27.0
30                3,000       16,945    266,514    269,153    –          25.1     29.1

30                5,000       16,829    261,421    264,242    –          27.7     –
  +back                       +3,251    +92,141    –          –

30                Sc: 5,000   16,891    262,582    266,740    –          19.7     26.2
  Books           En: 7,500   32,804    –          929,043    838,152    35.1*    34.6*
  +back           It: 5,000   +3,250    +92,146    –          –

33                Sc: 5,000   12,357    237,456    236,568    –          35.0*    36.8*
  Books           En: 7,500   28,982    –          836,757    755,196
  +back En/It-Sc  It: 5,000   +3,250    +92,146    –          –          It-Sc    Sc-It
  +back Sc-It                 +3,250    –          –          +84,657    36.5†    30.9†
  textbook                    4,660     30,244     35,173     –
  textbook                    4,660     30,244     –          29,855
  textbook                    4,660     –          35,173     29,855

* larger model; † many-to-many model.
The textbook exercises form a trilingual “bridge,” the strategy proposed by [6].
Combining these three features – small subword vocabularies, high dropout parameters and self-attention – yields a trained model that makes relatively good predictions despite being trained on limited amounts of parallel text because they reduce over-fitting.
5 Multilingual Translation
Our discussion so far has focused on a dataset of Sicilian-English parallel text. This section augments our dataset with parallel text in other languages to enable multilingual translation [9] and improve translation quality. In our case, we can obtain Sicilian-English parallel text from the issues of Arba Sicula but finding Sicilian-Italian parallel text is difficult.
Nonetheless, we trained a model to translate between Sicilian and Italian without any Sicilian-Italian parallel text at all (i.e. “zero-shot” translation) by including Italian-English parallel text in our dataset. Then, to improve translation quality between Sicilian and Italian, we implemented a “bridging strategy” [6] by adding Sicilian-Italian-English homework exercises to our dataset.

It’s an example of transfer learning. In our case, as the model learns to translate from Italian to English, it also learns to translate from Sicilian to English. And as the model learns to translate from English to Italian, it also learns how to translate from English to Sicilian.

More parallel text is available for some languages than for others, however, so [9] also studied the effect of this imbalance on translation quality and found that oversampling low-resource language pairs improves their translation quality, but at the expense of quality among high-resource pairs. Importantly, however, that comparison with bilingual translators holds constant the number of parameters in the model. Training a larger model can improve translation quality across the board [1].

Our experience was consistent with these findings. As shown in Table 2, holding model size constant reduced translation quality when we added the Italian-English subset of Farkas’ Books data (from the OPUS project [15]) to our dataset. So to push our BLEU scores into the thirties, we trained a larger model – an appropriately sized model.

In a broader effort, another study developed a “bridging strategy” to collect data for and to train a model that can directly translate between 100 languages. To overcome the limitations of English-centric data, the authors strategically selected pairs to mine data for, based on geography and linguistic similarity. Their approach yielded large improvements in translation quality in non-English directions, while matching translation quality in English directions [6].
A similar strategy improved our translation quality between Sicilian and Italian. Taking a theoretic approach, we bridged Sicilian, English and Italian by translating 4,660 homework exercises from the Mparamu lu sicilianu [5] and Introduction to Sicilian Grammar [3] textbooks. As shown in Table 2, this technique yielded translation quality between Sicilian and Italian that’s almost as good as translation quality between Sicilian and English, for which we have far more parallel text.
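To make the bridging idea concrete, the sketch below expands one trilingual exercise into directed training pairs. It assumes the target-language token convention of multilingual translation [9], in which a tag such as `<2en>` is prepended to the source sentence; the tag format, the function name and the choice to generate all six directions are assumptions for illustration, and the Italian sentence is our own example.

```python
from itertools import permutations

def tagged_pairs(triple):
    """Expand one trilingual exercise into all directed training pairs."""
    pairs = []
    for src, tgt in permutations(triple, 2):
        # Prepend a target-language token to the source sentence.
        pairs.append((f"<2{tgt}> {triple[src]}", triple[tgt]))
    return pairs

# One hypothetical textbook exercise in Sicilian, English and Italian.
exercise = {"sc": "jo parru", "en": "I speak", "it": "io parlo"}
pairs = tagged_pairs(exercise)
for source, target in pairs:
    print(source, "->", target)
```

Each exercise therefore contributes direct Sicilian-Italian examples in both directions, which is the kind of supervision that the zero-shot setting lacks.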
6 Conclusion
Our recipe for low-resource neural machine translation – theoretical subword-splitting, high dropout parameters and self-attention – yields a trained model that makes relatively good predictions. Adding backtranslation and multilingual translation improves translation quality even more. And we improved upon our zero-shot result by bridging the three languages with textbook exercises. Most importantly, we achieved these good results by training a small model. Instead of scaling upwards, we used theory to make more efficient use of a dataset and help a small model learn a good set of translation rules.
We hope our experience encourages practitioners to model the language and to develop language models for all the world’s people.

Acknowledgments. Arba Sicula, Gaetano Cipolla and Arthur Dieli developed the resources that made this project possible. I would like to thank them for their support and encouragement. Prof. Cipolla helped me learn Sicilian and he also helped me develop this recipe for low-resource neural machine translation. We thought about the problem together. He encouraged me to incorporate theoretical information into the model and that’s why we got good results. Dr. Dieli seeded this project with his vocabulary list and translations of Pitrè’s Folk Tales. He helped me get started. And he and his family gave me a lot of support and encouragement. This project is dedicated to his memory. Finally, I would like to thank Arba Sicula for the language resources that we used to develop the dictionary and translator. And I would like to thank the organization and its members for their sponsorship and development of Sicilian language and culture. Their poetry made this project beautiful. Grazzi!
References
1. Arivazhagan, N., et al.: Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. arXiv preprint arXiv:1907.05019 (2019)
2. Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623 (2021). https://doi.org/10.1145/3442188.3445922
3. Bonner, J.K.: Introduction to Sicilian Grammar. Legas, Brooklyn (2001)
4. Brown, T.B., et al.: Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165 (2020)
5. Cipolla, G.: Learn Sicilian. In: Mparamu lu Sicilianu. Legas, Mineola (2013)
6. Fan, A., et al.: Beyond English-Centric Multilingual Machine Translation. arXiv preprint arXiv:2010.11125 (2020)
7. Hieber, F., et al.: Sockeye: A Toolkit for Neural Machine Translation. arXiv preprint arXiv:1712.05690 (2017)
8. Hollenstein, N., Aepli, N.: Compilation of a Swiss German dialect corpus and its application to PoS tagging. In: Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects, pp. 85–94 (2014). https://aclanthology.org/W14-5310.pdf
9. Johnson, M.S., et al.: Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. arXiv preprint arXiv:1611.04558 (2016)
10. Koehn, P., Knowles, R.: Six Challenges for Neural Machine Translation. arXiv preprint arXiv:1706.03872 (2017)
11. Sennrich, R., Haddow, B., Birch, A.: Improving Neural Machine Translation Models with Monolingual Data. arXiv preprint arXiv:1511.06709 (2015)
12. Sennrich, R., Haddow, B., Birch, A.: Neural Machine Translation of Rare Words with Subword Units. arXiv preprint arXiv:1508.07909 (2016)
13. Sennrich, R., Zhang, B.: Revisiting Low-Resource Neural Machine Translation: A Case Study. arXiv preprint arXiv:1905.11901 (2019)
14. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(56), 1929–1958 (2014). http://jmlr.org/papers/v15/srivastava14a.html
15. Tiedemann, J.: Parallel data, tools and interfaces in OPUS. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (2012). https://opus.nlpl.eu/
16. Vaswani, A., et al.: Attention Is All You Need. arXiv preprint arXiv:1706.03762 (2017)
Natural Language Processing Using Database Context
Zheni Mincheva, Nikola Vasilev, Anatoliy Antonov, and Ventsislav Nikolov
Eurorisk Systems Ltd., 31 General Kiselov Street, 9002 Varna, Bulgaria
{jmincheva,nvasilev,antonov,vnikolov}@eurorisksystems.com

Abstract. The usage of natural language in the industry has become more prevalent in recent years. Nowadays it is much easier to operate complex infrastructures using natural language. This subfield of artificial intelligence is becoming more widespread every year. Feature semantic parsing is one of the tasks of converting natural language utterances into structured logical parts that can be used as queries to generate responses. This paper introduces an algorithm that transforms natural sentences to obtain structural results. Such functionality is effective for Question and Answering (Q&A) and allows for spoken language understanding. Accepting the input as a natural language sentence allows many people, even those without technical skills, to access information effectively. The generated queries are executed on specific tables or databases and the information is then retrieved in a comprehensible way. The process of generating queries from a natural language sentence involves different operational steps, such as recognition of different types of words, synonym detection, feature grammar parsing, etc.

Keywords: Natural Language Processing (NLP) · Queries · Database · Sentences · Question and Answering (Q&A) · Information accessing
1 Introduction

1.1 Problem Definition

Nowadays, more and more systems are introducing speech assistants. It is commonplace for people of any age to use voice commands to make calls, schedule appointments, change settings and options, and access information using simple everyday words of their natural language. For that reason, it is necessary to simplify modern technology and build tools that make it possible to access large amounts of information easily. With the evolution of the industry, more and more tools will be replaced with simpler ones. This paper proposes an approach that automatically translates natural language into SQL syntax queries. In this way, users will be able to obtain the desired information from database tables without the nuisance of technical details. The overview of this approach is shown in Fig. 1. The suggested solution is intuitive and automatic because it is easily accessible and can be applied to any context. The only input information required is an OLAP table. All other details and specifications are generated automatically from the context of the provided table.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 747–759, 2022. https://doi.org/10.1007/978-3-031-10464-0_51
Fig. 1. Overview of the approach. [Diagram labels: on the client side, the input query sentence is sent over the Internet and the results are returned; on the server side, the parsing application (NLTK, SpaCy) works with the OLAP table and the internal DB.]
1.3 Technical Environment

Python. The majority of modern solutions for natural language processing are created using Python. Python is a high-level programming language that supports a variety of well-developed and widely used frameworks for language processing, including spaCy, NLTK and others. The solution presented in this paper is based primarily on Python 3.7, along with the other Python libraries described below.

Natural Language ToolKit (NLTK). Even though natural language processing tasks can be implemented in plain Python, NLTK is an extension that supports a wide range of NLP tasks. NLTK contains various modules and has an open-source license. Additionally, it offers functionality for importing a grammar and using it to parse text [1].

SpaCy. SpaCy is an open-source library for advanced natural language processing. It provides functionality that recognizes numbers, time periods and exact dates given in conversational form. For example, it can detect “last year” as a time period and “29th of September” as a date.

Structured Query Language (SQL). SQL is a standard language for dealing with databases and is used in programming and for managing data in relational databases. SQL has a specific syntax and grammar. The purpose of the proposed application is to automatically translate sentences as expressed by humans into SQL queries that are understandable to computers.
Intent Recognition with RASA [2, 8]. The first step in this process is intent recognition. This is important because the assistant can be connected to more than one OLAP table. Since those tables include information for different contexts, intent recognition is used to determine which table to access. RASA is a suitable tool for intent recognition. In order for intent recognition to function properly, it is necessary to train the RASA neural network by giving it examples for each intent. After being trained, the network is able to calculate, for each known intent, the probability that an input expresses it. Thereafter, an intent is recognized and the input can be used in the key replacement phase.

Observed Data. The OLAP table [11] is a flat virtual Cartesian table created from other real tables that are linked together and is used by OLAP reporting systems, such as QlikView and QlikSense. OLAP tables contain two types of data: measures and dimensional data. Dimensional values are usually enumerable and are used for grouping and filtering. Measures can be represented in the form of dates or numerical values. Numerical measures are generally used for aggregation and conditions. Date measures are used for the representation of date points or periods. A data scientist must first define an OLAP table (or use an existing one) using the corresponding format in order to proceed, which is fundamental to the data analysis. OLAP tables provide fast access to large data sets, as all analytical operations are performed on already fetched data and operations are completed in memory. There is no need for all data to be prefetched from the database, providing operations “on the fly”. The module also requires a configuration that includes the selected columns and the synonyms for each column.
2 Methodology

2.1 Pre-processing

Date Parsing. Date values are complicated data structures for language processing. In the presented module, date values are divided into two types: an exact date, which specifies a particular date point, and a date range, which represents a time period. Date values can be expressed in two forms, implicit and explicit. The implicit form is a self-evident format, e.g. ‘27 of April’. The explicit form defines more complex structures, such as ‘2 years ago from last Sunday’ or ‘for the last four years’. All date values are masked with specific keywords that can easily be recognized later in the grammar parsing. Additionally, dates are stored in a common format in order to avoid errors and to be more general.

Number Replacement. In contrast with date values, numbers are simpler to recognize. All numbers are normalized to the same format and masked with a specific keyword, i.e. “NUMBER”. Later, the masked values are retrieved and used to form the SQL query.

Key Word Replacement. Key words are the words used in the sentence when the system is asked for information. They do not include stop words, which will be considered later.
Key words consist of five word sets. The first includes synonyms for column names. When a word from this set is detected in the input, it is mapped to the corresponding column. The second set contains all distinct values from every enumeration column within the OLAP table. The third set is used for all words that express comparison, e.g. bigger, higher, smaller, less. The last two sets contain synonyms for grouping and ordering, which represent very important parts of the query, as they prioritize information according to the user's criteria.
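The number masking and key word sets described above can be illustrated with a minimal sketch. It is not the authors' implementation: the set contents, the `label` and `preprocess` names and the tokenization rule are assumptions made for illustration, while the `NUMBER` mask and the `_synonym` and `_value` key words follow the naming used in this paper.

```python
import re

# Hypothetical key word sets; in the paper they are derived from the OLAP
# table and the internal database rather than hard-coded.
COLUMN_SYNONYMS = {
    "region": {"region", "regions", "area"},
    "revenue": {"revenue", "revenues", "income"},
}
DISTINCT_VALUES = {
    "region": {"french", "italian"},
    "company_name": {"sony", "lenovo"},
}
COMPARATIVES = {"bigger": "positive", "higher": "positive",
                "smaller": "negative", "less": "negative"}
GROUPING_WORDS = {"grouped", "group"}
ORDERING_WORDS = {"ordered", "sorted"}

NUMBER_RE = re.compile(r"[0-9]+(?:\.[0-9]+)?")
TOKEN_RE = re.compile(r"[0-9]+(?:\.[0-9]+)?|\w+")


def label(word):
    """Map one lower-case word to its key word, or return it unchanged."""
    for column, synonyms in COLUMN_SYNONYMS.items():
        if word in synonyms:
            return f"{column}_synonym"
    for column, values in DISTINCT_VALUES.items():
        if word in values:
            return f"{column}_value"
    if word in COMPARATIVES:
        return COMPARATIVES[word]
    if word in GROUPING_WORDS:
        return "GROUPING"
    if word in ORDERING_WORDS:
        return "ORDERING"
    return word  # left for stop-word removal or spelling correction


def preprocess(sentence):
    """Return (masked tokens, stored numbers) for one input sentence."""
    tokens, numbers = [], []
    for word in TOKEN_RE.findall(sentence.lower()):
        if NUMBER_RE.fullmatch(word):
            numbers.append(float(word))  # kept for the final SQL query
            tokens.append("NUMBER")
        else:
            tokens.append(label(word))
    return tokens, numbers
```

Words that stay unchanged, such as "show" or "than", are the ones that a later step would either drop as stop words or pass to spelling correction.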
3 Query Parsing

After the key words are loaded, each word from the input is searched for in the sets and, if found, is replaced with a corresponding word that provides information on the word's meaning. For example, all column synonyms are replaced with a key word that consists of the column name and a “SYNONYM” suffix: the synonyms for the column region are replaced with region_synonym. If a distinct value is found in the input, it is replaced with a key word derived from the corresponding column name. For instance, the input ‘French’ would be replaced with region_value, because it is a distinct value from the column region.

3.1 Removing of Stop Words

The SpaCy library provides a list of stop words [9]. Stop words bring little to no meaning to the application. They are words such as ‘also’, ‘the’, ‘to’, ‘are’, etc. Such words are removed. All words that are left unmapped and are not stop words are listed as incorrect words. Incorrect words are processed by the edit distance algorithm [3]. It is assumed that those words were not recognized correctly by the microphone or were misspelled if written in a query.

3.2 Edit Distance Algorithm for Incorrect Words

The edit distance algorithm generates the probability of a word being confused with another word or combination of letters similar to the word in question. The algorithm calculates the percentage difference between the words, based on misplaced letters. It takes into account the closeness of syllables and letters, as well as the position of keyboard keys. The algorithm assigns penalty points according to the error made and the corrections needed to match the desired word. The final percentage is calculated using those points [4, 5]. In this particular case, three threshold values can be distinguished. The first is used for the automatic replacement of words. This value is the highest.
If the value of similarity between the words is higher than this automatic replacement threshold, the word is replaced without human interaction. The formula (1) for this calculation is:

$$T_1 = 1 - \frac{1}{l} \tag{1}$$
where T1 is the threshold value and l is the number of characters in the incorrect word. If T1 is bigger than 0.92, it is set to 0.92. If it is lower than 0.75, it is set to 0.75. The second threshold value limits the possible choices presented to the user when choosing the correct word. It is calculated using the following formula (2):

$$T_2 = 1 - \frac{3}{l} \tag{2}$$
where T2 is the threshold value and l is the number of characters in the incorrect word. If T2 is higher than 0.75, it is set to 0.75. If it is lower than 0.5, it is set to 0.5. All words with a similarity bigger than T2 and smaller than T1 are compiled into a list and presented to the user to choose from. The third threshold value is 0.25. If the similarity between the incorrect word and every one of the distinct and key values is lower than 0.25, the word is marked as irrelevant and is removed. The values 0.92, 0.75, 0.5 and 0.25 were chosen because they gave high accuracy during testing.

3.3 Grammar Parsing

After the pre-processing, in which words from the input sentence are replaced with key words that are recognized by the grammar, the query is parsed. Parsing of queries is the process of checking the sequence of key words in a sentence against a previously defined set of rules. Those rules define the parsing grammar. The grammar used in this case is context-free [10]. It is generated once, when the application is started. It consists of a couple of main rules for the structure of the sentence and rules formed for the columns. The column rules for retrieving information are generated automatically by the application according to the database. There are two types of columns: enumerable and numeric. Enumerable columns contain a finite set of values and numeric columns contain real values. The rules are generated via templates for each type of column. Since each SQL select query is composed of separate parts, the input should contain structured information that can be mapped to any of those parts. The grammar should not only recognize them, but also take into account the natural language order. The main structure of the grammar rules is decomposed into smaller rules that are reusable. Each rule is explained later.
Sentence -> Sentence OrderingPart | ColumnPart WherePart GroupPart | ColumnPart | ColumnPart WherePart | ColumnPart GroupPart
The main part of an SQL select query is the column part, in which the required columns are enumerated. It has the following syntax: “SELECT revenue, revision_year, name, etc.”. This is the only required part of the query. All other parts are optional. Generally, columns are described using synonyms. If users want to see specific information from columns, they use one of the corresponding synonyms. Synonyms are obtained from sources studying language synonyms and are carefully prepared by
people that are familiar with the natural language and the database. ColumnNames lists synonyms for all columns.

ColumnPart -> AllWord | ColumnNames
The where part contains restrictions and filters that specify the requested information. In other words, it describes the conditions in the WHERE statement. Users can ask for data in several ways using conditions. This is the most complex part because it contains rules, combined into different parts, for each column of the OLAP table. Those parts are generated automatically using a supporting database, which contains descriptions of the columns and defines how the grammar is supposed to operate with them. Since OLAP tables have two main column types, the grammar depends on those types, and for each column type there are several ways of requesting information. The where part maps each clause that describes a filtering request to a column. At the end, a recursive rule is added, which is an elegant way of allowing more than one condition, in any order. The statements are pre-processed and specific parts are replaced by others that are recognizable by the grammar. Following is an example of the description of a measure type column. Examples of the covered cases are “revenue is less than 100” or “revenue is 50 or more”. This part is also generated.
revenueClause -> revenueSynonyms Comparative Number | revenueSynonyms Number Comparative
Measurement conditions use comparative words. Comparatives are loaded from the same database as the synonyms. Many words are replaced in the pre-processing, each with one of the four possible comparatives:
• ‘positive’ covers the words for ‘greater’ (more than, above, greater, etc.)
• ‘negative’ covers most of the words for ‘lower’ (less than, to, behind, etc.)
• ‘equality’ (is equal to)
• ‘difference’ (is different than, is not)
Comparative[SEM='COMPARATIVE'] -> 'positive' | 'negative' | 'equal' | 'different'
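As a hedged illustration of how the four comparative tokens might later be turned into a WHERE condition, the sketch below handles both orders allowed by the revenueClause rule. The operator mapping and the `measure_condition` helper are assumptions, not taken from the paper; the only grounding is the later example in which “revenues bigger than 100” becomes `revenue > 100`.

```python
# Assumed mapping from the grammar's comparative tokens to SQL operators.
SQL_OPERATORS = {
    "positive": ">",
    "negative": "<",
    "equal": "=",
    "different": "<>",
}


def measure_condition(column, tokens, number):
    """Build a parameterized condition for one measure clause.

    `tokens` are the masked tokens that follow the column synonym, in either
    order accepted by the grammar: ['negative', 'NUMBER'] or
    ['NUMBER', 'positive']. `number` is the value stored for the mask.
    """
    comparatives = [t for t in tokens if t in SQL_OPERATORS]
    if len(comparatives) != 1 or tokens.count("NUMBER") != 1:
        raise ValueError(f"unsupported measure clause: {tokens}")
    operator = SQL_OPERATORS[comparatives[0]]
    # The column name is assumed to come from table metadata, not user input.
    return f"{column} {operator} ?", [number]
```

Strict operators are used here for simplicity; a phrase such as “50 or more” would arguably need `>=`, which this sketch does not distinguish.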
Dates represent measures, but the syntax for their query is somewhat different. Sometimes comparative words might be omitted and their meaning then refers to an exact
point. In other cases, ‘greater’ (‘positive’) or ‘smaller’ (‘negative’) synonym words are predefined.
revisionDateClause -> revisionDateSynonyms Comparative Date | revisionDateSynonyms Date
In the pre-processing phase of the input text, all numbers are replaced with the keyword NUMBER so that they can be parsed by the grammar. Input numbers are stored as attributes for later use in the final query.

Number[SEM='NUMBER'] -> 'NUMBER'
In the same manner as numbers, dates are normalized and replaced by keywords in order to avoid the mismatching of formats, as well as to benefit the lightweight syntax of the grammar. Dates are stored as attributes for later use in the final query. In this case, the UTC format will be used for the definition of the query.

Date[SEM='DATE'] -> 'DATE'
If the referred column is of a dimensional type, e.g. region, then it is limited by the following rule in the grammar:
regionClause -> regionNames regionSynonyms | AllWord regionSynonyms | regionNames
This is an example of the description of an enumerable column. It illustrates the case where the value “all regions” is selected, or any enumeration of column values, with or without a definitive word. The covered cases are:
• regionNames regionSynonyms: the most commonly used form, a value followed by a synonym, e.g. “Italian region”.
• AllWord regionSynonyms: the case when all values are taken into account, e.g. “all regions”.
• regionNames: only the given distinct value is considered, e.g. “Italy”.

The rules explained above apply to all columns, depending on their type. The defined rules are added to the where clause, thereby defining the limitations of the query. The where part is followed by the grouping part, after which comes the ordering part. Grouping is most commonly used for columns of the dimensional type. In cases where the information is hard to assimilate because of its magnitude, the grouping part combines rows. Usually, when there is a grouping part, there is an aggregation in the column part that is generated automatically. All columns
containing the measure type are aggregated using the sum function, except for columns that contain dates. Since summing dates is not meaningful, the minimum and maximum values of the column are shown instead. In the grammar, the grouping part is presented as:
GroupPart -> GroupSynonyms ColumnNames
GroupSynonyms[SEM='GROUPING'] -> 'GROUPING'
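The aggregation behaviour described for grouped queries (sum for numeric measures, minimum and maximum for date measures) might be sketched as follows. The `select_expressions` helper and the tuple layout are hypothetical; the class and type labels mirror the dimension/measure and String/Number/Date distinctions used in this paper.

```python
def select_expressions(columns, group_by):
    """Return the SELECT expressions for a grouped query.

    `columns` is a list of (name, column_class, column_type) tuples, where
    column_class is 'dimension' or 'measure' and column_type is 'String',
    'Number' or 'Date'.
    """
    expressions = []
    for name, column_class, column_type in columns:
        if name == group_by or column_class == "dimension":
            # Dimensions are left as they are (an assumption of this sketch).
            expressions.append(name)
        elif column_type == "Date":
            # Dates cannot be summed, so the range is shown instead.
            expressions.extend([f"MIN({name})", f"MAX({name})"])
        else:
            expressions.append(f"SUM({name})")
    return expressions
```

For example, grouping by region with a numeric revenue column and a date column would yield the region itself, `SUM(revenue)`, and the minimum and maximum of the date column.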
The ordering part is the last part and refers to the ORDER BY clause in the query, which allows the sorting of the result set by one or more columns. Two sorting directions are available, ascending and descending, and specifying one is optional. By default, the sorting is in ascending order. In the grammar, this part has the following syntax:
OrderingPart -> OrderingSynonyms ColumnNames | OrderingPart OrderingStyle | OrderingStyle OrderingPart
OrderingStyle[SEM='ORDERING_STYLE'] -> 'ASC' | 'DESC'
OrderingSynonyms[SEM='ORDERING'] -> 'ORDERING'
In this context, the grammar can be automatically generated. Since the information is retrieved from an OLAP table, this makes it easier to generate clear grammar.
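The template-based generation of column rules can be illustrated with a short sketch. The three clause templates reproduce the revenueClause, revisionDateClause and regionClause rules shown earlier; the `WherePart` and `Clause` rules, the kind labels and the `generate_where_rules` name are assumptions, since the paper mentions a recursive where rule without printing it.

```python
# Clause templates copied from the rules shown in this section; {c} is the
# column name.
CLAUSE_TEMPLATES = {
    "numeric": "{c}Clause -> {c}Synonyms Comparative Number"
               " | {c}Synonyms Number Comparative",
    "date": "{c}Clause -> {c}Synonyms Comparative Date | {c}Synonyms Date",
    "enumerable": "{c}Clause -> {c}Names {c}Synonyms"
                  " | AllWord {c}Synonyms | {c}Names",
}


def generate_where_rules(columns):
    """Generate grammar text for the where part.

    `columns` maps a column name to 'numeric', 'date' or 'enumerable'.
    """
    clause_names = [f"{name}Clause" for name in columns]
    rules = [
        # Assumed shape of the recursive rule; it is not printed in the paper.
        "WherePart -> Clause | Clause WherePart",
        "Clause -> " + " | ".join(clause_names),
    ]
    rules += [CLAUSE_TEMPLATES[kind].format(c=name)
              for name, kind in columns.items()]
    return "\n".join(rules)
```

The generated text still lacks the terminal rules for synonyms and distinct values; once those are appended, it could be handed to a grammar loader such as the one NLTK provides.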
4 Creating the Table Query

Once the input has been parsed successfully using the grammar, it is certain that the input is convertible to a database query. The output is represented in the form of a tree which, when iterated, provides a query of a size equal to the input query size. It contains labels of corresponding positions that describe the semantics of the word at each position in the input query. Semantics and inputs are used to build the query. This is achieved by extracting the recognized words and placing them in an output query with the proper construction, for example “SELECT … WHERE …”. For the cases described above, the generated queries have the syntax shown below.

What is the revenue and the number of employees of companies Lenovo and Sony in the Italian and French region, grouped by region?

Since the input sentence asks for more than one value of the column region, which natural language expresses with the conjunction “and”, the SQL query must combine these values with OR to be logically correct.

Common SQL syntax query:

SELECT revenue, employees_num, company_name, region
FROM OLAP_TABLE
WHERE (company_name = 'sony' OR company_name = 'lenovo') AND (region = 'italian' OR region = 'french')
GROUP BY region
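The rule that values of the same column are joined with OR while different columns are joined with AND can be sketched as below. This is an illustrative assumption rather than the authors' code: the `build_query` name, its parameters and the use of `?` placeholders instead of inlined literals are choices made for this example.

```python
def build_query(columns, filters, table="OLAP_TABLE",
                group_by=None, order_by=None):
    """Assemble a SELECT statement and its parameter list.

    `filters` maps a column name to the list of values requested for it.
    Values of one column are alternatives (OR); columns are combined
    with AND. Column and table names are assumed to come from table
    metadata, never from raw user input.
    """
    where_parts, params = [], []
    for column, values in filters.items():
        alternatives = " OR ".join(f"{column} = ?" for _ in values)
        where_parts.append(f"({alternatives})")
        params.extend(values)

    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    if group_by:
        sql += f" GROUP BY {group_by}"
    if order_by:
        sql += f" ORDER BY {order_by}"
    return sql, params
```

Called with the columns and filters of the example sentence above, the sketch produces the same WHERE structure as the query shown, with the four literal values returned separately as parameters.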
Results can be presented in the form of:
• Tables, like Table 1.

Table 1. Example result table.

| revenue | employees_num | company_name | Region |
|---------|---------------|--------------|--------|
| …       | …             | sony         | French |
| …       | …             | …            | …      |
• Text, which is suitable for single row results.
revenue = …. and employees number = …..

Show me region and company for all companies with revenues bigger than 100

Common SQL syntax query:

SELECT region, company_name, revenue FROM OLAP_TABLE WHERE revenue > 100
Here the results can be shown in the form of:
• A table, as in Table 2.

Table 2. Example of the query results as a table.

| Region  | Company_name | Revenue |
|---------|--------------|---------|
| french  | sony         | 120     |
| italian | sony         | 130     |
| french  | lenovo       | 130     |
| italian | lenovo       | 120     |
• Charts which are suitable for more complex results as in Fig. 2. The table could be easily customized by the user requirements with a couple of rules added to the grammar. Their purpose will be to construct the table according to the requirements for the table type and information shown on the x and y axis.
[Figure 2: bar chart of revenue (vertical axis, 114 to 132) for sony and lenovo in the french and italian regions.]
Fig. 2. Example of the query results as a chart.
A query is successfully parsed when all words are recognized and their order follows the grammar’s rules. Figure 3 shows the main steps of the application’s algorithm. First, the information is loaded from the OLAP table. In this study the OLAP table is a flat table containing the information that is available to users. The table represents the main input for the generation of a context. The distinct values from the OLAP table are loaded and recorded in an internal structure used later in the process. The information in this structure is used for the generation of the grammar.
1. Loading the distinct values from the OLAP table
2. Generating grammar
3. Obtaining the sentence from microphone or text
4. Making corrections of words if necessary
5. Parsing the query
6. Executing the query
7. Returning the result to the device in proper format (table, chart)

Fig. 3. Overview of the steps of the algorithm
Once the grammar is generated, it is ready to parse sentences that are obtained from voice or text. Each word in the input sentence must be recognized and labeled. Recognized words are distinct values and synonyms, which are added to the input query for more flexibility. Unlabeled words are marked as incorrect and are sent for correction with the edit distance algorithm. After the corrections are made, the corrected words are substituted into the input sentence. Once the sentence is completely corrected, it is parsed and used in the process of generating queries. All words that are found in the input should be recognized and labeled, so that the grammar can parse them. Parsing with the grammar is the process of checking the order of words in a sentence, which must match the predefined rules of the grammar. Consequently, the query is generated and, if it is valid, it can be executed and the retrieved results can be displayed in various formats, such as tables, charts or even voice.
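The correction step summarized above relies on the thresholds of Sect. 3.2, which can be sketched as follows. The clamping bounds and the constant 0.25 come from the paper; the function names, the handling of exact boundary values and the "unresolved" outcome for similarities between 0.25 and T2 are assumptions, since the paper does not specify them. The similarity score itself is taken as an input, because the authors' weighted edit distance is not reproduced here.

```python
IRRELEVANT_THRESHOLD = 0.25  # third threshold from Sect. 3.2


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def auto_threshold(word):
    """T1 = 1 - 1/l, kept within [0.75, 0.92]."""
    if not word:
        raise ValueError("word must not be empty")
    return clamp(1 - 1 / len(word), 0.75, 0.92)


def suggest_threshold(word):
    """T2 = 1 - 3/l, kept within [0.5, 0.75]."""
    if not word:
        raise ValueError("word must not be empty")
    return clamp(1 - 3 / len(word), 0.5, 0.75)


def classify(word, best_similarity):
    """Decide what to do with an incorrect word given its best match score."""
    if best_similarity > auto_threshold(word):
        return "replace"   # substituted without user interaction
    if best_similarity > suggest_threshold(word):
        return "suggest"   # candidate is offered to the user
    if best_similarity < IRRELEVANT_THRESHOLD:
        return "remove"    # word is treated as irrelevant
    return "unresolved"    # range not covered by the paper (assumption)
```

For the six-letter word cissco, T1 is about 0.83 and T2 is 0.5, so a best similarity of 0.78, the value reported for cisco in Sect. 5, falls between the two and leads to a suggestion rather than an automatic replacement.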
5 Results

The OLAP table used in the test environment comprises five columns, which contain the name of the company, the region of the branch, the revenue of the corresponding branch, the revision year and the number of employees. Company name and region columns are dimension type data, while all others represent measure types. The table contains 10000 records. The internal database contains supporting information that is required for the grammar generation. Figure 4 shows the model of the internal database.
Fig. 4. The internal database relational model.
META_INFO includes the properties of each column: the column’s name, type and class. Types provide information on the kinds of values that can be found in the column, which are String, Number and Date. These values are described in the COLUMN_TYPES table. Classes represent types of data that are stored in columns and are classified into dimensions and measures. They are located in the COLUMN_CLASSES table. Table DISTINCT_VALUES contains distinct values from all columns. The SYNONYMS table stores synonyms that can be used to designate columns. There are two additional tables that contain information on key words, such as “more”, “bigger”,
“smaller”, “less”, etc. Those key words are used throughout the entire application and are not connected specifically to columns.

In order to test the described algorithm, a Python prototype application was built. It uses a Python Qt5 library for visualization and basic functionality. The input sentence can be given as voice or text. The natural language sentence is then automatically converted to an SQL query. Additionally, an intermediate result of the parsing is provided, showing only the meaningful words after the stop words and other unnecessary information have been removed. Then, if all the parsing and validation stages are successful, the SQL query is displayed and the results are shown in table format.

Here the given sentence is ‘What is the revision and revenue of Nvidia and Disney in the Italian and British region, ordered by revision’. After parsing, the algorithm provides the following query, which matches the given criteria:

SELECT revision_year, revenue FROM OLAP_TABLE WHERE (name = 'nvidia' OR name = 'disney') AND (region = 'italian' OR region = 'british') ORDER BY revision_year

The application shows the input taken from the user, the corrected sentence which is “understood” by the application, the final result of the parsing, as well as the results of the query. The algorithm can be used from a desktop application, a web browser or a mobile application [6].

However, if the input is misspelled, the application requires an additional interaction with the user in order to correct the incorrect word. Consider the following example: ‘give me namess and rewenues for cissco sorte by region’. This sentence has several errors. The algorithm automatically substitutes some words, such as rewenues, namess and sorte, since they come very close to their correct counterparts. However, when the misspelled word is not that close to the correct one, some additional interaction is needed.
The application shows a list of alternative words, ordered by their similarity to the misspelled word. For example, for the misspelled word cissco there are several candidates, the top three being:
• 0.78 – cisco
• 0.58 – swiss
• 0.56 – classify

For further details, log information is provided, where the automatic actions and performance benchmarks can be found. All parsing is performed almost immediately, in less than a second.

5.1 Future Developments

The main logic of the application runs on a server, so it can be accessed from anywhere. The proposed solution provides information in a user-friendly and flexible manner. Since smartphones have become such an indispensable part of our lives, the application can be made accessible from mobile devices. It could also be accessed via a web browser or desktop application, depending on the user's needs.
Further developments will include multi-language support and interactive conversation, thereby enabling the usage of aggregation functions, such as sum, average, max and min, as well as ordering and chart type selection of the result table for the primary data request.
6 Conclusion

Natural language represents an easy way to obtain desired information without the necessity of spending additional time to learn the required technical details. The solution presented in this paper can be used in different contexts, for example to obtain needed information immediately using only simple sentences.
References

1. Perkins, J.: Python Text Processing with NLTK 2.0 Cookbook: Over 80 Practical Recipes for Using Python’s NLTK Suite of Libraries to Maximize Your Natural Language Processing Capabilities. Packt Pub., Birmingham, U.K. (2010)
2. Bocklisch, T., Faulkner, J., Pawlowski, N., Nichol, A.: Rasa: open source language understanding and dialogue management
3. Paskalev, P., Antonov, A.: Intelligent application for duplication detection. CompSysTech’2006 (2006)
4. Antonov, A., Paskalev, P.: Increasing the performance of an application for duplication detection. CompSysTech’2007 (2007)
5. Skurzok, D., Ziółko, B.: Edit distance comparison confidence measure for speech recognition. In: Lecture Notes in Electrical Engineering (2013)
6. Aiello, M., Yang, Y., Zou, Y., Zhang, L.-J. (eds.): Artificial Intelligence and Mobile Services – AIMS 2018: 7th International Conference, Held as Part of the Services Conference Federation, SCF 2018, Seattle, WA, USA, June 25–30, 2018, Proceedings. Springer International Publishing, Cham (2018)
7. Machine learning master: What is natural language processing? https://machinelearningmastery.com/natural-language-processing/. Accessed 8 Oct 2021
8. Open source conversational AI. Rasa. https://rasa.com/. Accessed 10 Oct 2021
9. Medium: Stop Words in NLP. https://medium.com/@saitejaponugoti/stop-words-in-nlp-5b248dadad47. Accessed 11 Oct 2021
10. Brilliant: Context free grammars. https://brilliant.org/wiki/context-free-grammars/. Accessed 11 Oct 2021
11. Microsoft: Overview of online analytical processing (OLAP). https://support.microsoft.com/en-us/office/overview-of-online-analytical-processing-olap-15d2cdde-f70b-4277-b009-ed732b75fdd6. Accessed 12 Oct 2021
Enriching Contextualized Representations with Biomedical Ontologies: Extending KnowBert to UMLS

Guilhem Piat1(B), Nasredine Semmar1, Alexandre Allauzen2, Hassane Essafi1, and Julien Tourille1

1 Université Paris-Saclay, CEA, List, 91120 Palaiseau, France
{guilhem.piat,nasredine.semmar,hassane.essafi,julien.tourille}@cea.fr
2 Université Paris Dauphine, LAMSADE, 75775 Paris Cedex 16, France
[email protected]
Abstract. Currently, biomedical document processing is mostly human work. Software solutions which attempt to alleviate this burden exist but generally do not perform well enough to be helpful in many applications. Concurrently, there exist projects which organize concepts in the biomedical field. Therefore, we seek to leverage existing structured knowledge resources to improve biomedical language modeling. In this paper, we provide an implementation integrating the UMLS knowledgebase into a BERT-based language model, aiming to improve its performance in biomedical Named Entity Recognition. To achieve this, we extend KnowBert, a recently developed technique for integrating knowledge into language models. Preliminary results reveal the challenges of applying KnowBert to the biomedical domain given the number and subtlety of different concepts in UMLS. Going forward, addressing these challenges and combining this with other approaches such as BioBERT may help expand the range of usefully automatable biomedical language processing tasks.

Keywords: Artificial neural networks · Knowledge based systems · Knowledge representation · Machine learning · Biomedical informatics · Information extraction

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 760–773, 2022. https://doi.org/10.1007/978-3-031-10464-0_52

With over a million articles published every year in the biomedical field and the large number of patient records generated by hospitals, it is increasingly difficult for healthcare professionals to keep up to date on research, carry out systematic reviews, or search for patient information. There is thus a demand for language processing tools able to identify and extract meaningful information from these texts. For this reason, multiple knowledge bases such as the Unified Medical Language System (UMLS) and the OpenTargets LIterature coNcept Knowledge base
(LINK) have been created to make information more searchable. Our objective is to enable the recently developed Transformer-based pretrained neural Language Models (LMs) [23] to make explicit use of this knowledge, in order to improve their performance and interpretability. We follow the method described by [18] known as KnowBert to integrate knowledge derived from the UMLS Knowledge Base into a BERT-based language model. We thus call this model KnowBert-UMLS. Other projects such as BioBERT [11] and ClinicalBert [1] have successfully specialized language models to the biomedical domain. However, their approach has typically not explicitly leveraged structured knowledge sources such as UMLS. A notable exception is UmlsBERT [14], which leverages UMLS as a thesaurus to explicitly teach BERT synonymy and enriches biomedical word representations with rough clustering. Our approach using KnowBert differs significantly in that it makes use of the full vocabulary of UMLS to enrich word representations, and jointly performs entity linking. It is also fairly indifferent to the pretrained LM used as a base, and can be combined with another specialized model such as BioBERT. In the context of large and specialized knowledge bases such as UMLS, we find the approach proposed by [18] to be computationally unrealistic with current tools for most organizations. Preliminary biomedical Named Entity Recognition evaluations of our model trained on a small subset of our training corpus demonstrate a decrease in performance with respect to models with non-enriched word representations. We investigate the reasons for this, and propose ways to alleviate this computational burden. The remainder of the paper is organized as follows. We contextualize our approach and motivations by discussing related work in Sect. 2. In Sect. 3, we overview the architecture of KnowBert and discuss the specifics of our extension of it to the UMLS Knowledge Base. 
The preliminary experimental results are reported and discussed in Sect. 4. Finally, we present in Sect. 5 our conclusions and future work.
2 Related Work
In the wake of the advent of the Transformer architecture introduced by [23], language processing tasks have increasingly been handled by neural language models based upon this architecture such as BERT [6]. Due to the differences between the language used in the biomedical field and the types of text typically used to train these models, many projects have sought to leverage the Transformer architecture in the more specific and rigorous biomedical context. The typical approach has been to pre-train language models on specialized text as has been done with BioBERT [11], BioMed-RoBERTa [8], SciBert [3], and Clinical BERT [1], which all incorporate various amounts and proportions of biomedical text in the pre-training phase of their models. However, not only does this type of pre-training usually lead models to underperform on general-domain text as demonstrated by [2], large models with attention mechanisms such as BERT are also notoriously computationally expensive
G. Piat et al.
to pre-train. Furthermore, the ability of a model to associate concepts (e.g. “COVID-19” and “respiratory failure”) is predicated on these concepts appearing in the pre-training corpus, leading to difficulty adapting to some forms of distributional shift. These limitations have led to interest in different methods of adapting these models to specific domains. One such method is integrating information derived from existing Knowledge Bases (KBs) into these models. Multiple methods have been proposed to achieve this. E-BERT [19], for instance, projects entity embeddings derived from the KB to the input word embedding space. UmlsBERT [14], on the other hand, explicitly adds semantic group embeddings to words found in UMLS. ERNIE [26] and BERT-MK [9] learn a fused representation of contextualized word and entity representations. K-Adapters [24] are a cheaper alternative to the ERNIE process which avoids catastrophic forgetting by using Adapters [10]. One drawback of these methods, particularly in the biomedical context, is that they all require a separate upstream Entity Linking (EL) step at inference. While Entity Linkers are known not to be perfectly reliable, Biomedical Entity Linking is a particularly difficult task. Some of the best performing models include RysannMD [5], which achieves 0.436 F1 on the CRAFT corpus, and the dual encoder architecture proposed by [4], which achieves 0.564 F1 on the MedMentions corpus [15]. Because the effectiveness of the knowledge integration step is inherently limited by the Entity Linking step, the possibility of performing knowledge enrichment without Entity Linking, or jointly with it, becomes attractive. KEPLER [25] is one such model: it introduces a knowledge embedding loss as an objective for language model pre-training, aligning contextual word representations with entity description representations. As such, it does not require an EL step at inference.
The KnowBert model developed by [18], on the other hand, grafts a KB-specific entity linking module into a transformer-based pretrained LM such as BERT, in order to jointly perform Entity Linking and contextualized word representation enrichment. This also makes it a standalone method requiring no upstream EL, with the additional benefit of explicitly identifying the entities present in the text. Our approach to biomedical Language Modeling thus differentiates itself from existing methods by leveraging the ability of KnowBert to jointly perform Entity Linking and Language Modeling, and applying it in a biomedical context in order to improve word representations and generate metadata which can be used for a variety of downstream tasks. Additionally, relying on structured knowledge enables KnowBert-based models to recognize new concepts without additional training and to perform similarly to non-specialized models on general domain text.
3 KnowBert-UMLS
As shown in Fig. 1, the KnowBert architecture is composed of three main components: a pretrained Language Model backbone, a Knowledge Attention
and Recontextualization Module (or KAR) which performs Entity Linking and knowledge enrichment of word representations, and a candidate mention generator.
Fig. 1. Abstraction of the KnowBert Architecture. KnowBert Extends BERT by Adding a Knowledge Attention and Recontextualization Module (KAR) between Two Transformer Layers; in this Case between Layers 10 and 11.
3.1 Pretrained BERT
While the KnowBert method can apply to most Transformer-based pretrained language models, we focus on BERT as it was used by [18]. BERT models comprise $L$ Transformer layers. For a sequence of $N$ tokens, each layer $i$ takes as input an $N \times H$-dimensional sequence representation $\mathbf{H}_{i-1}$ and outputs a representation $\mathbf{H}_i$ which integrates more contextual information by applying a multi-headed attention mechanism to $\mathbf{H}_{i-1}$ followed by a Multi-Layer Perceptron. In the case of BERT$_{\text{BASE}}$, we have $L = 12$ and $H = 768$. The final output for each token is thus a contextualized representation in $\mathbb{R}^H$.
3.2 Ontology and Candidate Generator
The KnowBert method ties a pretrained language model to an Ontology, specifically a Knowledge Base. For our purposes, we define a Knowledge Base $\mathcal{K}$ as a set of $J_{\mathcal{K}}$ entities $e_j$, each with a vector representation $\mathbf{e}_j \in \mathbb{R}^K$. We use the UMLS Knowledge Base, with each entity corresponding to a Concept Unique Identifier. The entity embeddings we use are computed according to the adversarial method provided by [13] with $K = 50$. To perform the Entity Linking step, the KAR requires a candidate generator to create a list $C$ of candidate mentions. Specifically, each sequence is associated
to a set $\mathcal{S}$ of $S$ candidate spans, which may or may not contain an entity mention. Each candidate span is then assigned a corresponding list of candidate entities, including a null entity representing the lack of an entity mention within the candidate span. Formally, we have:
$$C = \{(s, \{e_{s,1}, \dots, e_{s,J_s}\}) \mid s \in \mathcal{S}\} \tag{1}$$
with each candidate span $s$ being associated with a set of $J_s$ candidate entities, and each entity $e_j$ having a corresponding vector $\mathbf{e}_j \in \mathbb{R}^K$. These candidates are produced by a candidate generator which follows rules specific to the KB being used. KnowBert as specified by [18] implements compatibility with two KBs, namely WordNet and Wikipedia. The challenge in crafting a mention generator for the biomedical domain, and specifically UMLS, is the variability of formulations for each concept. In UMLS, each concept is associated with a list of common strings (called “atoms”) that may represent it. For instance, the concept for lung cancer is associated with 97 different forms, including “pulmonary carcinoma”. To leverage these atoms, we attempted several methods based on string similarity and cosine similarity of vectorial word representations. We found the most effective option for our purpose to be the QuickUMLS Python library [20] which, given some text, identifies candidates in the form (span start, span end, concept ID). We then aggregate candidate entities by candidate span, derive an empirical estimate of the prior probabilities for each entity from MedMentions, and find the relevant entity embeddings as described in Fig. 2. Finally, we feed the output of this candidate generation process to the KAR. In practice, matching each of the approximately 180 million sequences in our training corpus to the 16 million atoms in UMLS on demand is prohibitively computationally expensive. To work around this, we precompute the candidates for each of our sequences ahead of time and create a lookup table for each file in our corpus. This needs to be done only once and is parallelizable. Nonetheless, 3.5% of our corpus took six days to process across seven nodes of a computing cluster, each equipped with two Xeon 36-thread processors with a clock speed of 3 GHz, and required us to favor speed over recall and precision when considering QuickUMLS settings.
Depending on which similarity measure and threshold are chosen, QuickUMLS trades off between recall and execution time. We settled on Jaccard similarity, with a threshold of 0.7 as the best compromise we could find. In our experience, the computational impact at inference is fairly low for on-demand low-volume applications, as the candidate generator typically takes fractions of a second to process a sequence.
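The aggregation step described above (grouping candidate triples by span, adding a null entity and attaching prior probabilities) can be sketched as follows. This is a minimal illustration only: the function name, the `NULL_ENTITY` marker, the add-one smoothing, the placeholder concept identifiers and the toy counts are all assumptions made for this example, and are not taken from the KnowBert-UMLS code or from the QuickUMLS API.

```python
from collections import defaultdict

NULL_ENTITY = "@@NULL@@"  # hypothetical marker for "no entity in this span"


def aggregate_candidates(matches, prior_counts, null_count=1.0):
    """Group (start, end, concept_id) matches by span and attach priors.

    matches: iterable of (span_start, span_end, concept_id) triples.
    prior_counts: dict concept_id -> count in an annotated corpus
        (toy values here; the paper estimates priors from MedMentions).
    Returns {(start, end): [(entity, prior), ...]} with priors summing to 1.
    """
    by_span = defaultdict(list)
    for start, end, cui in matches:
        if cui not in by_span[(start, end)]:
            by_span[(start, end)].append(cui)

    candidates = {}
    for span, cuis in sorted(by_span.items()):
        entities = cuis + [NULL_ENTITY]
        # Add-one smoothing keeps unseen concepts at a non-zero prior
        # (an assumption of this sketch, not a detail given in the paper).
        counts = [prior_counts.get(c, 0.0) + 1.0 for c in cuis] + [null_count]
        total = sum(counts)
        candidates[span] = [(e, c / total) for e, c in zip(entities, counts)]
    return candidates


# Toy input with placeholder concept IDs: two concepts compete for
# span (0, 2), and a single concept is proposed for span (3, 4).
toy_matches = [(0, 2, "C0000001"), (0, 2, "C0000002"), (3, 4, "C0000003")]
toy_counts = {"C0000001": 8.0, "C0000002": 0.0, "C0000003": 3.0}
toy_candidates = aggregate_candidates(toy_matches, toy_counts)
```

Each span ends up with its own short candidate list, which is what makes a precomputed lookup table per file practical: the expensive string matching happens once, and training only reads the aggregated result.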
Fig. 2. Detailed structure and output of the UMLS candidate generator.
3.3 KAR
The KnowBert approach adds a KB-specific “Knowledge Attention and Recontextualization module”, or KAR, between two transformer layers in a pretrained BERT model. This module is a relatively inexpensive addition to the pretrained model, with in our case only approximately 0.3% as many trainable parameters as BERT$_{\text{BASE}}$. Multiple KBs can be used in tandem: theoretically, a KAR can be inserted between every pair of layers in the transformer. In practice, inserting a KAR too close to the input layer perturbs the flow of information too much and prevents the model from recovering during training. As suggested by [18], in order to minimize the language model’s perplexity (see footnote 1), we insert the KAR between the tenth and eleventh layers of BERT$_{\text{BASE}}$ as per Fig. 1. This module performs entity linking on the intermediate contextualized word representations and pools them with the relevant entity embeddings. This results in contextualized word representations which are enriched with information extracted from a KB. Specifically, the KAR takes as input a sequence representation $\mathbf{H}_i$ and a list $C$ of $S$ candidate mentions as generated by the Candidate Generator (see (1)). As described by [18] and illustrated in Fig. 3, the KAR first linearly projects the output of the previous transformer layer $\mathbf{H}_i$ to the entity embedding space:
$$\mathbf{H}^{\mathrm{proj}}_i = \mathbf{H}_i \mathbf{W}^{\mathrm{proj}} + \mathbf{b}^{\mathrm{proj}} \tag{2}$$
where $\mathbf{W}^{\mathrm{proj}}$ and $\mathbf{b}^{\mathrm{proj}}$ are learned.
Footnote 1: Perplexity is computed as the exponential of the cross-entropy loss, and is a standard measure of how well the language model predicts samples.
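The definition of perplexity given in the footnote can be illustrated with a few lines of Python. This is only a sketch: the function name and the toy token probabilities are made up for illustration, and natural logarithms are assumed for the cross-entropy.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(cross-entropy), with cross-entropy taken as the
    mean negative natural-log probability of the predicted tokens."""
    cross_entropy = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(cross_entropy)


# Toy log-probabilities assigned by a model to four masked tokens (made up).
toy_log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
toy_ppl = perplexity(toy_log_probs)

# A model that guesses uniformly among 100 options has perplexity 100.
uniform_ppl = perplexity([math.log(1.0 / 100.0)] * 5)
```

The uniform case gives a feel for the scale of the values: a perplexity of V means the model is, on average, as uncertain as if it were choosing uniformly among V tokens.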
Fig. 3. Detailed structure of the knowledge attention and recontextualization module (KAR).
Then, the projected embeddings for the words in each span are pooled into a matrix $\mathbf{S} \in \mathbb{R}^{S \times K}$ of span embeddings. Each span embedding is computed following “End-to-end neural coreference resolution” by [12], who describe a way to compute text span vectors: each token in each span is associated with a weight computed from the contextualized embeddings fed through a trained feed-forward neural network (FFNN). These weights are softmaxed with respect to each span of text, and serve as the weights for a weighted-sum pooling of the non-contextualized token embeddings, resulting in non-contextualized text span embeddings. These span embeddings are then contextualized with a standard transformer layer to allow the entity linker to identify relationships between entity mentions, resulting in the contextualized span embedding matrix $\mathbf{S}^e$:
$$\mathbf{S}^e = \mathrm{MLP}(\mathrm{MultiHeadAttn}(\mathbf{S}, \mathbf{S}, \mathbf{S})) \tag{3}$$
where MLP and MultiHeadAttn designate a position-wise Multi-Layer Perceptron and a Multi-Headed Attention layer respectively. The contextualized span embedding $\mathbf{s}^e$ of every candidate span $s$ is then used to pool the corresponding matrix of candidate entity embeddings $\mathbf{E}_s$ from the KB, resulting in a predicted entity representation:
$$\vec{\psi}_s = \mathrm{Softmax}(\mathrm{MLP}(\mathbf{p}_s, \mathbf{s}^e \cdot \mathbf{E}_s)), \qquad \tilde{\mathbf{e}}_s = \vec{\psi}_s \cdot \mathbf{E}_s \tag{4}$$
where $\mathbf{p}_s \in \mathbb{R}^{J_s}$ is the vector of prior probabilities for the candidate entities associated with span $s$, and $\vec{\psi}_s \in \mathbb{R}^{J_s}$ is an estimate of their posterior probabilities. The predicted entity embeddings $\tilde{\mathbf{e}}_s$ of each span $s$ are packed into a matrix $\tilde{\mathbf{E}}$ and added to the contextualized span embeddings $\mathbf{S}^e$, forming the knowledge-enriched span embedding matrix $\mathbf{S}'^e$:
$$\mathbf{S}'^e = \mathbf{S}^e + \tilde{\mathbf{E}} \tag{5}$$
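The span pooling and entity pooling steps can be sketched for a single span in plain Python. Several simplifications are assumed here and are not part of the KnowBert method: the trained FFNN is replaced by a list of given token scores, the transformer layer of Eq. (3) is skipped, the learned MLP of Eq. (4) is replaced by a fixed score (prior plus dot product), and the null entity is represented by a zero vector. All function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pool_span(token_vecs, token_scores, start, end):
    """Weighted-sum pooling over the inclusive token range [start, end].

    token_scores stands in for the output of the trained FFNN.
    """
    weights = softmax(token_scores[start:end + 1])
    dim = len(token_vecs[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vecs[start:end + 1]))
        for d in range(dim)
    ]


def link_span(span_vec, entity_vecs, priors):
    """Return (posterior, predicted entity vector, index of linked entity).

    The learned MLP is replaced by the fixed score
    prior + span_vec . entity_vec (assumption, for illustration only).
    """
    scores = [p + dot(span_vec, e) for p, e in zip(priors, entity_vecs)]
    psi = softmax(scores)
    dim = len(entity_vecs[0])
    predicted = [sum(w * e[d] for w, e in zip(psi, entity_vecs)) for d in range(dim)]
    linked = max(range(len(psi)), key=psi.__getitem__)
    return psi, predicted, linked


# Toy example with K = 2: three projected token vectors, one span (tokens 0-1),
# and two candidates, the second being a zero-vector null entity.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ffnn_scores = [0.0, 0.0, 5.0]
span_vec = pool_span(tokens, ffnn_scores, 0, 1)  # equal scores give equal weights
candidates = [[1.0, 0.0], [0.0, 0.0]]
psi, e_tilde, linked = link_span(span_vec, candidates, priors=[0.6, 0.4])
```

Because the softmax is taken only over the tokens inside the span, the high score of the third token has no influence on the pooled vector for the span covering tokens 0 and 1.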
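The representation $\mathbf{H}'^{\mathrm{proj}}_i$ is computed with word-to-enriched-entity-span attention, similarly to applying a regular transformer layer to $\mathbf{S}'^e$ but substituting the query in the attention mechanism with the projected word embeddings $\mathbf{H}^{\mathrm{proj}}_i$:
$$\mathbf{H}'^{\mathrm{proj}}_i = \mathrm{MLP}(\mathrm{MultiHeadAttn}(\mathbf{H}^{\mathrm{proj}}_i, \mathbf{S}'^e, \mathbf{S}'^e)) \tag{6}$$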
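Finally, the knowledge-enriched contextual word representation output of the KAR is a projection of $\mathbf{H}'^{\mathrm{proj}}_i$ back to the BERT contextualized word representation space, with an added skip connection from the KAR input $\mathbf{H}_i$:
$$\mathbf{H}'_i = \mathbf{H}'^{\mathrm{proj}}_i \mathbf{W}'^{\mathrm{proj}} + \mathbf{b}'^{\mathrm{proj}} + \mathbf{H}_i \tag{7}$$
where $\mathbf{W}'^{\mathrm{proj}}$ and $\mathbf{b}'^{\mathrm{proj}}$ are learned. The linked entity for span $s$ is simply $e_{s,\operatorname{argmax}(\vec{\psi}_s)}$.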
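To make the dimensions of the two projections concrete, the sketch below applies a down-projection from size H to size K, a placeholder for the enrichment steps in between, and an up-projection back to size H followed by a skip connection. It assumes that the skip connection adds the KAR input back to the output, as the dimensions require; the helper names, the identity placeholder for the attention steps and the toy weights are all illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def affine(x, w, bias):
    """Compute x @ w + bias, adding the bias vector to every row."""
    return [[v + c for v, c in zip(row, bias)] for row in matmul(x, w)]


def kar_projections(h, w_down, b_down, w_up, b_up, enrich=lambda m: m):
    """Down-project, enrich, up-project, then add the input back."""
    h_proj = affine(h, w_down, b_down)                      # N x H -> N x K
    h_proj_enriched = enrich(h_proj)                        # placeholder for the span steps
    projected_back = affine(h_proj_enriched, w_up, b_up)    # N x K -> N x H
    # Skip connection: add the original layer output row by row.
    return [[p + s for p, s in zip(p_row, h_row)]
            for p_row, h_row in zip(projected_back, h)]


# Toy sizes: N = 2 tokens, H = 3, K = 2.
h = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]    # H x K
b_down = [0.0, 0.0]
b_up = [0.0, 0.0, 0.0]

# With a zero up-projection, the module leaves the representation untouched.
w_up_zero = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # K x H
unchanged = kar_projections(h, w_down, b_down, w_up_zero, b_up)

# With a non-zero up-projection, the KB-derived signal is added on top of h.
w_up = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
shifted = kar_projections(h, w_down, b_down, w_up, b_up)
```

The zero-weight case shows why a skip connection is a convenient design for grafting a module into a pretrained network: the module can start close to an identity mapping and only gradually perturb the representations passed to the next layer.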
3.4 Training
There are three training steps for KnowBert models. First, once the mention generator is written, the KAR is trained on the Entity Linking task on spans given by the corpus, minimizing a log-likelihood loss for the predicted probability distribution over candidate entities:
$$\mathcal{L}_{EL} = -\log \frac{\exp(\psi_{sg})}{\sum_{k=1}^{n_s} \exp(\psi_{sk})} \tag{8}$$
with $\psi_{sg}$ the score for the ground truth entity in $\vec{\psi}_s$. The second training phase involves continuing the pre-training of BERT using both a Masked Language Model and a Next Sentence Prediction objective. This phase corrects the disruptions incurred by the Language Model when grafting the KAR between the Transformer Layers in BERT. This step also adjusts the weights of the KAR for Entity Linking, minimizing:
$$\mathcal{L}_{KnowBert} = \mathcal{L}_{BERT} + \mathcal{L}_{EL} \tag{9}$$
We call this phase the “re-training” step to differentiate it from the BERT pre-training step and the fine-tuning step. The final step, as with most pretrained LMs, is to fine-tune it to the target task.
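The two training objectives can be sketched numerically as follows. The entity linking loss is written for a single span as a negative log-softmax of the gold entity's score; summing it over spans before adding the language model loss is an assumption of this sketch, as are the function names and the toy scores.

```python
import math


def entity_linking_loss(scores, gold_index):
    """Negative log-softmax of the gold entity's score for one span."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_norm - scores[gold_index]


def knowbert_loss(lm_loss, span_scores, gold_indices):
    """Language model loss plus entity linking loss.

    Summing the per-span EL losses is an assumption of this sketch.
    """
    el_loss = sum(entity_linking_loss(s, g) for s, g in zip(span_scores, gold_indices))
    return lm_loss + el_loss


# Two candidates with equal scores: the loss is log(2).
uniform = entity_linking_loss([0.0, 0.0], gold_index=0)
# The gold candidate clearly dominates: the loss is close to zero.
confident = entity_linking_loss([10.0, 0.0], gold_index=0)
# Combined objective for a hypothetical LM loss of 1.5 and the two spans above.
total = knowbert_loss(1.5, [[0.0, 0.0], [10.0, 0.0]], [0, 0])
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow when scores are large.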
4 Preliminary Experiments
We present the preliminary results of our experiments, which are intended to highlight the challenges that must be overcome to successfully apply the KnowBert method to the biomedical domain with the UMLS Knowledge Base. We use the same pretrained backbone for our KnowBert-UMLS model as [18] in their original paper on KnowBert, i.e. English BERT$_{\text{BASE}}$ uncased.
4.1 Masked LM and Next Sentence Prediction
For a large source of raw biomedical text, we scraped the PubMed Central database of Open Access articles and processed them for next sentence prediction using the tool provided with the source code for “Knowledge Enhanced Contextual Word Representations” by [18]. At the end of this training phase, KnowBert can be used as a typical pretrained BERT model. Due to time constraints, we were unable to generate candidates for the approximately 180 million sequences in the corpus, and had to limit our re-training corpus to approximately 6 million sequences. As shown in Table 1, this lack of re-training data prevented the language model from successfully integrating the KAR, with a masked LM perplexity several orders of magnitude larger than that of BERT$_{\text{BASE}}$, BERT$_{\text{LARGE}}$, and the KnowBert models produced by [18].
Table 1. Masked language model perplexity for both BERT models, the KnowBert variants produced by [18], and KnowBert-UMLS.
Model                      Perplexity
BERT$_{\text{BASE}}$       5.5
BERT$_{\text{LARGE}}$      4.5
KnowBert-Wiki              4.3
KnowBert-Wordnet           4.1
KnowBert-W+W               3.5
KnowBert-UMLS              10387.7
4.2 NER
We choose to fine-tune KnowBert-UMLS on the Biomedical Named Entity Recognition task on the n2c2 corpus, previously known as i2b2 2010 [22], with an 80%–20% split between training and validation sets, using a cross-entropy loss. In Table 2, we compare our performance against four BERT-based models, namely BioBERT [11], clinicalBERT [1], BlueBERT [17] and BERT$_{\text{BASE}}$, all fully fine-tuned on the NER task with the same linear classifier architecture. The performance figures for our baselines were taken from [7].
Examples of correct and incorrect predictions made by KnowBert-UMLS, formatted according to the IOB2 standard, can be found in Tables 3 and 4 respectively. The example in Table 4 is a quite typical incorrect prediction, as it consists of a span that overlaps with the correct span and has a correct label. This type of error is the most common, constituting 32% of the model’s mistakes. Many of these mistakes are ambiguous even to humans: for instance, whether the token “Estimated” should be included in the “blood loss” entity is not self-evident. We give a complete breakdown of error types as specified by [7] in Table 5.
Table 2. Performance of BERT-based language models on the n2c2 NER task, measured as micro-averaged strict precision, recall and F1. Results for BioBERT, ClinicalBERT and BlueBERT from [7].
Model                   P      R      F1
BERT$_{\text{BASE}}$    0.85   0.87   0.86
BioBERT                 0.86   0.88   0.87
clinicalBERT            0.87   0.88   0.88
BlueBERT                0.88   0.90   0.89
KnowBert-UMLS           0.80   0.81   0.80
Table 3. Example of correct NER predictions by KnowBert-UMLS pulled from the n2c2 evaluation set.
Token                    True          Predicted
status                   O             O
post                     O             O
total                    B-treatment   B-treatment
abdominal                I-treatment   I-treatment
hysterectomy             I-treatment   I-treatment
and                      O             O
bilateral                B-treatment   B-treatment
salpingo-oophorectomy    I-treatment   I-treatment
.                        O             O
Table 4. Example of incorrect NER prediction by KnowBert-UMLS pulled from the n2c2 evaluation set.
Token        True        Predicted
Estimated    B-problem   O
blood        I-problem   B-problem
loss         I-problem   I-problem
was          O           O
100          O           O
cc           O           O
While the contextualized word representations contain enough information for the classification model to perform significantly better than chance, our results reveal a decrease in performance with respect to an unmodified BERT$_{\text{BASE}}$. This is a further demonstration that the re-training procedure is the performance bottleneck and requires more text than our candidate generator can realistically process in a reasonable time frame.
Table 5. Breakdown of types of mistakes made by KnowBert-UMLS in proportion of total prediction mistakes made.
Error type                            Proportion (%)
Correct label, overlapping span       32.0
Incorrect label, overlapping span     10.2
Incorrect label, correct span         15.1
False positive                        29.1
False negative                        13.6
Our evaluation is performed with SeqEval [16] in strict mode. Like the results from [7], its metrics are on an entity-level rather than at token-level, meaning that a true positive is a fully matching mention span. A predicted mention that overlaps with a true mention but is not identical counts as a false positive and a false negative.
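The effect of strict, entity-level scoring can be illustrated with a small self-contained sketch applied to the tag sequences of Tables 3 and 4. This is not the SeqEval implementation: the two helper functions are hypothetical, and ill-formed tag sequences (an I- tag without a preceding B- tag of the same label) are simply ignored, which is a simplification.

```python
def iob2_spans(tags):
    """Extract the set of (label, start, end_exclusive) spans from IOB2 tags.

    An I- tag that does not continue an open span of the same label is
    ignored (a simplification of this sketch).
    """
    spans = set()
    start, label = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def strict_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 with exact span matching."""
    gold, pred = iob2_spans(true_tags), iob2_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Table 3: both treatment mentions are predicted with exact boundaries.
t3_true = ["O", "O", "B-treatment", "I-treatment", "I-treatment",
           "O", "B-treatment", "I-treatment", "O"]
t3_scores = strict_prf(t3_true, list(t3_true))

# Table 4: the predicted span misses the first token ("Estimated"),
# so it counts as one false positive and one false negative.
t4_true = ["B-problem", "I-problem", "I-problem", "O", "O", "O"]
t4_pred = ["O", "B-problem", "I-problem", "O", "O", "O"]
t4_scores = strict_prf(t4_true, t4_pred)
```

Under this scoring, the Table 4 prediction earns no credit at all even though two of its three entity tokens are tagged correctly, which is why overlapping-span errors weigh so heavily on the strict F1.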
5 Conclusions
Successfully integrating UMLS knowledge into a pretrained LM using the KnowBert method presents a significant challenge due to the size of the knowledge base and the difficulty of generating candidate mentions. Our candidate generator based on QuickUMLS was not able to generate candidates with enough efficiency and precision to make re-training possible at the required scale. We are currently working on generating candidates for larger chunks of the re-training corpus in order to evaluate the progress made by KnowBert-UMLS as a function of corpus size, and to make projections on its performance when trained on the full dataset. In order to successfully re-train KnowBert-UMLS, the candidate generator must be improved significantly. Its main source of false negatives is the introduction of abbreviations of long terms at the beginning of the text which are subsequently re-used. These abbreviations are often absent from the UMLS and cannot be identified by the generator. Solving this issue would likely increase recall significantly when identifying candidate spans. This may allow a different recall/time compromise to be found within QuickUMLS settings. Regardless of possible improvements to recall, however, deploying this at scale, whether for re-training or practical text processing purposes, is likely to remain prohibitively slow for most individuals and organizations. Future work will involve finding a more effective and computationally efficient approach to candidate generation, for instance by treating it as a machine learning problem or by adding a fast NER-based span pre-selection step. Furthermore, whilst we chose to evaluate the performance of KnowBert-UMLS using BERT$_{\text{BASE}}$ as a backbone to isolate the effect of the KAR, the KnowBert method has the advantage of being compatible with other approaches such as BioBERT, clinicalBERT, BlueBERT, or SciBert. In addition to the potential performance improvements on biomedical tasks, these pretrained models may be less expensive to re-train due to the potentially smaller distributional shift between pre-training, KAR training, and re-training corpora. In addition to the improvements that need to be made to make KnowBert-UMLS competitive, there are a number of potential ways to enhance it and expand its range of applicability.
Multiple Knowledge Bases. As shown by [18], KnowBert is capable of accommodating multiple KARs for multiple KBs simultaneously. Depending on the practical application, it could be useful to develop a KnowBert model combining UMLS with WordNet, Wikipedia, YAGO [21], or other specialized KBs. It would also be interesting to assess the performance of one such model in order to understand to what extent multi-specialization is possible.
Re-training with Adapters. Adapters, as proposed by [10], have seen some success for efficiently fine-tuning pretrained LMs such as BERT. It is conceivable that this approach may aid in the re-training process by reducing the number of parameters to train, and may help reduce the memory footprint of KnowBert in some practical applications. Specifically, in cases that involve multiple knowledge bases or sets of knowledge bases used independently from each other, such an approach may allow one copy of a pretrained LM to be loaded into memory whilst the relevant set of KARs and adapters can be applied as a function of the token sequence being processed.
References
1. Alsentzer, E., et al.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78 (2019)
2. Arumae, K., Bhatia, P.: CALM: continuous adaptive learning for language modeling. arXiv preprint arXiv:2004.03794 (2020)
3. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3615–3620 (2019)
4. Bhowmik, R., Stratos, K., de Melo, G.: Fast and effective biomedical entity linking using a dual encoder. In: Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis, pp. 28–37 (2021)
5. Cuzzola, J., Jovanović, J., Bagheri, E.: RysannMD: a biomedical semantic annotator balancing speed and accuracy. J. Biomed. Inf. 71, 91–109 (2017)
6. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
7. Fraser, K.C., Nejadgholi, I., De Bruijn, B., Li, M., LaPlante, A., El Abidine, K.Z.: Extracting UMLS concepts from medical text using general and domain-specific deep learning models. In: EMNLP-IJCNLP 2019, p. 157 (2019)
8. Gururangan, S., et al.: Don’t stop pretraining: adapt language models to domains and tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360 (2020)
9. He, B., et al.: Integrating graph contextualized knowledge into pre-trained language models. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 2281–2290 (2020)
10. Houlsby, N., et al.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning, pp. 2790–2799. PMLR (2019)
11. Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, btz682 (2019)
12. Lee, K., He, L., Lewis, M., Zettlemoyer, L.: End-to-end neural coreference resolution. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 188–197 (2017)
13. Maldonado, R., Yetisgen, M., Harabagiu, S.M.: Adversarial learning of knowledge embeddings for the unified medical language system. In: AMIA Summits on Translational Science Proceedings, 2019, p. 543 (2019)
14. Michalopoulos, G., Wang, Y., Kaka, H., Chen, H., Wong, A.: UmlsBERT: clinical domain knowledge augmentation of contextual embeddings using the unified medical language system metathesaurus. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1744–1753 (2021)
15. Mohan, S., Li, D.: MedMentions: a large biomedical corpus annotated with UMLS concepts. arXiv:1902.09476 [cs] (2019)
16. Nakayama, H.: seqeval: a python framework for sequence labeling evaluation (2018). https://github.com/chakki-works/seqeval
17. Peng, Y., Yan, S., Lu, Z.: Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In: Proceedings of the 18th BioNLP Workshop and Shared Task, pp. 58–65 (2019)
18. Peters, M.E., et al.: Knowledge enhanced contextual word representations. In: EMNLP (2019)
19. Poerner, N., Waltinger, U., Schütze, H.: E-BERT: efficient-yet-effective entity embeddings for BERT. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 803–818 (2020)
20. Soldaini, L., Goharian, N.: QuickUMLS: a fast, unsupervised approach for medical concept extraction. In: MedIR Workshop, SIGIR, pp. 1–4 (2016)
21. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In: 16th International Conference on the World Wide Web, pp. 697–706 (2007)
22. Uzuner, Ö., South, B.R., Shen, S., DuVall, S.L.: 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J. Am. Med. Inf. Assoc. 18(5), 552–556 (2011)
23. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
24. Wang, R., et al.: K-Adapter: infusing knowledge into pre-trained models with adapters. arXiv:2002.01808 [cs] (2020)
25. Wang, X., et al.: KEPLER: a unified model for knowledge embedding and pretrained language representation. Trans. Assoc. Comput. Linguist. 9, 176–194 (2021)
26. Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., Liu, Q.: ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1441–1451 (2019)
One Step Beyond: Keyword Extraction in German Utilising Surprisal from Topic Contexts
J. Nathanael Philipp, Max Kölbl, Yuki Kyogoku, Tariq Yousef, and Michael Richter
Institute of Computer Science, NLP Group, Universität Leipzig, Augustusplatz 10, 04109 Leipzig, Germany
{jonas nathanael.philipp,tariq.yousef}@uni-leipzig.de, [email protected]
Abstract. This paper describes a study on keyword extraction in German with a model that utilises Shannon information as a lexical feature. Lexical information content was derived from large, extra-sentential semantic contexts of words in the framework of the novel Topic Context Model. We observed that lexical information content increased the performance of a Recurrent Neural Network in keyword extraction, outperforming TextRank and two other models, namely Named Entity Recognition and Latent Dirichlet Allocation, which were used for comparison in this study.
Keywords: Lexical semantics · Information theory · Topic Context Model · Recurrent Neural Network
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 774–786, 2022. https://doi.org/10.1007/978-3-031-10464-0_53
Our epistemic interest is threefold: we are interested in (i) whether information as a lexical feature of a word, derived from semantic contexts, represents the semantic relevance of that word in a text, (ii) whether information is a lexical feature learnable by a Recurrent Neural Network (RNN), and (iii) whether the performance of an RNN in keyword extraction improves when lexical information is given. We assume that high information of a word in a text is an important cue for the language processor to include the word in question in the text interpretation. We assume that lexical information is a semantic feature of words and that lexical information can be derived from semantic contexts. This approach makes information as a word feature more suitable than TF-IDF weighting of words based on their frequencies in documents, or TextRank, a graph-based model that captures frequencies of co-occurrences of words. Neither of these methods, however, determines the relevance of a word in a semantic context. As pointed out in related work, there are approaches and studies in which information theory is used for information retrieval. However, none of these approaches attempts to derive lexical information from semantic discourse. Our approach is meant to close this ‘semantic gap’ because language comprehension uses semantic contexts, such as those defined in discourse models like [18]. Our concept of information is located within Shannon’s Information Theory [40], and we use Shannon information (SI) as a lexical feature (in due course, we will use SI to denote lexical information content). In particular, we follow principles of Surprisal Theory [14,17]: the information of a sign is given by the negative logarithm of its conditional probability, where the condition of this probability may depend on extra-sentential data. The question is how likely, or expected, it is for a language processor to find a sign in an utterance, a paragraph, a complete text or even within the entire corpus. For the derivation of the information of words from semantic contexts, we employ in this study the novel Topic Context Model (TCM) [19,20], which considers as contexts the topics both in the local environment of a target word (for instance, a text or a paragraph) and in the larger environment (for instance, a corpus). The point of departure of the TCM is that natural language processing and comprehension require a semantic context. A small fictitious example may illustrate the idea of TCM: assume that the word chocolate occurs in a text in the context of two topics, say food and rain forest, whereby the topics have been disclosed by some topic model (for instance, Latent Dirichlet Allocation [7]; see subsection Topic Context Model below). In the context of food, chocolate may occur ten times, and in the context of rain forest it may occur five times. In the first context, chocolate is less surprising than in the topic context rain forest. That is to say, in the linguistic discourse of our example, the conditional probability of chocolate, given the topic food, is high. In contrast, the likelihood is lower given the topic rain forest.
Utilising Shannon’s definition of information [40], chocolate carries less information in the context of food than in the context of rain forest. In the studies of [19,20], evidence was found that TCM captures semantic principles of language comprehension. This study aims to provide additional positive evidence for this finding. We chose German as the language for keyword extraction because we consider the high morphological complexity and thus the high number of tokens in this language to be a special test requirement for the TCM. Compared with contextualised word embeddings [2,31], such as the BERT model [6,11], TCM defines a larger contextual space. The contextual space may comprise the complete corpus and thus be a semantic model of the discourse. In comparison to TCM, contextualised word embeddings are much more data-intensive, which might cause problems with languages that are only sparsely represented by digital corpora. Keywords (in the following, we use the term keywords to refer to both keywords and keyphrases) are, in the definition of [8], “a short set of one or a few words that represent a concept or a topic covered in a document”, and they should obey the criterion of ‘informativeness’ [42]. They reflect the ‘core sentiment’ of a document [5,45], making access to and recovery of information and documents [5] possible. That is to say, keywords are classifying features of texts. Why do we utilise information content for keyword extraction, or more generally, why do we assume a link between information theory and the semantics of natural language?
The advances in computational linguistics and computer science illustrate that Shannon’s information theory applies to the lexical semantics of natural language and thus to human communication, beginning with the influential work of [28]. In addition, to quote [3], there is “a strong need for an extension of the classical communication model to characterize not only sequences of bits, but also the meanings behind these bits”. [12] claimed in the seminal work Knowledge and the Flow of Information that “meaning is generated by the way information is coded” [12] and, additionally, that the transfer of knowledge can be modelled by information theory that meets the “quantitative demands of communication theory” [12]. [39] assume that information theory should also be capable of dealing with different levels of communication, as put in [39, 4]: “How precisely do the transmitted symbols convey the desired meaning? (The semantic problem)”. At first, semantics was not of interest to Shannon: “[t]hese semantic aspects of communication are irrelevant to the engineering problem [. . .]” [39]. Information theory scientists like Colin Cherry stated that it is “important to emphasize, at the start, that we are not concerned with the meaning or the truth of messages; semantics lies outside the scope of mathematical information theory.” [9]. However, Shannon’s co-author Warren Weaver emphasised that mathematical information should have a semantic level [39]. Subsequently, information theory was successfully applied to natural language, even by Shannon himself, who worked on the entropy of printed, written English [38]. His study takes up the work of Zipf, who in 1935 presented his work on the relation between frequency, form and rank of linguistic units.
Numerous studies followed in the ensuing decades on the correspondence of information and form, for instance, on the relation of word length and information (for example [25,32]), on semiotics, i.e., information and the arbitrariness of linguistic signs, on syntax [13] and, to a much smaller degree, also on semantics and pragmatics (cf. [36] on measuring semantic similarity with information content, [26] on measuring the consistency of translations with entropy and [4] on the semantics and expressiveness of intensifiers). What are the basic ideas of information theory, and how does it fit into communication models using language? Shannon defines information in terms of the likelihood of a sign [40]: information is the negative log-transformation of the sign’s probability, and in surprisal theory [14], information is derived from conditional probabilities, i.e., probabilities given a context. [21] equates a sign’s information with its surprisal, proportional to the effort needed to process that sign. [21] points out that in a model of language comprehension that employs information theory, large, extra-sentential contexts need to be considered. However, there is no clear definition of what a context is. This makes the notion of extra-sentential somewhat difficult to grasp and might explain why, to the best of our knowledge, there are no studies on the calculation of information using large contexts. In this paper, we will take up and discuss the idea of this type of context by employing the above-mentioned Topic Context Model (TCM). In the remainder of the paper, we summarise in Related Work relevant studies, amongst others on the
One Step Beyond: Keyword Extraction in German
correspondence of information and semantics, on information retrieval in the framework of information theory and on neural networks. The section Methods describes our corpus, the calculation of lexical information by the Topic Context Model and the architecture of the Recurrent Neural Network that is employed in our study. The sections Results and Conclusion and Outlook contain the presentation of the results and their evaluation, as well as a short note on the relation of our approach to a model of language learning.
2 Related Work
For keyword extraction, TF-IDF and TextRank have been widely used. Their drawback, however, is that they do not capture semantic information. TF-IDF proceeds from the assumption that frequencies of words provide evidence of similarity both of words and documents, while TextRank evaluates adjacency relations of words in documents to generate a graph of semantic similarity. It makes no use of semantic similarities between words that are. Neither model, however, calculates semantics within discourses, that is, semantic contexts. Similarly, the KEA algorithm [44], which was developed on the basis of TF-IDF and first occurrence, and graph-based approaches [10,27,41] lack a semantic aspect w.r.t. that type of context. In order to extract semantic information, [34] described experiments in which subjects had to guess linguistic units in a textual context to determine the information distribution in texts and words [30,33]. [36] introduced a method for determining the semantic similarity of concepts by information content based on the WordNet ontology. The point of departure is that concepts are more similar the more information they share. Resnik found that his measure yielded “encouraging” correlations with the human similarity ratings of the study in [29]. Follow-up studies utilising information content as a semantic measure in WordNet [22,37,48] resulted in even higher correlations with [29]. [26] proposed entropy as a measure to determine the consistency of translations by checking the language-to-language alignments. However, to the best of our knowledge, information retrieval is not a prominent application field of information theory. [42] presented a study on keyphrase extraction.
Their concept of informativeness was operationalised by calculating the Kullback-Leibler divergence between the probability distribution in a foreground corpus, which, for instance, can be the research results of a conference, and the probability distribution in a background corpus, for instance, research results in general. [35] utilised collocation information, that is, the Shannon entropy of bigrams, for extractive summarisation. Mutual information has been used by [1] and [16] for abstractive summarisation. Information theory is not common in keyword extraction, but a few methods using it have been proposed, for example [15]. For detecting keywords, they assume that highly relevant words should be concentrated in some portions of the text, and their model incorporates the distribution of occurrences of a word in the corpus.
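To make the foreground/background idea concrete, the following is a minimal sketch of a KL-based informativeness score. It is a unigram simplification and not the language-model method of [42]; the function names, the add-alpha smoothing and the toy corpora are assumptions made for illustration only.

```python
import math
from collections import Counter


def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}


def pointwise_kl(foreground_tokens, background_tokens):
    """Per-word contribution p(w) * log2(p(w) / q(w)) to KL(foreground || background)."""
    vocab = set(foreground_tokens) | set(background_tokens)
    p = unigram_probs(foreground_tokens, vocab)
    q = unigram_probs(background_tokens, vocab)
    return {word: p[word] * math.log2(p[word] / q[word]) for word in vocab}


# Toy corpora, invented for illustration only.
foreground = "topic model keyword extraction topic model".split()
background = "the model of the world and the model of language".split()

scores = pointwise_kl(foreground, background)
# Words that are much more probable in the foreground rank first.
ranked = sorted(scores, key=scores.get, reverse=True)
# Summing the per-word contributions gives the full KL divergence.
kl_divergence = sum(scores.values())
```

Smoothing is used here only so that every word has a non-zero probability in both corpora; without it the logarithm would be undefined for words missing from the background.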
The application of (R)NNs for keyword extraction yields promising results. [46] propose a novel recurrent neural network called joint-layer RNN. The joint-layer RNN has two hidden layers, whose two output layers are combined into an objective layer. The method is applied to keyphrase extraction from single tweets and also integrates the word sequence as contextual information. It achieves better results than other state-of-the-art methods such as RNN, long short-term memory (LSTM), etc. Likewise, [47] introduce a new model called target center-based LSTM, into which both the preceding and the following context of the target word are fed as contextual information. In conclusion, they point out that semantic relations and important words, albeit with low frequency, should be incorporated in future studies.
3 Methods
3.1 Text Corpus
We used the Heise¹ corpus from [19,20] as data resource. The Heise corpus contains a collection of texts on, as the website operator himself says, information and telecommunication technology and related fields, but also on the social impact of these technologies. We chose the corpus because of its thematic diversity in the above-mentioned fields, and assumed that the variety of topics would optimise the performance of the TCM. The corpus contains 100,673 texts with 38,633 keywords that occur in the text. Due to the memory limits of the LDA, we randomly selected texts from that corpus in order to maximise the number of texts while keeping the total number of words (lemmas, determined by spaCy²) under 240,000. Thus our corpus consists of 30,284 texts with 239,065 words. Each text consists of a headline and the text itself. In addition, for 15,454 texts, we have a lead, i.e., a short summary of the text, usually one or two sentences long. All named entities consisting of multiple words are additionally treated as a single word. The Heise/FAZ text collection offers a good basis for the application of the TCM since lexical information can be derived from topic contexts within a large set of local environments, i.e., short texts. We randomly picked 10% and used it as the validation set. The rest was used for training. There are 7,347 keywords in total; the named-entity recognizer (NER) from spaCy classified 122 as LOC, 288 as MISC, 23 as ORG, 40 as PER and 6,775 as neither. Table 1 shows the results of the NER, treating it as a baseline, for the keyword extraction on the training set. They reflect that only a small fraction of the keywords are named entities, but for about 86% of the texts there is at least one keyword that is a named entity. Similarly to what was reported in [20], the precision is low while the recall is relatively high, which also reflects the much higher number of words which are named entities but not keywords. The same, as expected, holds true for the validation set, see Table 2.

¹ https://heise.de
² https://spacy.io

Table 1. Precision, recall, F1-measure, and the accuracy values (a1–a5) of the NER keyword results for the training set. Accuracy n says how often a model predicted at least n keywords correctly.

Model                     a1      a2      a3      a4      a5      Precision  Recall  F1
NER_{ORG/PER}             0.6437  0.2002  0.0545  0.0237  0.0154  0.1878     0.2199  0.1777
NER_{ORG/PER/LOC}         0.6845  0.2381  0.0721  0.0291  0.0186  0.1466     0.2437  0.1616
NER_{ORG/PER/MISC}        0.8382  0.4199  0.1760  0.0772  0.0453  0.1423     0.3562  0.1910
NER_{ORG/PER/LOC/MISC}    0.8648  0.4550  0.1981  0.0875  0.0508  0.1259     0.3785  0.1785

3.2 TextRank
As a baseline, we employ TextRank [27]³. For keyword extraction, the TextRank algorithm builds a (directed) graph with words (or even sentences) as nodes within a text or a paragraph. The weight of a word is determined within a sliding context window and results essentially from the number of outgoing links of the words directly preceding the target word. We use a window size of 4 and compare the top 5 and top 10 words that TextRank reports.

3.3 Topic Context Model
The TCM calculates the SI of a word both from the large, extra-sentential environments in which a target word occurs, typically a corpus, and from smaller local environments, for instance texts or paragraphs. The TCM is built within the framework of Surprisal Theory [14,17]. The idea is that the contexts it uses to compute the SI values are the topics within the local and larger environment of a target word. To detect the topics, the topic model Latent Dirichlet Allocation (LDA) is used [7]. In general, we calculate SI using formula 1:

$$SI(w_i) = -\log_2\left(P(w_i \mid \text{topics})\right) \tag{1}$$
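As a small illustration of formula 1, the sketch below computes the information of a word from a conditional probability. The function name and the two probabilities are hypothetical; they only mirror the chocolate example from the introduction, where the same word is more expected in one context than in another.

```python
import math


def surprisal_bits(probability):
    """Shannon information -log2(p) of an event with probability p, in bits."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in (0, 1]")
    return -math.log2(probability)


# Hypothetical conditional probabilities of the word "chocolate" in two contexts.
p_chocolate_given_food = 0.05
p_chocolate_given_rainforest = 0.001

si_food = surprisal_bits(p_chocolate_given_food)
si_rainforest = surprisal_bits(p_chocolate_given_rainforest)
# The less expected the word is in a context, the more information it carries there.
```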
LDA. LDA is a statistical topic model that tries, through a generative process, to detect and identify the topics that appear in a corpus and the words that belong to them [7]. We define the context as a topic and calculate the average information SI for each word, see formula 2, where n is the number of topics of the LDA.

$$SI(w) = -\frac{1}{n} \sum_{i=1}^{n} \ln(w|t_i) \tag{2}$$

³ https://github.com/jnphilipp/TextRank
The term $w|t_i$ refers to the probability $P(w_i \mid \text{context})$, i.e., the probability of a word $w$ given a topic $t_i$, which is calculated in formula 3. Here, $c_d(w)$ is the frequency of a word $w$ in a document $d$, $|d|$ is the total number of words in the document $d$, $WT$ is the normalised word-topic distribution of the LDA⁴, and $P(t_i|d)$ is the probability of a topic $t_i$ given a document $d$, as provided by the LDA.

$$w|t_i = \frac{c_d(w)}{|d|} \, WT_{w,t_i} \, P(t_i|d) \tag{3}$$
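Formulas 2 and 3 can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the two-topic model and all numbers are invented, and the LDA outputs (topic-word weights and topic probabilities per document) are taken as given.

```python
import math


def normalise_rows(topic_word_weights):
    """Scale each topic's word weights so that they sum to 1."""
    return [[value / sum(row) for value in row] for row in topic_word_weights]


def word_given_topic(count_in_doc, doc_length, wt_word_topic, p_topic_given_doc):
    """Formula 3: relative frequency in d, times WT[w, t_i], times P(t_i | d)."""
    return (count_in_doc / doc_length) * wt_word_topic * p_topic_given_doc


def average_si(count_in_doc, doc_length, wt_for_word, p_topics_given_doc):
    """Formula 2: negative mean of the natural logs of w|t_i over all n topics."""
    n = len(p_topics_given_doc)
    log_sum = sum(
        math.log(word_given_topic(count_in_doc, doc_length, wt, p_topic))
        for wt, p_topic in zip(wt_for_word, p_topics_given_doc)
    )
    return -log_sum / n


# Toy model: 2 topics (rows) over a 3-word vocabulary (columns); values are invented.
wt = normalise_rows([[8.0, 1.0, 1.0], [1.0, 1.0, 8.0]])
p_topics_given_doc = [0.9, 0.1]  # P(t_i | d) for one document
doc_length = 10                  # |d|

# Word 0 occurs 5 times in the document, word 1 only once.
si_frequent = average_si(5, doc_length, [row[0] for row in wt], p_topics_given_doc)
si_rare = average_si(1, doc_length, [row[1] for row in wt], p_topics_given_doc)
```

The sketch assumes that every factor is strictly positive; a word with zero weight in some topic would make the logarithm undefined and would need separate handling.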
Furthermore, we use a scoring function, see formula 4. This is mainly to ensure that the calculated SI values lie between −1 and 1; we refer to the scored values as $\widehat{SI}$.

$$\widehat{SI}(w) = \mathrm{score}(w) = \tanh\left(\frac{SI(w) - \mu_d}{\sqrt{\sigma_d^2 + \epsilon}}\right) \tag{4}$$
Here $\mu_d$ is the mean of the SI of all the words $w$ in a document $d$, and $\sigma_d^2$ is the respective variance. To ensure that the standard deviation is not zero, we add $\epsilon = 10^{-7}$. We trained the LDA with 1000 topics.

3.4 Recurrent Neural Network
Similar to [46] and to [19,20], the RNNs we use predict whether a word in the input sequence is a keyword or not. The input texts are the same as for the LDA. The architectures of the RNNs are straightforward. For each of the text parts headline, lead, and text, the RNN has a separate input and output with one or two bidirectional GRU(s) in between. In total we trained 15 different RNNs, each for 15 epochs. First, we have three different input types: (i) text only, (ii) text and $\widehat{SI}$, i.e., the scored SI of formula 4, and (iii) $\widehat{SI}$ only. In (i), the words are given to an embedding layer, resulting in a vector of length 128 for each word. In (iii), the $\widehat{SI}$ values are used directly. In (ii), the embedding vector is concatenated with the $\widehat{SI}$ value, resulting in a vector of length 129 for each word. For each of the three input types we trained five configurations: three with one bidirectional GRU of different sizes (256, 512, 1024) and two with two bidirectional GRUs, also of different sizes (256, 512). The embedding and the bidirectional GRU(s) are the same for all three input types. Figure 1 shows a schematic of the architecture of an RNN with text and $\widehat{SI}$ as input. If a word is a keyword, 1 is output; if not, 0 is output.
⁴ model.components_ / model.components_.sum(axis=1)[:, np.newaxis], as suggested by https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
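The scoring function of formula 4 and the three input types of Section 3.4 can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the SI values and the zero-filled embeddings are invented, and placing epsilon inside the square root is an assumption about a formula that did not survive extraction cleanly.

```python
import math
from statistics import fmean, pvariance

EPSILON = 1e-7  # keeps the denominator away from zero


def score_document(si_values):
    """Standardise the SI values of one document and squash them into (-1, 1)."""
    mu = fmean(si_values)
    variance = pvariance(si_values, mu)
    # Assumption: epsilon is added to the variance before taking the root.
    denominator = math.sqrt(variance + EPSILON)
    return [math.tanh((si - mu) / denominator) for si in si_values]


def build_inputs(embeddings, scores, input_type):
    """Per-word feature vectors for the input types text, si, and text+si."""
    if input_type == "text":
        return [list(vector) for vector in embeddings]
    if input_type == "si":
        return [[score] for score in scores]
    if input_type == "text+si":
        return [list(vector) + [score] for vector, score in zip(embeddings, scores)]
    raise ValueError(f"unknown input type: {input_type}")


tokens = ["im", "besonders", "vom", "coronavirus", "betroffen"]
si_values = [2.1, 6.3, 2.4, 9.8, 7.0]       # invented SI values, one per token
embeddings = [[0.0] * 128 for _ in tokens]  # placeholder for learned 128-d embeddings

scores = score_document(si_values)
features = build_inputs(embeddings, scores, "text+si")  # 129 values per word
```

In a trained model the embeddings would come from the embedding layer; the placeholder vectors are there only to show how a length of 128 becomes 129 once the score is appended.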
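[Figure 1: for each of the text parts headline, lead, and text, the input is a word sequence (e.g. im; besonders; vom; coronavirus; betroffen; ...) together with its $\widehat{SI}$ values (e.g. −0.979; −0.426; −0.302; −0.483; 0.131; ...). The inputs pass through a bidirectional GRU, and each text part has its own output sequence (e.g. 0; 0; 0; 1; 0; ...).]
Fig. 1. Schematic of the RNN architecture with text and $\widehat{SI}$ as input.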
4 Results
In Table 2, we report the results of our models on the validation set. In addition to precision, recall, and F1-measure, we also report, similarly to [19,20], accuracy 1 to 5. Accuracy n says how often a model predicted at least n keywords correctly. Concerning the NER, we see results similar to those in Table 1: only a small number of named entities per text are keywords, and as expected, the NER models have a low precision with a relatively high recall. However, the chance is high that the NER selects at least one keyword correctly, which is indicated by high a1 values. The most complex NER model achieved an a1 of 88% and outperformed all other models in a1. For the LDA, we calculate precision, recall, etc. of the first 10% and 20% of the most informative words of a text. The results of the LDA show that while keywords and the top informative words of a text overlap, keywords are not necessarily among the most informative words. The low precision values are due to the same effect as for the NER, i.e., a lot more words are reported as candidates than there are actual keywords. For TextRank, we report two results: one where we look at the top five words and one where we look at the top ten words. The accuracy values are quite high, especially a1, but TextRank suffers from the same precision, recall and F1 problems that we observed with the NER. The keywords that TextRank generates are, to a great extent, not ground-truth keywords, which is reflected by the low precision, while the recall of both TextRank models is much higher. However, when it comes to F1 values, TextRank is clearly outperformed by the RNN. On the other hand, the TextRank top-10 model outperforms all other models in a2 and a3, but for a3 the lead is paper-thin. Regarding the RNN models, the first index refers to the input type, i.e., T if it was trained on texts, SI if it saw the $\widehat{SI}$, or both. The second index refers to the size of the GRU, the main difference between the five architecture
Table 2. Precision, recall, F1-measure, and the accuracy values (a1–a5) of the employed methods. Accuracy n says how often a model predicted at least n keywords correctly.

Model                     a1      a2      a3      a4      a5      Precision  Recall  F1
NER_{ORG/PER}             0.6351  0.1912  0.0532  0.0205  0.0135  0.1868     0.2148  0.1749
NER_{ORG/PER/LOC}         0.6813  0.2259  0.0657  0.0271  0.0162  0.1457     0.2380  0.1595
NER_{ORG/PER/MISC}        0.8487  0.4148  0.1691  0.0809  0.0509  0.1425     0.3584  0.1910
NER_{ORG/PER/LOC/MISC}    0.8771  0.4528  0.1896  0.0918  0.0558  0.1249     0.3808  0.1784
TextRank_{top5}           0.7781  0.3633  0.1998  0.1724  0.1707  0.2161     0.4341  0.2725
TextRank_{top10}          0.8606  0.5112  0.3078  0.2550  0.2398  0.1371     0.5351  0.2094
LDA_{10%}                 0.1219  0.0228  0.0162  0.0162  0.0159  0.0107     0.0552  0.0173
LDA_{20%}                 0.2665  0.0816  0.0532  0.0479  0.0476  0.0128     0.1371  0.0228
RNN_{T,256}               0.6902  0.3636  0.2919  0.2847  0.2840  0.5363     0.4669  0.4638
RNN_{T,SI,256}            0.7028  0.3768  0.3015  0.2933  0.2929  0.5571     0.4785  0.4798
RNN_{SI,256}              0.3362  0.1159  0.1143  0.1143  0.1143  0.3249     0.2038  0.2355
RNN_{T,512}               0.6989  0.3795  0.3012  0.2936  0.2926  0.5544     0.4772  0.4773
RNN_{T,SI,512}            0.7028  0.3768  0.3048  0.2959  0.2952  0.5576     0.4792  0.4812
RNN_{SI,512}              0.2645  0.0908  0.0888  0.0888  0.0888  0.2518     0.1605  0.1845
RNN_{T,1024}              0.6929  0.3702  0.2926  0.2853  0.2847  0.5562     0.4697  0.4759
RNN_{T,SI,1024}           0.6767  0.3639  0.2860  0.2807  0.2801  0.5423     0.4602  0.4645
RNN_{SI,1024}             0.3445  0.1159  0.1133  0.1133  0.1133  0.3245     0.2069  0.2371
RNN_{T,2×256}             0.6764  0.3557  0.2734  0.2678  0.2672  0.5416     0.4531  0.4622
RNN_{T,SI,2×256}          0.6988  0.3791  0.3025  0.2932  0.2926  0.5586     0.4775  0.4817
RNN_{SI,2×256}            0.3554  0.1219  0.1189  0.1189  0.1189  0.3349     0.2144  0.2455
RNN_{T,2×512}             0.6635  0.3504  0.2751  0.2698  0.2692  0.5615     0.4471  0.4660
RNN_{T,SI,2×512}          0.6929  0.3686  0.2933  0.2853  0.2847  0.5527     0.4696  0.4751
RNN_{SI,2×512}            0.3326  0.1169  0.1139  0.1139  0.1139  0.2299     0.3094  0.2031
types. However, the results of the RNN show that it is enough only to be trained on the $\widehat{SI}$ to outperform (F1) all previously mentioned methods. Increasing the RNN size improves the precision, recall and F1. RNN_{SI,512} and RNN_{SI,2×256} are exceptions to this. The RNNs trained only on the texts achieved even better results, indicating that the texts provide some information that the $\widehat{SI}$ does not. Training the RNNs on both text and $\widehat{SI}$ further improves the performance and has the overall best performance. But this improvement is only by a small amount, usually improving F1 by 0.01, which is not significant. An exception here is the RNN_{T,SI,1024}, as it is the only RNN whose performance did not improve. We attribute the lower performance of these models to the slower learning of the models.
The accuracy values show that (i) for about 88% of texts, at least one keyword is a named entity, as far as the spaCy model is concerned, and (ii) all RNNs trained on $\widehat{SI}$ have respectively the same values for a3–a5, showing that here is their limit.
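The measures reported in Tables 1 and 2 can be sketched as below. The definition of accuracy n follows the table captions; how precision, recall and F1 are averaged is not stated in the text, so the pooling of counts over all texts (micro-averaging) is an assumption, as are the function name and the toy data.

```python
def evaluate(predicted, gold, max_n=5):
    """Precision, recall, F1 and accuracy a1..a_max_n over a collection of texts.

    predicted and gold are lists of sets of keywords, one set per text.
    """
    true_pos = false_pos = false_neg = 0
    at_least = [0] * max_n  # at_least[n - 1]: texts with >= n correct keywords
    for pred, true in zip(predicted, gold):
        correct = len(pred & true)
        true_pos += correct
        false_pos += len(pred - true)
        false_neg += len(true - pred)
        for n in range(1, max_n + 1):
            if correct >= n:
                at_least[n - 1] += 1
    # Assumption: counts are pooled over all texts before computing the ratios.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    num_texts = len(gold) if gold else 1
    result = {"precision": precision, "recall": recall, "f1": f1}
    for n in range(1, max_n + 1):
        result[f"a{n}"] = at_least[n - 1] / num_texts
    return result


# Invented predictions and gold keywords for three texts.
predicted = [{"coronavirus", "virus", "china"}, {"apple"}, {"linux", "kernel"}]
gold = [{"coronavirus", "china"}, {"microsoft"}, {"linux", "kernel", "torvalds"}]

result = evaluate(predicted, gold)
```

A model that proposes many candidates per text, as the NER and TextRank baselines do, tends to raise the accuracy values and recall while lowering precision, which is the pattern discussed above.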
5 Conclusion and Outlook
The first two research questions could be answered in the affirmative: our study disclosed a correspondence between a word’s SI, derived from semantic contexts, and its semantic relevance in a text. An RNN can be trained on SI; that is to say, lexical information is learnable from texts. The RNN trained only on SI performed on a par with TextRank. Regarding the third research question, we observed small improvements in the results of the RNN models when trained both on the text and on SI, although the improvements are not significant. Thus the results for the third research question are inconclusive. Our study provides evidence for a semantic level of Shannon Information derived from semantic contexts in the discourse of a target word. The study can be interpreted in the sense that SI is a predictor of comprehension and understanding of natural language since we observed that SI is learnable from linguistic input. Lexical information in our approach is a cue to the semantic relevance of a word, with the amount of information content determining the strength of that cue. This interpretation of lexical information is compatible with connectionist models of language processing and learning, such as the Competition Model [23,24], in which the strength of cues determines the point of time and the quality of acquisition of linguistic levels. It follows that content words are learned earlier and initially better than function words because they carry higher information (see [43] for a discussion on noun and verb learning). The RNN models, particularly when trained both on text and SI, outperform TextRank w.r.t. the standard measures precision, recall and F1. This is a clear indication that the RNN, also when trained on the SI feature, selected keywords more accurately than the baseline models NER, LDA and TextRank. We conclude that the RNN uses the relevant semantic properties of keywords to a stronger extent than the baseline models.
The semantics of Shannon information is situated within the framework of distributional semantics. We thus postulate that further studies are needed to explore how lexical information content could contribute to a distributional model of lexical semantics. For models like word2vec that represent words as vectors derived from frequencies of co-occurrences, lexical information might improve usability and the semantic representation. Further research into the architecture and hyper-parameters of the RNN is needed to improve the usage of the lexical information further.

Acknowledgments. This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number: 357550571. The training of the LDA and neural networks was done on the High Performance Computing (HPC) Cluster of the Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) of the Technische Universität Dresden.
References
1. Aji, S.: Document summarization using positive pointwise mutual information. Int. J. Comput. Sci. Inf. Technol. 4(2), 47–55 (2012)
2. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649 (2018)
3. Bao, J., et al.: Towards a theory of semantic communication. In: 2011 IEEE Network Science Workshop, pp. 110–117. IEEE (2011)
4. Bennett, E.D., Goodman, N.D.: Extremely costly intensifiers are stronger than quite costly ones. Cognition 178, 147–161 (2018)
5. Bharti, S.K., Babu, K.S.: Automatic keyword extraction for text summarization: a survey. arXiv preprint arXiv:1704.03242 (2017)
6. Biemann, C., Heyer, G., Quastoff, U.: Wissensrohstoff text. eine einführung in das text mining (2. auflage) (2021)
7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
8. Çano, E., Bojar, O.: Keyphrase generation: a multi-aspect survey. arXiv preprint arXiv:1910.05059 (2019)
9. Cherry, E.C.: A history of the theory of information. Proc. IEE Part III Radio Commun. Eng. 98(55), 383–393 (1951)
10. Cohen, J.: Trusses: cohesive subgraphs for social network analysis. National security agency technical report 16:3-1 (2008)
11. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
12. Dretske, F.: Knowledge and the Flow of Information. MIT Press (1981)
13. Hahn, M., Jurafsky, D., Futrell, R.: Universals of word order reflect optimization of grammars for efficient communication. Proc. Natl. Acad. Sci. 117(5), 2347–2353 (2020)
14. Hale, J.: A probabilistic Earley parser as a psycholinguistic model. In: Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, pp. 1–8. Association for Computational Linguistics (2001)
15. Herrera, J.P., Pury, P.A.: Statistical keyword detection in literary corpora. Eur. Phys. J. B 63(1), 135–146 (2008). arXiv: cs/0701028
16. Huo, H., Liu, X.H.: Automatic summarization based on mutual information. Appl. Mech. Mater. 513–517, 1994–1997 (2014)
17. Jaeger, T.F., Levy, R.P.: Speakers optimize information density through syntactic reduction. In: Advances in Neural Information Processing Systems, pp. 849–856 (2007)
18. Kamp, H., Van Genabith, J., Reyle, U.: Discourse representation theory. In: Gabbay, D.M., Guenthner, F. (eds.) Handbook of Philosophical Logic, pp. 125–394. Springer, Dordrecht (2011). https://doi.org/10.1007/978-94-007-0485-5_3
19. Kölbl, M., Kyogoku, Y., Philipp, J., Richter, M., Rietdorf, C., Yousef, T.: Keyword extraction in German: information-theory vs. deep learning. In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence - Volume 1: NLPinAI, pp. 459–464. INSTICC, SciTePress (2020)
20. Kölbl, M., Kyogoku, Y., Philipp, J.N., Richter, M., Rietdorf, C., Yousef, T.: The semantic level of Shannon information: are highly informative words good keywords? A study on German. In: Loukanova, R. (ed.) Natural Language Processing in Artificial Intelligence - NLPinAI 2020. SCI, vol. 939, pp. 139–161. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-63787-3_5
21. Levy, R.: Expectation-based syntactic comprehension. Cognition 106(3), 1126–1177 (2008)
22. Lin, D., et al.: An information-theoretic definition of similarity. In: ICML, vol. 98, pp. 296–304 (1998)
23. MacWhinney, B.: The competition model. In: Mechanisms of Language Acquisition, pp. 249–308 (1987)
24. MacWhinney, B., Bates, E.: Functionalism and the competition model. In: The Crosslinguistic Study of Sentence Processing, pp. 3–73 (1989)
25. Mahowald, K., Fedorenko, E., Piantadosi, S.T., Gibson, E.: Info/information theory: speakers choose shorter words in predictive contexts. Cognition 126(2), 313–318 (2013)
26. Dan Melamed, I.: Measuring semantic entropy. In: Tagging Text with Lexical Semantics: Why, What, and How? (1997)
27. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411 (2004)
28. Miller, G.A.: The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychol. Rev. 63(2), 81 (1956)
29. Miller, G.A., Charles, W.G.: Contextual correlates of semantic similarity. Lang. Cogn. Process. 6(1), 1–28 (1991)
30. Novak, L., Piotrovskij, R.: Esperimento di predizione ed entropia della lingua rumena. Statistics linguistics, Bolona (1971)
31. Peters, M.E., Neumann, M., Zettlemoyer, L., Yih, W.: Dissecting contextual word embeddings: architecture and representation. arXiv preprint arXiv:1808.08949 (2018)
32. Piantadosi, S.T., Tily, H., Gibson, E.: Word lengths are optimized for efficient communication. Proc. Nat. Acad. Sci. 108(9), 3526–3529 (2011)
33. Piotrowski, R.: Text informational estimates and synergetics. J. Quant. Linguist. 4(1–3), 232–243 (1997)
34. Piotrowski, R.G.: Quantitative linguistics and information theory (quantitative linguistik und informationstheorie) (2005)
35. Ravindra, G.: Information theoretic approach to extractive text summarization. Ph.D. thesis, Supercomputer Education and Research Center, Indian Institute of Science, Bangalore (2009)
36. Resnik, P.: Using information content to evaluate semantic similarity in a taxonomy. arXiv preprint arXiv:cmp-lg/9511007 (1995)
37. Seco, N., Veale, T., Hayes, J.: An intrinsic information content metric for semantic similarity in WordNet. In: ECAI, vol. 16, p. 1089 (2004)
38. Shannon, C.E.: Prediction and entropy of printed English. Bell Syst. Tech. J. 30(1), 50–64 (1951)
39. Shannon, C.E., Weaver, W.: The Mathematical Theory of Communication. University of Illinois Press (1949)
40. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (2013)
41. Tixier, A., Malliaros, F., Vazirgiannis, M.: A graph degeneracy-based approach to keyword extraction. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1860–1870 (2016)
42. Tomokiyo, T., Hurst, M.: A language model approach to keyphrase extraction. In: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pp. 33–40 (2003)
43. Waxman, S., Xiaolan, F., Arunachalam, S., Leddon, E., Geraghty, K., Song, H.: Are nouns learned before verbs? Infants provide insight into a long-standing debate. Child Dev. Perspect. 7(3), 155–159 (2013)
44. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: KEA: practical automated keyphrase extraction. In: Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pp. 129–152. IGI Global (2005)
45. Zhang, C.: Automatic keyword extraction from documents using conditional random fields. J. Comput. Inf. Syst. 4(3), 1169–1180 (2008)
46. Zhang, Q., Wang, Y., Gong, Y., Huang, X.-J.: Keyphrase extraction using deep recurrent neural networks on Twitter. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 836–845 (2016)
47. Zhang, Yu., Tuo, M., Yin, Q., Qi, L., Wang, X., Liu, T.: Keywords extraction with deep neural network model. Neurocomputing 383, 113–121 (2020)
48. Zhou, Z., Wang, Y., Gu, J.: A new model of information content for semantic similarity in wordnet. In: 2008 2nd International Conference on Future Generation Communication and Networking Symposia, vol. 3, pp. 85–89. IEEE (2008)
Language Use and Susceptibility in Online Conversation
Lu Xiao, Qiyi Wu, Sucheta Soundarajan, and Jinfen Li
Syracuse University, Syracuse, NY 13210, USA
[email protected]

Abstract. Prior research indicates that persuasion attempts in small and private online communications can be very influential. Yet, most existing online persuasion studies focus on large-scale and open online discussions. In this study, we investigated whether and how one’s language use indicates their susceptibility to persuasion in one-to-one synchronous online text-based chats. We analyzed 815 one-to-one online discussions by 321 pairs. Our results show that discussions in which one or both participants change their views tend to have more positive emotions, more affective processes, and more impersonal pronouns. Additionally, individuals who did not change their minds tend to focus more on problem solving whereas those who changed their minds focus more on relationship building. Our findings imply the potential of using surface-level linguistic features in predicting the persuasion outcome in a one-to-one online discussion, shedding light on the development of persuasive dialogue systems, which are on the rise.
Keywords: Online persuasion · Susceptibility · One-to-One discussion
1 Introduction
Persuasion is defined as an attempt to change one’s actions, beliefs, or behaviors [1]. Researchers have found that many factors influence the outcome of an online persuasion process, such as the language used in the communication and the credibility of the communicator [2, 3], the persuasion strategies applied in the communication [4], the interaction dynamics [5], and the online environments [6]. However, most of these studies examine persuasion processes in online discussion forums like Reddit r/changemyview, where participation is often open and large-scale. Other online persuasion settings have not attracted enough research attention. At the same time, the influence of private online communications in spreading conspiracy theories and disinformation or in expressing stigmatizing opinions is expected to be more significant when the mass media is no longer trusted [7] or public platforms suppress the expression of such opinions or the sharing of this kind of information. A better understanding of online persuasion processes and mechanisms in private communications such as synchronous one-to-one online communications is expected to help advance this research area. Another gap in online persuasion studies is the perspective the researchers take in the study. There are two sides in a typical persuasion process: the side of the persuader that
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 787–799, 2022. https://doi.org/10.1007/978-3-031-10464-0_54
L. Xiao et al.
makes persuasion attempts and the side that faces the attempts. Prior studies have mainly focused on the side of the persuader by answering questions like what linguistic features indicate an online comment’s persuasion power and how one’s credibility affects the persuasion power of their comments. With a few exceptions, questions about the other side of the persuasion are much less explored [2, 5, 8]. To contribute to narrowing these two identified research gaps, we analyzed one-to-one synchronous online text-based chats. In this context, the two participants first provided their responses to questions about a scientific topic, and then exchanged their perspectives in the provided chat window. They were then asked to provide their responses again to the same questions. Each chat session is about one question. In this persuasion context, we attempted to answer these research questions: 1) Do the linguistic features of the participants reveal how likely they will be persuaded, and how? 2) Do the linguistic features of the discussion indicate that a successful persuasion has taken place, i.e., someone has changed their view because of the discussion, and how? The rest of the paper is organized as follows. We first review the literature about the connections between language use in the text and the persuasion power of the text, the existing studies about persuasion in online discussions, and the studies about online persuasion in one-to-one settings. We next describe the dataset we used in the analysis and the process of filtering and cleaning the data. We also describe how we conducted the language and statistical analysis. We then present our findings in detail in the results section. Afterwards, we discuss the contributions and implications of our work as well as the limitations of the study, and conclude with suggestions for future work.
2 Related Work

2.1 Language Use and Persuasion

Language used in a message is considered one of the most important elements that affect the persuasiveness of the message [9]. For instance, psycholinguistic researchers found that positive words were more easily comprehended by receivers than negative ones [10]. Language intensity refers to the degree to which a source’s attitude toward an issue deviates from neutrality [11]. It has been shown that the source’s language intensity plays a role in the receiver’s attitude change: as the speaker’s language intensity increases, the clarity of the message also increases [12]. The syntactic structure of a message concerns the ordering of its words. A prior study has shown that this ordering matters for persuasion [13]. Additionally, in communication situations, social power affects the language style used in the process [14]. The use of ‘powerless’ language such as certain tag questions (e.g., “don’t you think?”), hesitation (e.g., “umm”, “hmm”), and hedging (e.g., “I guess”, “sort of”) is associated with lower social power [15], whereas powerful language indicates higher social status [16]. Areni and Sparks [16] found that speakers who use powerful language come across as more persuasive than those who use powerless language.

Murphy [17] outlined four widely observed features of a text that are known to lead to persuasion: argument structure, content, comprehensibility, and credibility. Argument structure is not only about the ordering of the words but also implies the strategies applied in the persuasion process [18], e.g., presenting both sides of an argument to make the message more persuasive [19]. Comprehensibility concerns how much cognitive load a receiver needs to understand the message, and credibility is the trustworthiness of the source. Language use affects the judgment of the message’s credibility and comprehensibility [20]. The content aspect focuses on word or phrase choices, such as the use of powerful or powerless language or emotional words.

2.2 Persuasion in Online Discussions

There is increasing research interest in the factors that shape persuasion processes in online discussions, such as the participants’ gender [21], their prior experiences [22–24], and the group setting [25]. Previous studies examined various factors of an online comment’s persuasiveness in changing a participant’s view, such as the participant’s argument itself [8], how strongly they hold the position [5], the interaction dynamics between the participant and the other people in the discussion [2], and the reasoning strategies used in the others’ comments [4].

Researchers have explored the linguistic features indicative of a comment’s persuasion power in Reddit’s r/changemyview (CMV) discussions (e.g., [2]) and in Wikipedia’s article for deletion (AfD) discussions [6]. As an active subreddit, CMV offers a place where people share their views on a subject and welcome different perspectives that attempt to change those views. In an AfD discussion, participants offer their opinions on how to handle an article that has been proposed for deletion from Wikipedia. Per Wikipedia’s policy on deleting articles, participants need to justify their opinions, and the person who closes the discussion needs to base the decision on these justifications rather than on a majority vote. Researchers found that persuasive comments in these discussions are more likely to reveal negative emotions than non-persuasive comments [2, 6].
2.3 Online Persuasion in One-To-One Setting

A couple of studies address online persuasion in one-to-one settings. The authors of [26] analyzed the intricate organization of strategic disclosures and appeals employed in human persuasion conversations. The conversation data were obtained from an Amazon Mechanical Turk experiment designed to have one participant persuade the other to donate to a specific charity. Besides the dialogue data, they also collected information about participants’ demographic and psychological backgrounds, including personality, morality, value systems, and willingness to donate. The authors annotated 10 persuasion strategies in these dialogue data (e.g., logical appeal, emotion appeal, credibility appeal), and through machine learning classification experiments they explored which types of persuasion strategies led to larger donations depending on the individuals’ personal backgrounds.

Interested in modeling the development of people’s persuasive ability, Luu, Tan, and Smith [27] introduced a method that quantifies the skill level of individual debaters. This method builds on the Elo model of skill in two-player games [28] and integrates linguistic features of debaters’ content. The researchers found that the best debaters’ skills improve over time and that their linguistic profile contributes to the estimation of their skill improvement.

In summary, there has been extensive work in traditional persuasion studies that connects language use with persuasion behavior and outcomes, from surface-level features (e.g., the use of positive words) to higher-level language use (e.g., the syntactic structure of a message). Online persuasion studies in the context of discussions also explore language patterns at different levels. Only a few studies examine one-to-one online persuasion contexts, and the surface-level linguistic cues that signify persuasion power or capability in such contexts remain largely unexplored.
3 Research Methodology

The dataset we analyzed was obtained from an experiment conducted by Educational Testing Service (ETS) [29]. In this experiment, student participants from the United States who had attended college for at least one year were recruited, randomly paired into teams, and assigned to answer a set of scientific questions that probe their knowledge about volcanoes. Participants in a team did not know with whom they were paired and were given time for a warm-up discussion before the experiment. They were then asked to answer seven single-choice (SC) questions and four open-ended questions about volcanoes. For each SC question, participants first provided their answers individually, then discussed the question with their partner, and lastly answered the question individually again. The experiment was conducted in an online environment, and a text-based chat was provided for discussion.

ETS recruited 1,000 students and randomly paired them into 500 teams. Data from 20 teams were removed due to incomplete or invalid answers. Our data include the teams’ discussions of the seven SC questions, members’ answers to the seven SC questions both before and after the discussion, and the individuals’ responses to a questionnaire administered by the ETS researchers about their demographics (age, gender, native language, etc.).

We identify the persuasion power of the discussion for an SC question by comparing the initial and final answers of the team members to the question. If a member did not change the initial answer, then the discussion did not have persuasion power over this individual, and vice versa. We filtered out the discussions that may confound our measurement as follows:

• Discussions in which participants talked about problems with the online environment. After the discussion of the question, there is a time interval for the participants to change their answers in the system. Our preliminary analysis of the discussion content found that some participants were not able to change their answers because of a system error. For instance, they made comments like “I can’t change my answer”, “It won’t let me change”, “it gets stuck here”, etc. We removed such discussions from the analysis.
• Discussions in which both team members had the same before- and after-discussion answers. This implies that no persuasion attempt was made during the discussion. Therefore, we removed these discussions from the analysis.

Considering that native and non-native English speakers may differ in their language use and thus introduce noise into the analysis, we also excluded the discussions that involve non-native English speakers. We acknowledge the potential differences in language use even among native English speakers due to cultural and historical backgrounds, e.g., Latin vs. African Americans. However, we did not differentiate them in this analysis, given the small chance of these differences affecting the use of language for the persuasion power. After filtering out the missing and unrelated data and removing individuals whose native language is not English, we have 321 teams and 815 discussions remaining for the analysis (Fig. 1).
Fig. 1. Experiment process conducted by ETS
Based on the team members’ initial answers and final answers, there are seven types of situations observed in this dataset:

1. neither team member changed their initial answers, and their initial answers were different (N = 269 discussions, Type A discussions)
2. one team member changed their initial answer to their partner’s initial answer and their partner did not change their answer (N = 512 discussions, Type B discussions)
3. neither team member changed their initial answers which were different (N = 10 discussions)
4. members had the same answers and stayed the same (N = 10 discussions)
5. members’ initial answers were different, and their final answers were the initial answers of their team members, i.e., the pair switched their answers (N = 5 discussions)
6. members had different answers initially and changed their answers but their final answers were also different (N = 3 discussions)
7. members had different answers initially and had the same final answers after the discussion (N = 16 discussions).
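The two main situations in the list above can be told apart mechanically from the four recorded answers of a team. The following is a minimal sketch of that labeling step, not the procedure used in the study; the function name and the catch-all label "other" are assumptions made for illustration, and the five rare situations are deliberately not separated.

```python
def classify_discussion(initial_a, final_a, initial_b, final_b):
    """Label one discussion from both members' answers before and after the chat.

    Returns "A" (different initial answers, nobody changed), "B" (exactly one
    member adopted the partner's initial answer), or "other" for every
    remaining situation. The label "other" is a hypothetical catch-all.
    """
    if initial_a == initial_b:
        return "other"  # no disagreement to resolve at the start
    changed_a = initial_a != final_a
    changed_b = initial_b != final_b
    if not changed_a and not changed_b:
        return "A"
    if changed_a and not changed_b and final_a == initial_b:
        return "B"  # member a is the one who was persuaded
    if changed_b and not changed_a and final_b == initial_a:
        return "B"  # member b is the one who was persuaded
    return "other"


if __name__ == "__main__":
    # Hypothetical answer letters for three discussions.
    print(classify_discussion("a", "a", "b", "b"))
    print(classify_discussion("a", "b", "b", "b"))
    print(classify_discussion("a", "b", "b", "a"))
```

Keeping the rare situations in one bucket mirrors the focus of the analysis on Type A and Type B discussions.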
As shown in the above list, the last five situations have only a few occurrences each, ranging from 3 to 16. We therefore focus on comparing the language patterns in the first two types of discussions through statistical analysis.

We first apply Linguistic Inquiry & Word Count (LIWC) analysis to the discussion content. LIWC has been widely used as a language analysis tool in prior online persuasion studies that examine the linguistic features of persuasive comments (e.g., [3]). In our analysis, the content of each discussion is one input text. As a bag-of-words approach, LIWC calculates the percentage of various linguistic, psychological, social, and emotional categories in the input, based on the occurrences of words in the text that are associated with each category [30]. The LIWC output for each discussion therefore shows the percentage of each category in its text. We next conduct comparative analyses to examine the differences in the use of these LIWC categories between the two types of discussions.

For a Type B discussion, we label the member who changed their initial answer to match the other member’s initial answer the Agreeable member, and the member who did not change their initial answer the Non-Agreeable member. Interested in the dynamics between Agreeable and Non-Agreeable members in a discussion, we compare the word counts of their comments in a discussion and the number of turns each made. We also explore whether one type of member tends to start the discussion more often than the other.

We also compare the language use of two types of individuals: those who never changed their initial answer across all seven questions versus the remaining individuals in the data. Of the 642 individuals (i.e., 321 teams), 259 never changed their answers, and 383 changed at least once. At the individual level, we concatenate the sentences written by an individual across all the discussions they participated in (out of the 815 discussions) and use the result as the LIWC input for that individual’s language use. We next conduct comparative analyses to examine the differences in the use of the LIWC categories between the two types of individuals.
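To make the bag-of-words idea concrete, the sketch below computes category percentages the way the paragraph above describes them: the share of tokens in a text that belong to each category. It is only an illustration. The LIWC dictionary itself is not reproduced here, so the tiny word lists, the category names, and the tokenizer are all assumptions, and LIWC features such as word stems are not handled.

```python
import re

# Hypothetical toy word lists; these are NOT the LIWC dictionary.
TOY_CATEGORIES = {
    "assent": {"agree", "ok", "okay", "yes", "yeah"},
    "impersonal_pronoun": {"it", "its", "it's", "that", "what", "which"},
    "positive_emotion": {"good", "great", "happy"},
}

TOKEN_PATTERN = re.compile(r"[a-z']+")


def category_percentages(text, categories=TOY_CATEGORIES):
    """Return, per category, the percentage of tokens found in its word list."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if not tokens:
        return {name: 0.0 for name in categories}
    return {
        name: 100.0 * sum(token in words for token in tokens) / len(tokens)
        for name, words in categories.items()
    }


def individual_input(comments):
    """Concatenate all comments by one individual into a single input text."""
    return " ".join(comments)


if __name__ == "__main__":
    # Hypothetical chat lines by one participant across two discussions.
    text = individual_input(["yes I agree it is d", "what letter did you choose"])
    for name, share in category_percentages(text).items():
        print(f"{name}: {share:.2f}%")
```

The same function serves both levels of analysis: it receives the full discussion text at the discussion level and the concatenated comments of one person at the individual level.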
4 Results

4.1 Language Use in Discussions

In Type A discussions, both members kept their initial answers, which were different from each other (N = 269 discussions). In Type B discussions, one member changed their initial answer to match the initial answer of the other member, and the other member did not change the initial answer (N = 512 discussions). Our analysis of language use shows that the use of five linguistic features differed statistically significantly between the two types of discussion content. These features are assent, affect, impersonal pronouns, positive emotion, and dictionary words. As shown in Table 1, the medians of these linguistic features are higher in Type B discussions than in Type A discussions.

Assent covers agreement words like ‘agree, ok, yes’, and affect concerns words that reflect affective processes, including positive and negative feelings (e.g., ‘happy’, ‘cried’). It is not surprising that discussions that lead to one member changing their mind tend to have more assent words, as that member was showing agreement with the other member. A perhaps more interesting finding is that these discussions also tend to have more positive emotions and more affective processes. As discussed in the related work section, previous studies found that negative emotions occur more in persuasive comments than in non-persuasive ones [2, 6]. What has contributed to these contradictory findings? In previous studies, the online discussions are asynchronous and open to many participants. In the experiment, on the other hand, people engaged in one-to-one, synchronous communication. We speculate that these differences have played a role in how emotions affect online persuasion processes. Further investigation is needed to explore this.

As introduced earlier, each discussion was about the answer to a single-choice question regarding volcanoes. Given that there is no statistically significant difference in the amount of content (measured as the number of words) in a discussion, it is likely that the higher use of impersonal pronouns (it, what, which, that, etc.) in Type B discussions implies more focus on the question or the discussion topic, e.g., “I put d but i wasn’t sure i know it’s definitely not c do you remember the video?”, “what letter did you choose”. For instance, it and its variations (e.g., its, it’s) occurred most in LIWC’s impersonal pronoun category. In Type A discussions, there are 119 occurrences, which is about 0.44 occurrences per discussion (N = 269). In Type B discussions, the number of occurrences is 353, i.e., about 0.67 occurrences per discussion (N = 512).

Table 1. LIWC result of Mann–Whitney U test at the discussion level (p < .05)

LIWC features          Median (Type A Discussion)   Median (Type B Discussion)   Effect size
Assent                 1.74                         3.67                         0.14
Affective processes    5.56                         6.73                         0.10
Positive emotion       4.17                         5.26                         0.08
Impersonal pronouns    3.92                         4.65                         0.07
Dictionary words       79.31                        80.32                        0.07
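For readers who want to see what a comparison of this kind involves, the sketch below implements a two-sided Mann–Whitney U test with the normal approximation (with tie correction, without continuity correction) using only the Python standard library. The paper does not state how its effect sizes were computed; the formula r = |z| / sqrt(N) used here is an assumption, as are all function names and the sample values.

```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def mann_whitney_u(group_a, group_b):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U for group_a, z, p_value, r), where r = |z| / sqrt(N) is an
    assumed effect-size definition, not one confirmed by the paper.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b, n = len(group_a), len(group_b), len(pooled)
    ranks = average_ranks(pooled)
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    # Variance of U with the standard correction for ties.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every value identical: no evidence of a difference
        return u_a, 0.0, 1.0, 0.0
    z = (u_a - n_a * n_b / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u_a, z, p_value, abs(z) / math.sqrt(n)


if __name__ == "__main__":
    # Hypothetical assent percentages for a few Type A and Type B discussions.
    type_a = [0.0, 1.2, 1.7, 2.5, 3.1]
    type_b = [1.9, 3.3, 3.7, 4.0, 5.2, 6.1]
    u, z, p, r = mann_whitney_u(type_a, type_b)
    print(f"U={u:.1f} z={z:.2f} p={p:.3f} r={r:.2f}")
```

With samples as large as those in Table 1 the normal approximation is a reasonable choice; for very small samples an exact test would be preferable.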
4.2 Dynamics in Type B Discussions

To examine the dynamics between Agreeable and Non-Agreeable members in a Type B discussion, we calculated R1, the ratio of the number of words in the Agreeable member’s comments to that in the Non-Agreeable member’s comments, and R2, the ratio of the number of turns the Agreeable member had to the number the Non-Agreeable member had. Our analysis shows that the 95% confidence interval for R1 is (1.42, 1.74) and for R2 is (1.06, 1.20). These findings imply that in a Type B discussion, the Non-Agreeable member tended to talk more both in terms of turns and the content.

To examine whether one type of member tended to start the discussion more often than the other, we looked at who took the first turn in each discussion. If the Non-Agreeable member had the first turn, we coded the discussion as 1; otherwise, we coded it as 0. We then computed the 95% confidence interval of the mean of this sequence of 0s and 1s, which is (0.58, 0.69). This suggests that Non-Agreeable members started the discussion more often than Agreeable members.

4.3 Language Use in Individual Comments

The comparative analysis of the language use of the two types of individuals, namely, those who never changed their answers (non-susceptible) vs. the remaining users in the data (susceptible), shows that the frequencies of multiple LIWC features differ statistically significantly. We present in Table 2 the LIWC categories that show an effect size of at least 0.1 (p < 0.05).

Table 2. LIWC result of Mann–Whitney U test at the individual level (effect size ≥ 0.1)

LIWC features                  Median (Non-susceptible Individuals)   Median (Susceptible Individuals)   Effect size
Analytical thinking            52.55                                  39.45                              0.19
Authentic                      58.07                                  69.96                              0.15
Comparatives                   3.92                                   3.33                               0.15
Adjectives                     5.36                                   4.55                               0.11
Insight                        3.25                                   3.64                               0.11
Regular verbs                  16.87                                  17.81                              0.10
Auxiliary verbs                8.00                                   8.89                               0.13
Dictionary words               78.47                                  80.36                              0.12
1st person singular pronouns   5.41                                   6.58                               0.17
2nd person singular pronouns   1.05                                   1.39                               0.12
Impersonal pronouns            4.23                                   4.90                               0.13
Function words                 41.82                                  44.64                              0.17
Analytical thinking is about the ability to identify problems and find information to develop a workable solution. People who score high in analytical thinking demonstrate more formal and logical thinking. Individuals who never changed their answers had higher analytical thinking scores than the rest of the group. On the other hand, individuals who changed their answers at least once scored higher in the Authentic category, which reflects how personal, honest, and disclosing the language is: the higher the score, the less guarded the writer. One possible explanation is that the two types of individuals engaged in the discussions with different styles: more logical thinking, as in solving a problem (e.g., “more stations mean more data to consult”, “High frequency appears at the beginning of the seismic event. The rocks start to crack at that time.”), vs. more personal or self-disclosing language, as in building a relationship (e.g., “what do you think?”, “we thought it was the best”). Therefore, people who were more focused on problem solving in these discussions were less likely to change their answers, as they either had put more effort into coming up with their answers or were more serious about them, whereas people who were more focused on relationship building were more willing to change their answers as a compromise or accommodation to the other. Interestingly, in their LIWC analysis of susceptible and non-susceptible users’ Reddit comments, Mensah, Xiao, and Soundarajan [5] also found that non-susceptible users tend to use more words in the analytical thinking category.

Non-susceptible users also used fewer function words such as auxiliary verbs and pronouns. Prior experiments about language use in group interactions suggest that people with higher status consistently used fewer first-person singular pronouns and more second-person singular pronouns [31]. Our finding is partially consistent with these earlier experiments. As mentioned earlier, the influential discussions have more impersonal pronouns. It is therefore likely that the impersonal pronouns observed at the discussion level come mainly from the more susceptible individuals. Function words serve to make the text grammatically correct, more readable, or more comprehensible. This implies that non-susceptible users put less effort into making the other party understand their perspectives, which is consistent with the observed differences in their thinking styles.

The mean of the comparatives category in the LIWC 2015 dictionary is 2.23 [30], which is lower than that in the texts by both types of individuals. Given that the discussions were about choosing an answer from the provided options, this finding is understandable (e.g., “A and B?”, “4, sort of the middle between the crater and the base?”).
The observation that non-susceptible participants used more comparatives implies that they engaged in more logical or formal thinking in this context, which is consistent with the observed differences in thinking style between the two types of individuals. However, our explanation is based only on a comparison of word frequencies; further investigation through triangulated approaches is desirable, e.g., by soliciting users’ feedback through an online questionnaire.
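The interval estimates reported in Sect. 4.2 (for the word and turn ratios and for the share of discussions started by the Non-Agreeable member) can be approximated as sketched below. The paper does not say which interval method was used, so the normal-approximation intervals here are an assumption, and the function names and sample counts are hypothetical.

```python
import math
from statistics import mean, stdev

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def mean_ci(values, z=Z_95):
    """Normal-approximation confidence interval for the mean of `values`."""
    centre = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return centre - half_width, centre + half_width


def proportion_ci(flags, z=Z_95):
    """Wald confidence interval for the share of 1s in a 0/1 sequence."""
    share = sum(flags) / len(flags)
    half_width = z * math.sqrt(share * (1 - share) / len(flags))
    return share - half_width, share + half_width


def per_discussion_ratios(agreeable_counts, non_agreeable_counts):
    """Agreeable / Non-Agreeable count per discussion, skipping zero denominators."""
    return [a / b for a, b in zip(agreeable_counts, non_agreeable_counts) if b > 0]


if __name__ == "__main__":
    # Hypothetical word counts for five Type B discussions.
    ratios = per_discussion_ratios([30, 42, 18, 25, 60], [20, 35, 20, 10, 45])
    print("ratio CI:", mean_ci(ratios))
    # Hypothetical first-turn coding: 1 if the Non-Agreeable member started.
    print("first-turn CI:", proportion_ci([1, 1, 0, 1, 0, 1, 1, 0, 1, 1]))
```

An interval for a ratio that lies entirely above 1 indicates that the numerator group produced more words or turns on average, and an interval for the first-turn share that lies entirely above 0.5 indicates that the member coded as 1 started more often.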
5 Discussion

In today’s social media world, where users increasingly come across fake news articles and misinformation, people may become more and more skeptical about social media content. In fact, a recent experiment shows that people’s trust in the content is significantly affected by the sharer, potentially even more so than by the source itself [32]. In their examination of the information-related events of early 2014, focusing on the annexation of Crimea, Jaitner and Mattsson [7] showed how Russia or pro-Russian entities leveraged a variety of tools and methods to achieve information superiority. The authors pointed out the significance of “private” or interpersonal channels of communication in influencing others. These studies imply the importance of understanding the persuasion processes and mechanisms in online private communications, such as one-to-one chats or discussions in closed, small-group chat spaces, which have been explored only to a limited extent (please refer to the related work section for details).
Our discussion-level analysis of one-to-one online chats implies the potential of using surface-level linguistic features (e.g., impersonal pronouns) to predict the outcome of a one-to-one online discussion that involves changing a member’s opinion. The individual-level analysis suggests that one’s surface-level linguistic patterns in a certain communication context are associated with one’s likelihood of changing their perspective. These findings are expected to contribute to the understanding and prediction of persuasion processes and outcomes in private online communications.

Our findings also offer insights for the development of persuasive dialogue systems such as chatbots. A chatbot is a computer program that engages in conversations with humans. Because of the productivity they offer, chatbots are widely adopted by businesses to provide information and to support customer service and marketing (e.g., [33]). Knowledge of how to deliver a persuasive argument is expected to have great potential in the development of such chatbot systems.

One of our plans is to go beyond the exploration of surface-level linguistic cues in understanding online persuasion processes in one-to-one settings. Prior studies have demonstrated the potential of persuasive strategies in these processes [26, 34]. We expect that the discourse parsing structure of a message reflects the use of persuasive strategies, at least to some extent. We are also interested in examining the contextual factors in one-to-one persuasion contexts. Through a crowdsourcing study, Schaekermann et al. [35] explored how worker deliberation affects resolvability and accuracy. They found that various factors determine case resolvability, such as the level of and reason for the initial disagreement and the amount and quality of deliberation activities. Our next step is to explore whether and how the availability of the initial reasoning affects the discussion outcome and interacts with the individual’s susceptibility.
Our study has two major limitations. First, the participants’ confidence in their prior choices can affect their susceptibility to the alternative choices. Their confidence level, however, may not be reflected in their comments. Second, persuasion in a scientific argument can be different from persuasion in subjective discussions. In the experiment, the participants were working on scientific problems as part of the learning activities, and there was only one session. The language use patterns observed in this setting might differ from those in another discussion context, say, one in which people offer their subjective opinions regarding an issue and are tasked to convince each other to accept their views.
6 Conclusion

Social engagement and interaction are increasingly taking place through new and advanced forms of online communication. More and more of these interactions involve complex processes of persuasion and influence [36, 37]. Researchers are increasingly interested in studying these processes, such as understanding the many factors that influence the outcome of an online persuasion process. However, most of these studies examine persuasion processes in online discussion forums like Reddit r/changemyview, where participation is often open and large-scale. Other online persuasion settings, especially those that involve small-group and private communications, have not attracted enough research attention, despite the expectation that persuasion attempts in such settings are likely more influential and create higher societal impact, e.g., in spreading conspiracy theories and disinformation or in expressing stigmatizing opinions [7]. In addition, while prior studies have mainly focused on the side of the persuader by answering questions like what linguistic features indicate an online comment’s persuasion power and how one’s credibility affects the persuasion power of their comments, only a few explore the other side of the persuasion (the side that faces the persuasion attempt), e.g., how they respond and the characteristics of those who are more likely to be influenced or who are not easily persuaded.

As a first step to close these literature gaps, we analyzed one-to-one synchronous online text-based chats to explore the linguistic features of the participants that reveal how likely they are to be persuaded. Our analysis shows that those who did not change their minds tend to use more analytical thinking, whereas susceptible participants used more personal and self-disclosing words. Susceptible participants also used more function words, suggesting that they put more effort into making the other party understand their perspectives. In the same communication context, we also explored the language use in a discussion that indicates successful persuasion. We found that discussions with successful persuasion tend to have more positive emotions, have more affective processes, and use more impersonal pronouns.
References 1. O’Keefe, D.J.: Persuasion, The International Encyclopedia of Communication (2002) 2. Tan, C., Niculae, V., Danescu-Niculescu-Mizil, C., Lee, L.: Winning arguments: interaction dynamics and persuasion strategies in good-faith online discussions. In: Proceedings of the 25th International Conference on World Wide Web (WWW), pp. 613–624 (2016) 3. Xiao, L., Khazaei, T.: Changing others’ beliefs online: online comments’ persuasiveness. In: Proceedings of the 10th International Conference on social media and Society, pp. 92–101 (2019) 4. Hidey, C., McKeown, K.R.: Persuasive influence detection: the role of argument sequencing. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 5173–5180 (2018) 5. Mensah, H., Xiao, L., Soundarajan S.: Characterizing susceptible users on Reddits changemyview. In: Proceedings of the 10th International Conference on Social Media and Society, pp. 102–107 (2019) 6. Xiao, L.: A message’s persuasive features in Wikipedia’s article for deletion discussions. In: Proceedings of the 9th International Conference on Social Media and Society, pp. 345–349 (2018) 7. Jaitner, M., Mattsson, P.A.: Russian Information Warfare of 2014, Presented at the 7th International Conference on Cyber Conflict, Tallinn, Estonia (2015). https://www.ccdcoe.org/upl oads/2018/10/Art-03-Russian-Information-Warfare-of-2014.pdf 8. Jo, Y., Poddar, S., Jeon, B., Shen, Q., Rosé, C. P., Neubig G.: Attentive interaction model: modeling changes in view in argumentation. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), vol. 1, pp. 103–116 (2018) 9. Ng, N.S., Bradac, J.J.: Power in Language: Verbal Communication and Social Influence. Sage Publications Inc., Thousand Oaks (1993)
798
L. Xiao et al.
10. Jacoby, J., Nelson, M.C., Hoyer, W.D.: Corrective advertising and affirmative disclosure statements: their potential for confusing and misleading the consumer. J. Mark. 46(1), 61–72 (1982) 11. Bowers, J.W.: Language intensity, social introversion, and attitude change. Speech Monographs 30, 345–352 (1963) 12. Hamilton, M.A., Hunter, J.E.: The effect of language intensity on receiver evaluations of message, source, and topic. Persuasion: Advances through meta-analysis, pp. 99–138 (1998) 13. Motes, W.H., Hilton, C.B., Fielden, J.S.: Language, sentence, and structural variations in print advertising. J. Advertis. Res. (1992) 14. Blankenship, K.L., Craig, T.Y.: Language and persuasion: tag questions as powerless speech or as interpreted in context. J. Exp. Soc. Psychol. 43, 112–118 (2007) 15. Blankenship, K.L., Craig, T.Y.: Language use and persuasion: multiple roles for linguistic styles. Soc. Pers. Psychol. Compass 5(4), 194–205 (2011) 16. Areni, C.S., Sparks, J.R.: Language power and persuasion. Psychol. Mark. 22(6), 507–525 (2005) 17. Murphy, P.K.: What makes a text persuasive? Comparing students’ and experts’ conceptions of persuasiveness. Int. J. Educ. Res. 35(7–8), 675–698 (2001) 18. Chambliss, M.J., Garner, R.: Do adults change their minds after reading persuasive text? Writ. Commun. 13(3), 291–313 (1996) 19. Allen, M.: Meta-analysis comparing the persuasiveness of one-sided and two-sided messages. Western J. Speech Communicat. 55(4), 390–404 (1991) 20. Hosman, L.A.: Language and persuasion, The persuasion handbook: developments in theory and practice (2002) 21. Guadagno, R.E., Cialdini, R.B.: Online persuasion: an examination of gender differences in computer-mediated interpersonal influence. Group Dyn. Theory Res. Pract. 6(1), 38 (2002) 22. Cooke, A.D., Sujan, H., Sujan, M., Weitz, B.A.: Marketing the unfamiliar: the role of context and item-specific information in electronic agent recommendations. J. Mark. Res. 39(4), 488–497 (2002) 23. 
Gershoff, A., Mukherjee, A., Mukhopadhyay, A.: Consumer acceptance of online agent advice: Extremity and positivity effects. J. Consum. Psychol. 13(1–2), 161–170 (2003) 24. Lydon, J.E., Jamieson, D.W., Zanna, M.P.: Interpersonal similarity and the social and intellectual dimensions of first impressions. Soc. Cogn. 6(4), 269–286 (1988) 25. Price, V., Nir, L., Cappella, J.N.: Normative and informational influences in online political discussions. Commun. Theory 16(1), 47–74 (2006) 26. Wang, X., et al.: Persuasion for good: towards a personalized persuasive dialogue system for social good. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 5635–5649 (2019) 27. Luu, K., Tan, C., Smith, N.A.: Measuring online debaters’ persuasive skill from text over time. Trans. Associat. Computat. Linguist. 7, 537–550 (2019) 28. Elo, A.E.: The Rating of Chessplayers, Past and Present. Arco Publishers (1978) 29. Hao, J., Liu, L., von Davier, A.A., Kyllonen, P.C.: Initial steps towards a standardized assessment for collaborative problem solving (CPS): practical challenges and strategies, Innovative assessment of collaboration, pp. 135–156. Springer, Cham (2017). https://doi.org/10.1007/ 978-3-319-33261-1_9 30. Pennebaker, J.W., Booth, R.J., Boyd, R.L., Francis, M.E.: Linguistic inquiry and word count: LIWC2015, The development and psychometric properties of LIWC2015 (2015) 31. Kacewicz, E., Pennebaker, J.W., Davis, M., Jeon, M., Graesser, A.C.: Pronoun use reflects standings in social hierarchies. J. Lang. Soc. Psychol. 33(2), 125–143 (2014) 32. Sterrett, D., et al.: Who shared it? Deciding what news to trust on social media. Digit. J. 7(6), 783–801 (2019)
How Does the Thread Level of a Comment Affect its Perceived Persuasiveness? A Reddit Study
Lu Xiao and Humphrey Mensah
Syracuse University, Syracuse, NY 13210, USA
[email protected]
Abstract. Online interactions increasingly involve complex processes of persuasion and influence. Compared to the long history and richness of persuasion studies in traditional communication settings, we have limited understanding of how people are influenced by others in online communications and how persuasion works in online environments. While it is common in online discussions for comments to be nested under a specific thread, it is unknown whether and how a comment’s thread level affects its perceived persuasiveness. To explore this question, we collected and analyzed threaded discussions in Reddit’s r/changemyview. We found that the perceived persuasiveness of a comment fluctuates systematically from the top thread level to the most nested level. We conducted a semantic similarity analysis among adjacent comments in the threads, examining how similar the comments are with respect to their content. Our results suggest that the first thread comment brings up a new idea or perspective, and the next comment matures it by adding new information to elaborate it; therefore, this next comment is more likely to receive a delta point than the first comment. This pattern continues onto the subsequent comments. By implying that there is a common reasoning pattern in the threaded discussions in Reddit r/changemyview, our study contributes to a comprehensive understanding of online participants’ reasoning behavior in threaded discussions.
Keywords: Online persuasion · Threaded discussion · Reddit r/changemyview
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 800–813, 2022. https://doi.org/10.1007/978-3-031-10464-0_55
1 Introduction
Internet users increasingly interact with others through social media. The four attributes of social media communication (SMC), namely anonymity, synchronicity, openness, and heterogeneity, make it convenient and fast to share and spread information in one’s social network. Meanwhile, SMC increasingly involves complex processes of persuasion and influence, apart from mere information sharing and dissemination [1, 2]. These attributes also make it challenging for users to engage in complex interactions such as argumentation on SMC. A good understanding of online persuasion processes and mechanisms is important for better understanding and supporting argument mining research.
Online persuasion has become a growing research field, and a subreddit discussion forum called r/changemyview (CMV) is a popular site for such research because of its purpose and rules. CMV is an active subreddit with about 1.1 million subscribers (as of September 10, 2020). A CMV user publishes one of her views on a subject as the original post (OP). The user is expected to be open to different views and willing to change the belief if she is convinced by the other users’ comments/arguments. Once a view is posted on the CMV forum, other users can provide comments/arguments and reason against the OP’s initial view. If a comment successfully changes the OP’s original belief, CMV requires the OP to give a delta point (Δ) and provide an explanation of why and how the comment has changed their view. CMV is a heavily moderated forum whose moderators monitor and control any post or comment that does not follow the rules. The CMV’s delta mechanism, along with its strict rules, provides a unique and valuable place for studying online persuasion. Previous CMV studies have examined various factors of a comment’s persuasiveness in changing OPs’ view, such as the use of function words and punctuation marks, the sentence structure, the order of the comments, the interaction dynamics of discussion, and the users’ credibility [3–5].
Fig. 1. A Reddit r/changemyview threaded discussion
In CMV, a comment may respond to the original post (OP) of the discussion thread or to any previous comment. If a comment responds to the OP, it is at the top level of the discussion. If the comment responds to a previous comment, it is nested in a thread rooted at the comment it responds to. Reddit allows at most nine levels of embedded threading in a discussion (or ten levels if we include the top level). Figure 1 shows an example of a CMV discussion with different discussion threads.
Prior research on top-level comments suggests that a CMV comment is less likely to receive a delta point if it enters the discussion at a later stage [5, 6]. However, it is still unknown whether and how a comment’s thread level relates to its likelihood of getting a delta point. To fill this gap, we first examined the percentage of comments that received delta points at different thread levels in our dataset. We noticed that while the number of CMV comments that received delta points generally decreases from the top level to the most deeply nested level, the percentage of such comments fluctuates up and down from one thread level to the adjacent level. We then investigated whether this fluctuation is affected by the development and change of topics in these comments by measuring how similar the comments are to each other in terms of their content.
2 Related Work
Researchers are increasingly interested in studying persuasion mechanisms and processes on social media. In these environments, the whole communication history is often accessible to the participants and even to the public. One way to leverage this resource is to create annotated corpora for persuasion studies. For example, Anand et al. [7] developed a classification of the persuasive attempts in online texts. This classification scheme included Cialdini’s six principles of influence [8], Marwell and Schmitt’s twelve strategy types for securing behavioral compliance [9], and some of Walton et al.’s argumentation schemes [10]. Also, with a crowdsourced annotation approach, Habernal and Gurevych [11] classified 11,650 web arguments into two groups, convincing and non-convincing arguments, and made the annotated corpus available to the public.
Computational techniques are also being developed to explore the indicators of a social media comment’s persuasive power. For example, request fulfillment is central to many social platforms such as Q&A websites (e.g., StackOverflow.com) and crowdfunding communities (e.g., Kickstarter.com). For such requests to be fulfilled by other participants, the requester needs to employ persuasion strategies to convince others to help. By examining the request content, researchers found that the language in these requests, gratitude and reciprocity, and the status of the requester in the interactions have predictive power on the success of a request (e.g., [12, 13]). Furthermore, [5, 6] studied persuasive comments from Reddit r/changemyview (CMV) discussions. They found statistically significant differences in the use of function words between non-persuasive and persuasive comments. From a psychological and social perspective, function words reflect how people communicate, whereas content words convey what they are saying. Therefore, function words are much more closely linked to measures of writers’ and readers’ social and psychological worlds [14]. In addition, persuasive comments tend to be longer in sentences and have a more complex structure. Besides these surface-level linguistic features, sentence structure is also found to be an important indicator. For example, Iyer et al. [15] developed an unsupervised, domain-independent, multiclass classification model that detects the type of persuasion in a text. The researchers found that sentence structure plays an important role in indicating different persuasion tactics.
Contextual factors have also been explored in online persuasion research. Guadagno and Cialdini [16] conducted two studies that compared participants’ attitudes after hearing a series of arguments from a same-gender communicator via e-mail and face-to-face interaction. The studies found that the communication medium affected the perceived persuasiveness of the messages for women but not for men. Specifically, women were more likely to be persuaded in face-to-face interaction than in the e-mail condition, whereas men showed no difference. Price et al. [17] analyzed a series of 60 online group discussions and found that one person’s arguments were influenced by the others’ arguments or even just the others’ opinion statements. The authors of [5] found that the entry order of a comment and the status of the commenter in the CMV community affect its perceived persuasiveness. Despite these prior efforts, the role of a comment’s position in a thread in its perceived persuasiveness has not been studied, to the best of our knowledge.
3 Research Methodology
3.1 Dataset
We followed the data collection procedure and the SQL database structure suggested by [18] to collect CMV comments for this study. In our dataset, each OP has a unique submission ID (SID). Each comment has a unique ID (CID) and the following information: (1) the SID, (2) its parent comment’s CID, (3) its thread level, and (4) the comment’s content. For each discussion, we created a “tree” structure to reflect the threads in the discussion. The root of such a tree is the original post (OP). Its children are the comments at the top level. Each of these children may have one or more branches, representing the threaded discussions in response to the comment associated with the child. Figure 2 presents a visual mapping between a CMV discussion and its “tree” structure. As shown in this figure, this discussion tree has three subtrees with lengths varying from four to five, counted from the root of the tree, i.e., the OP. Note that the top-level comments are at level 0 and at depth one from the root OP. With the collected data, we obtained 8,303 discussion trees with 216,454 comments. Among the subtrees of these trees, 10,415 have comments that received delta points.
Fig. 2. Structure of the topic and its “tree” structure
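The tree construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors’ code: the record layout (`cid`, `parent`) and the helper names `build_children` and `thread_levels` are assumptions made for the example, with `parent=None` marking a comment that replies directly to the OP.

```python
from collections import defaultdict

def build_children(comments):
    """Map each parent ID (None stands for the OP) to the IDs of its direct replies."""
    children = defaultdict(list)
    for c in comments:
        children[c["parent"]].append(c["cid"])
    return children

def thread_levels(comments):
    """Assign a thread level to every comment: top-level comments are level 0,
    and a reply to a level-k comment is level k + 1."""
    children = build_children(comments)
    levels = {}
    stack = [(cid, 0) for cid in children[None]]
    while stack:
        cid, level = stack.pop()
        levels[cid] = level
        stack.extend((child, level + 1) for child in children[cid])
    return levels

# Hypothetical toy discussion: two top-level comments and one nested chain.
toy = [
    {"cid": "a", "parent": None},
    {"cid": "b", "parent": "a"},
    {"cid": "c", "parent": "b"},
    {"cid": "d", "parent": None},
]
levels = thread_levels(toy)
```

In this toy example, `a` and `d` sit at level 0, `b` at level 1, and `c` at level 2, which matches the level numbering used in the text.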
3.2 Delta Awarded Comments on the Trees: Our Observation and Hypotheses
With these 8,303 discussion trees, we first examined the percentage of comments that received delta points at each level. Since a delta point is usually given by the author of the original post, i.e., the OP, to another participant, we calculated this percentage using the number of comments at a level that are not from the OP, instead of the total number of comments. Table 1 shows the number and percentage of comments at each thread level made by the original posters (OPs) or by others. Table 2 shows the number and percentage of comments that have received delta points. Additionally, we acknowledge that each discussion tree may have multiple subtrees with varying levels. We therefore also counted the number of times that a level had the highest percentage of delta-receiving comments among the subtrees of a discussion and provided this information in Table 2 as well. For example, as shown in Table 2, there were 1,538 times that level 1 had the highest percentage of delta-receiving comments in a discussion.

Table 1. Number and percentage of comments at different thread levels by OPs or by others

Level #   # of comments   # of comments by OP   % of OP’s comments   # of comments by others
1         21421           3882                  18.12%               17537
2         29829           294                   0.99%                27818
3         27570           4703                  17.06%               22502
4         29608           274                   0.93%                25767
5         21644           3046                  14.07%               18327
6         20116           181                   0.90%                18256
7         14261           1789                  12.54%               12286
8         12010           125                   1.04%                11113
9         8775            1057                  12.05%               7619
10        7384            70                    0.95%                6935
At the top level, comments that enter the discussion later are less likely to receive delta points [5]. While the likelihood of receiving delta points also decreases in general as the thread level goes higher, the percentage of comments that received delta points fluctuates from one level to the next, e.g., as shown in Table 2, 7.83% at level 1, 13.86% at level 2, 3.23% at level 3, 7.08% at level 4, etc. In other words, the third, fifth, and seventh comments in a chain (levels 2, 4, and 6) are more likely to receive delta points than the second, fourth, and sixth comments (levels 1, 3, and 5). Note that the thread level does not reflect the entry order of a comment, so it is understandable that there are fluctuations in the decreasing trend. However, what has contributed to this systematic pattern of fluctuation?
Table 2. Situations of comments receiving delta points at different thread levels

Level #   # of delta comments   % of delta comments   # of times this level has the highest percentage of delta comments
1         1374                  7.83%                 1538
2         3855                  13.86%                3529
3         726                   3.23%                 253
4         1825                  7.08%                 1533
5         409                   2.23%                 146
6         885                   4.85%                 700
7         225                   1.83%                 83
8         439                   3.95%                 336
9         94                    1.23%                 40
10        198                   2.86%                 145
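As a check on how the percentages in Table 2 relate to Table 1, the sketch below divides the number of delta comments at each level by the number of comments made by users other than the OP. The counts are copied from the two tables; the variable and function names are illustrative only.

```python
# Counts copied from Table 1 (comments by others) and Table 2 (delta comments),
# keyed by the level number used in the tables.
comments_by_others = {1: 17537, 2: 27818, 3: 22502, 4: 25767, 5: 18327,
                      6: 18256, 7: 12286, 8: 11113, 9: 7619, 10: 6935}
delta_comments = {1: 1374, 2: 3855, 3: 726, 4: 1825, 5: 409,
                  6: 885, 7: 225, 8: 439, 9: 94, 10: 198}

def delta_percentage(level):
    """Percentage of non-OP comments at `level` that received a delta point."""
    return 100.0 * delta_comments[level] / comments_by_others[level]

percentages = {lvl: round(delta_percentage(lvl), 2) for lvl in delta_comments}
for lvl, pct in percentages.items():
    print(f"level {lvl:2d}: {pct:5.2f}%")
```

The computed values reproduce the percentage column of Table 2 and make the alternation visible: each even-numbered level has a higher percentage than the odd-numbered level directly above it.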
Threaded comments are those that respond to previous comments. We speculated that this observed fluctuating pattern in the likelihood of receiving delta points, i.e., perceived persuasiveness, reflects how ideas are introduced and then developed across adjacent comments. For instance, when comment B responds to comment A and becomes the child of comment A, its content is still very relevant to A but includes some different information to show a different perspective or point. Comment B thus brings in a new but relevant aspect to A, but this aspect is not yet a mature thought in this discussion context. When comment C responds to comment B, its main purpose is to explicate that point, e.g., through further elaboration or more examples. Therefore, comment C is perceived to be more persuasive than comment B even though they are adjacent to each other. Then, as comment D responds to comment C, it introduces another new but relevant aspect to this threaded discussion, and when comment E responds to D, it matures this new aspect and hence is perceived as more persuasive than comment D. In the CMV context and in Table 1, level 1 comments are like comment B responding to the top-level comments (i.e., comment A), level 2 comments are like comment C, level 3 comments are like comment D, and level 4 comments are like comment E. To examine whether this explanation is right, we need to measure whether level 1 comments introduce a new but relevant aspect to top-level comments, and whether level 2 comments are more elaborated with examples or details or are better written. Considering the number of comments gathered, it is impractical to manually read and analyze these issues. Hence, as a first step, we measured the semantic similarity among these comments. By definition, semantic similarity measures how similar two texts are at the content level.
We hypothesized that as level 1 comments introduce new but still very relevant ideas to level 0, and level 2 comments offer more information or better arguments to mature the ideas in level 1, level 0 and level 1 comments would be more similar to each other than level 1 and level 2 comments. Likewise, level 2 and 3 comments would have a higher similarity than level 3 and 4 comments, and so on. In general, we hypothesized that two adjacent levels’ comments are more similar than two non-adjacent ones (e.g., level 0 and level 1 are more similar than level 0 and level 2). In summary, our hypotheses can be formulated as follows based on Table 2:
• Hypothesis 1: The similarity score of two higher adjacent levels (closer to the root) is equal to or higher than that of the two adjacent levels below them, starting from (level 0, level 1).
  Hypothesis 1a: The similarity score of (level 1, level 2) is less than that of (level 0, level 1).
  Hypothesis 1b: The similarity score of (level 3, level 4) is less than that of (level 2, level 3).
  Hypothesis 1c: The similarity score of (level 5, level 6) is less than that of (level 4, level 5).
  Hypothesis 1d: The similarity score of (level 6, level 7) is equal to or higher than that of (level 7, level 8).
  Hypothesis 1e: The similarity score of (level 8, level 9) is equal to or higher than that of (level 9, level 10).
• Hypothesis 2: Among the comments of three sequential levels, the similarity score of the two adjacent levels is higher than that of the two non-adjacent ones.
3.2.1 Similarity Measure and Mann Whitney U Test: Our Analysis
In this similarity analysis, we focused on the measurement and comparison among non-top-level comments. Therefore, a subtree in this analysis should have at least four levels, i.e., the root, the top-level comment, and two non-top-level comments. Thus, we removed those subtrees with a depth of three or less from further analysis. Also, we noticed that some comments were deleted, and the corresponding text field in the dataset has the content “[deleted]” or “[removed]”. We removed the subtrees that contained “[deleted]” or “[removed]” from further analysis. In the end, we obtained 5,631 discussion trees (i.e., 5,631 OPs and the corresponding discussions) and 29,261 subtrees. The largest number of subtrees had five levels. Based on the distribution of the path length in Fig. 3, we considered levels up to 10.
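The filtering step can be sketched as follows, under the assumption that a “subtree” corresponds to one root-to-leaf path through the discussion (the OP, a top-level comment, and the chain of replies beneath it). The function names and the data layout are hypothetical and only serve the illustration.

```python
def root_to_leaf_paths(children, node="OP"):
    """Yield every path from `node` down to a leaf as a list of node IDs."""
    replies = children.get(node, [])
    if not replies:
        yield [node]
        return
    for reply in replies:
        for path in root_to_leaf_paths(children, reply):
            yield [node] + path

def keep_path(path, text, min_depth=4):
    """Keep a path with at least `min_depth` nodes (OP included) whose
    comments contain no deleted or removed text."""
    if len(path) < min_depth:
        return False
    return all(text[cid] not in ("[deleted]", "[removed]") for cid in path[1:])

# Hypothetical toy discussion with two chains under the OP.
children = {"OP": ["a", "d"], "a": ["b"], "b": ["c"], "d": ["e"], "e": ["f"]}
text = {"a": "I disagree because ...", "b": "But consider ...", "c": "Good point.",
        "d": "Another angle ...", "e": "[deleted]", "f": "Thanks."}
kept = [p for p in root_to_leaf_paths(children) if keep_path(p, text)]
```

Here the chain through `a`, `b`, and `c` is kept because it has four nodes including the OP, while the chain through `d` is dropped because one of its comments was deleted.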
Fig. 3. Number of subtrees with different length. (x-axis: path length, 2 to 20; y-axis: number of subtrees, 0 to 12,000)

There are many ways to measure how similar two texts are at the content level. We chose a common method, cosine similarity, in this study. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. With this approach, every two comments were treated as a small corpus, and a collection of unique words was obtained from this corpus. Then, each comment was converted into a vector of the counts of each unique word, and the Euclidean dot product formula below was used to calculate the cosine value of the two vectors that represent the two comments. The cosine similarity score lies between 0 and 1, where 0 indicates that the two comments have no similarity and 1 means that the two comments are identical. We computed the cosine similarity between comments in adjacent levels (levels 0, 1 and levels 1, 2) and between non-adjacent level comments (levels 0, 2).

$$\mathrm{sim}(X, Y) = \cos\theta = \frac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$$
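A minimal sketch of this word-count cosine similarity is given below. It uses only the Python standard library and a small hard-coded stop-word set as a stand-in; the tokenization rule and the names `word_counts` and `cosine_similarity` are assumptions for illustration, not the exact pipeline used in the study.

```python
import math
import re
from collections import Counter

# Small illustrative stop-word set (an assumption); a real run would use a full list.
STOP_WORDS = {"i", "the", "an", "a", "is", "it", "to", "of", "and"}

def word_counts(comment):
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine_similarity(comment_x, comment_y):
    """Cosine of the angle between the two word-count vectors."""
    x, y = word_counts(comment_x), word_counts(comment_y)
    dot = sum(x[w] * y[w] for w in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # undefined for an empty vector; treated as no overlap here
    return dot / (norm_x * norm_y)

same = cosine_similarity("Taxes fund public schools", "Taxes fund public schools")
none = cosine_similarity("Taxes fund public schools", "Cats chase mice")
part = cosine_similarity("Taxes fund public schools", "Public schools need funding")
```

Because word counts are never negative, the score stays between 0 and 1: identical comments score 1, comments with no shared words score 0, and the third pair, which shares two of four words on each side, scores 0.5.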
The occurrences of stop words (e.g., I, the, an, a) in two comments do not indicate that the two are similar in content. Therefore, to avoid their noise in the similarity score, we removed stop words with the Scikit-Learn library before computing the cosine similarity. To compare the semantic similarity scores according to the hypotheses, e.g., the scores of the (level 1, level 2) case versus those of the (level 0, level 1) case for Hypothesis 1, we first tested the normality of the data by performing a kurtosis test on the similarity score sequence of each case. We then performed non-parametric Mann Whitney U tests.
3.3 Hypothesis Testing: Our Results
The test results support all our hypotheses. We plotted a boxplot of the semantic similarity scores between different levels in Fig. 4. We also explain the results more specifically as follows. For the first hypothesis (the similarity score of two higher adjacent levels is equal to or higher than that of two lower adjacent levels, starting from (level 0, level 1)), we examined all the sub-hypotheses, and the results are shown below.
Hypo. 1a: (level 1, level 2)’s similarity score is lower than that of (level 0, level 1). We observed that comments of level 0 and 1 were significantly more similar than those of level 1 and 2 (p < 0.001). The median similarity score between level 0 and 1 was 0.20, while that between level 1 and 2 was 0.18.
Hypo. 1b: (level 3, level 4)’s similarity score is lower than that of (level 2, level 3). We observed that the similarity score between level 3 and 4 comments is less than that of level 2 and 3 comments (Score (2, 3) = 0.18, Score (3, 4) = 0.14, p < 0.001).
Hypo. 1c: (level 5, level 6)’s similarity score is lower than that of (level 4, level 5). We learned that level 4 and 5 comments were significantly more similar than level 5 and 6 comments (Score (4, 5) = 0.18, Score (5, 6) = 0.15, p < 0.001).
Hypo. 1d: (level 6, level 7)’s similarity score is not lower than (level 7, level 8)’s. We observed that comments at level 6 and level 7 were semantically more similar than those at level 7 and level 8 (Score (6, 7) = 0.18, Score (7, 8) = 0.15, p < 0.001).
Hypo. 1e: (level 8, level 9)’s similarity score is not lower than (level 9, level 10)’s. The median similarity score between level 8 and level 9 comments is 0.20, while that between level 9 and 10 comments is 0.16 (p < 0.001).
Hypo. 2: Among the comments of three sequential levels, the similarity score of the two adjacent levels is higher than that of the two non-adjacent ones.
Fig. 4. Semantic similarity measures between different levels
From our Mann Whitney U tests between non-adjacent groups, we learned that this hypothesis is supported in all five groups of level triplets that we considered. For the first group of levels (levels 0, 1, and 2), we learned that even though the two lower adjacent levels (level 1 and level 2) were less similar than the upper adjacent levels (level 0 and level 1), they were more similar than level 0 and level 2 (p < 0.001). A similar observation was made for levels (2, 3, and 4), (4, 5, and 6), (6, 7, and 8), and (8, 9, and 10), with p values all less than 0.001.
The fact that all our hypotheses are supported by the results implies that there is a common reasoning pattern in engaging in the threaded discussions in Reddit r/changemyview. In other words, when a user replies to a comment and starts a thread, the user offers relevant but different information to show a different perspective. As the focus is on presenting a different viewpoint, the thought is not yet mature. Then, the individual who replies to this user tends to focus on explicating that point, e.g., through further elaboration or more examples. Furthermore, the next reply to this individual’s comment introduces another new but relevant aspect to this threaded discussion, and the one after that again focuses on making this new point more mature.
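The comparisons reported above can be illustrated with a small, self-contained Mann Whitney U computation. The sketch below assumes a one-sided alternative (scores in the first group tend to be larger) and uses the normal approximation without tie or continuity correction; the paper does not state these details, and the similarity scores are made-up numbers, not data from the study.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_greater(x, y):
    """One-sided test that scores in x tend to exceed scores in y.
    Returns the U statistic of x and a normal-approximation p-value."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability of z
    return u_x, p

# Hypothetical similarity scores for two adjacent-level pairs.
sim_0_1 = [0.22, 0.25, 0.19, 0.30, 0.21, 0.27, 0.24, 0.26]
sim_1_2 = [0.15, 0.18, 0.12, 0.20, 0.17, 0.14, 0.16, 0.13]
u, p = mann_whitney_greater(sim_0_1, sim_1_2)
```

For real analyses, SciPy’s `scipy.stats.mannwhitneyu` provides the same test with tie and continuity handling; the hand-written version is shown only to make the rank-sum logic explicit.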
4 Discussion
Our finding offers insight into the processes and mechanisms of people’s engagement in a threaded discussion. Engaging in online threaded discussions has become an important part of our daily life, and even of our education [19–21] and our professional careers [22]. While various studies have been conducted to understand the dynamics in threaded discussions [23–26], we have limited knowledge of how people reason in this process: do they offer their own views independently? Do they tend to attack and counterattack as in a structured argumentation style? Do they build on others’ arguments when offering their own? Or do they do something else? A good understanding of people’s engagement styles and reasoning processes in threaded discussions helps us better interpret the process, outcome, and impact of these discussions, which are essential components in distributed collaborations. As a first step towards a comprehensive understanding of people’s reasoning behavior in online threaded discussions, our study contributes to the research literature on online threaded discussions. Our study was conducted in the Reddit r/changemyview discussion forum context, and it is unknown whether and how this general context contributes to the observed engagement style. As our study shows the usefulness of semantic similarity scores for studying this phenomenon, a future task is to compare the semantic similarity scores of threaded comments in other online discussion contexts to explore the generalizability of our finding and/or the influence of the discussion context on people’s engagement style in threaded discussions.
Contributing to online persuasion research, our finding offers empirical input to the recent effort of developing computational techniques to automatically classify an online comment’s perceived persuasiveness. Researchers use this online context in building classifiers to classify the comment’s perceived persuasiveness [5, 6].
Focusing on top-level comments, they have considered various surface-level linguistic cues such as the use of pronouns and punctuation marks, the entry order of the comment, and the commenter’s status. Our finding suggests that, as the thread level of a comment is correlated with its likelihood of receiving a delta point, it can be potentially useful in the task of classifying a threaded comment’s perceived persuasiveness.
The technology support for distributed teamwork and remote collaboration has advanced a lot in the last decades. At the beginning, researchers focused on the development of technologies that enable synchronous and asynchronous communications among team members, such as electronic mail and shared virtual workspaces [27]. Once such backbone support became available, attention turned to better supporting various team activities such as information sharing, knowledge building, and team coordination [28]. Researchers also investigate how collaboration technology may support community building and individual development [29]. In team communication spaces, researchers study the content to explore team dynamics reflected in these discussions, e.g., conflict and conflict management [30]. Threaded discussions are often essential components in distributed collaborative activities. Complex interactions in these discussions, such as persuasion processes and mechanisms, are yet to be explored in the teamwork context. From this aspect, our work, which explores the connection between a comment’s thread level and its perceived persuasiveness and how users’ reasoning processes intertwine in these threaded discussions, offers insights for future remote collaboration research, besides its direct contribution to online persuasion and threaded discussion research. In addition, our hypotheses focused on the information processing and idea development aspect of the threaded discussion. Prior literature has indicated that information processing is only one aspect of the process of evaluating comments and (potentially) being persuaded.
Another future direction is to explore how various contextual factors in team collaboration activities (e.g., members’ roles, collaboration history) play a role in the persuasion processes in these threaded discussions.
Although all six hypotheses are supported, our study has a limitation that should be noted. Specifically, our reasoning as to why certain levels’ comments are more likely to receive delta points has only considered the connection between a comment’s perceived persuasiveness and the amount of information it offers, that is, if a comment offers more information, then it is more likely to receive a delta point. Therefore, we only focused on the semantic similarity measure to probe the influence of the information amount or information entropy. On the other hand, people can be influenced by incoming comments through both the central route and the peripheral route [31]. The central route is related to information processing, which involves careful and logical assessment of the comments including their arguments and information, whereas the peripheral route is about the contextual cues that affect people’s decision making. A user may give a delta point to a comment because of its content or because of contextual factors at the time, e.g., the user’s prior position, the user’s mood, the others’ comments in the discussion, etc. In fact, the observed systematic fluctuation of a comment’s likelihood of receiving a delta point at varying thread levels suggests that the comment’s thread level is an influential contextual cue as well. Therefore, it may not be sufficient to only focus on the influence of central route processing to explain this systematic fluctuation.
5 Conclusion
With rapid advances in collaborative and social computing, web technologies that offer non-traditional communication environments are growing fast. Social engagement and interaction, such as civic engagement and public deliberation, are increasingly taking place through these new and advanced forms of online communication. Reddit CMV discussions have gained increasing research attention in online persuasion studies because of the characteristics of this discussion forum. Yet, few studies have examined whether and how a comment’s thread level plays a role in its perceived persuasiveness. As a pioneering attempt to address this research gap, we analyzed the threaded comments in CMV discussions, focusing on the connection between their thread levels and their likelihood of being perceived as persuasive. Our analysis results show that a comment’s thread level affects this likelihood and that the relationship between the two is not a straight increase or decrease. Rather, there is a systematic fluctuation in the comment’s likelihood of being perceived as persuasive from one thread level to the next. We hypothesized that the first thread comment brings up a new idea or perspective, and the next comment matures it by adding new information to elaborate it; therefore, this next comment is more likely to receive a delta point than the first comment. From the semantic similarity perspective, the first thread comment and its root comment are therefore more similar than the first thread comment and its subsequent comment (as a subsequent comment needs to add new information to it). We also hypothesized that this pattern would continue onto the next comments. To test these hypotheses, we measured the semantic similarities between comments of different levels. Supporting these hypotheses, our results offer insights for understanding the reasoning behavior of online participants in threaded discussions.
References 1. Bail, C.A.: Exposure to opposing views on social media can increase political polarization. Proc. Nat. Acad. Sci. 115(37), 9216–9221 (2018) 2. Xiao, L.: A message’s persuasive features in Wikipedia’s article for deletion discussions. In: Proceedings of the 9th International Conference on Social Media and Society, pp. 345–349 (2018) 3. Hidey, C., McKeown K.R.: Persuasive influence detection: the role of argument sequencing. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI 2018), pp. 5173–5180 (2018) 4. Jo, Y., Poddar, S., Jeon, B., Shen, Q., Rosé, C.P., Neubig G.: Attentive interaction model: modeling changes in view in argumentation. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), vol. 1, pp. 103–116 (2018) 5. Tan, C., Niculae, V., Danescu-Niculescu-Mizil, C., Lee, L.: Winning arguments: interaction dynamics and persuasion strategies in good-faith online discussions. In: Proceedings of the 25th International Conference on World Wide Web (WWW), pp. 613–624 (2016) 6. Xiao, L., Khazaei, T.: Change others’ beliefs online: online comments’ persuasiveness. In: Proceedings of the 10th International Conference on Social Media and Society (2019) 7. Anand, P., et al.: Believe me—we can do this! Annotating persuasive acts in blog text. In: Proceedings of the 10th AAAI Conference on Computational Models of Natural Argument (AAAIWS 2011-10), pp. 11–15 (2011)
812
L. Xiao and H. Mensah
8. Cialdini, R.B.: Influence: The Psychology of Persuasion, pp. 173–174. Collins, New York (2007) 9. Marwell, G., Schmitt, D.R.: Dimensions of compliance-gaining behavior: an empirical analysis. Sociometry 350–364 (1967) 10. Walton, D., Reed, C., Macagno, F.: Argumentation Schemes. Cambridge University Press, Cambridge (2008) 11. Habernal, I., Gurevych, I.: Argumentation mining in user-generated web discourse. Comput. Linguist. 43(1), 125–179 (2017) 12. Hsieh, H.-P., Yan, R., Li, C.-T.: Will i win your favor? Predicting the success of altruistic requests. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J.Z., Wang, R. (eds.) PAKDD 2016. LNCS (LNAI), vol. 9651, pp. 177–188. Springer, Cham (2016). https://doi.org/10.1007/ 978-3-319-31753-3_15 13. Mitra, T., Gilbert, E.: The language that gets people to give: phrases that predict success on kickstarter. In: Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW), pp. 49–61 (2014) 14. Tausczik, Y.R., Pennebaker, J.W.: The psychological meaning of words: LIWC and computerized text analysis methods. J. Lang. Soc. Psychol. 29(1), 24–54 (2010) 15. Iyer, R.R., Sycara, K.P., Li, Y.: Detecting type of persuasion: is there structure in persuasion tactics? In: Proceedings of the 17th Workshop on Computational Models of Natural Argument (CMNA) at International Conference on Artificial Intelligence and Law (ICAIL), pp. 54–64 (2017) 16. Guadagno, R.E., Cialdini, R.B.: Online persuasion: an examination of gender differences in computer-mediated interpersonal influence. Group Dyn. Theory Res. Pract. 6(1), 38 (2002) 17. Price, V., Nir, L., Cappella, J.N.: Normative and informational influences in online political discussions. Commun. Theory 16(1), 47–74 (2006) 18. Khazaei, T., Xiao, L., Mercer, R.: Writing to persuade: analysis and detection of persuasive discourse. In: Proceedings of iConference (2017). http://hdl.handle.net/2142/96673 19. 
Hecking, T., Chounta, I.A., Hoppe, H.U.: Investigating social and semantic user roles in MOOC discussion forums. In: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge (LAK), pp. 198–207 (2016) 20. Fan, Y.C., Wang, T.H., Wang, K.H.: Studying the effectiveness of an online argumentation model for improving undergraduate students’ argumentation ability. J. Comput. Assist. Learn. 36(4), 526–539 (2020) 21. Cole, M.T., Swartz, L.B., Shelley, D.J.: Threaded discussion: the role it plays in e-learning. Int. J. Inf. Communicat. Technol. Educat. (IJICTE) 16(1), 16–29 (2020) 22. Peterson, M.: Teaching the online marketing research course for MBA students. J. Market. Educat. 43, 371–385 (2021). 02734753211001422 23. Himelboim, I., Gleave, E., Smith, M.: Discussion catalysts in online political discussions: content importers and conversation starters. J. Comput. Mediat. Commun. 14(4), 771–789 (2009) 24. Kang, J.H., Kim, J.: Analyzing answers in threaded discussions using a role-based information network. In: 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing, pp. 111–117 (2011) 25. Zhu, C., Rodríguez-Hidalgo, R.C.R.H., Questier, F., Torres-Alfonso, A.M.: Using social network analysis for analysing online threaded discussions. Int. J. Learn. Teach. Educat. Res. 10(3) (2015) 26. Samory, M., Cappelleri, V.M., Peserico, E.: Quotes reveal community structure and interaction dynamics. In: Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), pp. 322–335 (2017)
How Does the Thread Level of a Comment Affect its Perceived Persuasiveness?
813
27. Ishii, H.: TeamWorkStation: towards a seamless shared workspace. In: Proceedings of the 1990 ACM Conference on Computer-Supported Cooperative Work (CSCW), pp. 13–26 (1990) 28. Pinelle, D., Gutwin, C., Greenberg, S.: Task analysis for groupware usability evaluation: modeling shared workspace tasks with the mechanics of collaboration. ACM Trans. Comput. Hum. Interact. (TOCHI) 10(4), 281–311 (2003) 29. Carroll, J.M., Rosson, M.B., Farooq, U., Xiao, L.: Beyond being aware. Inf. Organ. 19(3), 162–185 (2009) 30. Scott, C.P., Wildman, J.L.: Culture, communication, and conflict: a review of the global virtual team literature. Lead. Global Teams 13–32 (2015) 31. Petty, R.E., Cacioppo, J.T.: The elaboration likelihood model of persuasion. In: Communication and Persuasion, pp. 1–24. Springer, New York (1986). https://doi.org/10.1007/978-14612-4964-1_1
Ultra-Low-Power Range Error Mitigation for Ultra-Wideband Precise Localization
Simone Angarano1,2(B), Francesco Salvetti1,2,3, Vittorio Mazzia1,2,3, Giovanni Fantin1,2, Dario Gandini1,2, and Marcello Chiaberge1,2
1 Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy
{simone.angarano,francesco.salvetti,vittorio.mazzia,giovanni.fantin,dario.gandini,marcello.chiaberge}@polito.it
2 PIC4SeR, Politecnico di Torino Interdepartmental Centre for Service Robotics, Turin, Italy
3 SmartData@PoliTo, Big Data and Data Science Laboratory, Turin, Italy
Abstract. Precise and accurate localization in outdoor and indoor environments is a challenging problem that currently constitutes a significant limitation for several practical applications. Ultra-wideband (UWB) localization technology represents a valuable low-cost solution to the problem. However, non-line-of-sight (NLOS) conditions and complexity of the specific radio environment can easily introduce a positive bias in the ranging measurement, resulting in highly inaccurate and unsatisfactory position estimation. In the light of this, we leverage the latest advancement in deep neural network optimization techniques and their implementation on ultra-low-power microcontrollers to introduce an effective range error mitigation solution that provides corrections in either NLOS or LOS conditions with a few mW of power. Our extensive experimentation endorses the advantages and improvements of our low-cost and power-efficient methodology.
Keywords: Deep learning · Edge AI · Ultra-wideband · Indoor positioning
1 Introduction
While the Global Navigation Satellite System (GNSS) is the benchmark solution for outdoor positioning, Ultra-wideband (UWB) real-time locating systems (RTLS) have recently become the state-of-the-art technology for localization in indoor environments [13]. Indeed, with its high signal frequency and very narrow pulses, UWB outperforms all other wireless positioning systems such as WiFi and BLE thanks to its decimeter-level precision and higher resilience to multipath effects [9]. Nevertheless, in a real-world scenario, the complexity of the environment often leads to partial or total obstruction of the signal between the transmitter and the receiver, thus causing a substantial degradation of the positioning performance. The non-line-of-sight (NLOS) condition affects the time-of-arrival (ToA) measurement, introducing a positive bias in the ranging estimation [8]. Moreover, multipath components also strongly influence range estimates, especially in indoor environments where walls and furniture are often made of reflecting materials. Therefore, to achieve better localization accuracy, a mitigation algorithm is needed, and it must be robust and general enough to be effective in a large set of different scenarios.

Fig. 1. Hardware setup for ultra-low-power UWB range error mitigation. The DecaWave EVB1000 board is connected to a power supply and an external microprocessor (Arduino Nano 33 BLE Sense) that locally runs a highly optimized and power-efficient deep neural network for range error mitigation. We also show our custom board designed for the DWM1001C module for size comparison. Future works will fully integrate our methodology on our custom board, providing a compact solution for precise localization.

Most NLOS identification and mitigation methodologies proposed in the literature are based on channel impulse response (CIR) statistics [2], likelihood ratio or binary hypothesis tests [10], and machine learning techniques. As regards the latter, several techniques have been investigated, such as representation learning models [12], support vector machines (SVM) [16], and Gaussian processes (GP) [15]. Regardless of the chosen methodology, the resulting mitigation algorithm must require a low computational effort to be usable in a real-world use case and thus move beyond a pure research scenario. Indeed, most applications, like robotic indoor navigation and person or object tracking, typically make use of single-board computers with limited computational capabilities and stringent power consumption requirements.

In this research project, we propose a highly optimized deep learning model for range error mitigation that requires only a few mW to compensate both NLOS and LOS signals. The proposed methodology can run at high frequency on ultra-low-power microcontrollers, enabling the design of small and low-power devices for precise indoor localization. The complete setup used in our experimentation is presented in Fig. 1.

The main contributions of this paper are the following.
– Introduce UWB range error mitigation for ultra-low-power microcontrollers with deep learning at the edge.
– Modify and highly optimize a deep learning model, leveraging the latest weight quantization and graph optimization techniques for power and latency reduction.
– Evaluate the real-time performance of the resulting network, measuring latency, power, and energy usage with different CIR sizes.

The rest of the paper is organized as follows. Section 2 presents the proposed methodology with a detailed explanation of the network and the adopted optimization techniques for power consumption and latency reduction. Section 3 presents the experimental results and discussion after briefly describing the DeepUWB dataset used for the tests. Finally, Sect. 4 summarizes the main achievements of the work and proposes future research developments.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 814–824, 2022. https://doi.org/10.1007/978-3-031-10464-0_56
2 Methodology
In this section, we present the proposed methodology for the embedded implementation of a UWB error mitigation algorithm on an ultra-low-power microcontroller. We present details on the deep neural network design and the techniques adopted to optimize it, quantize all its parameters to 8-bit integers, and deploy the final model on the target board. We model the mitigation process as presented by Angarano et al. [1]:
$$\hat{d} = d + \Delta d \tag{1}$$
where the goal is to predict an estimate of the error Δd on the UWB range measurement in order to compensate the observed quantity $\hat{d}$ and recover the true distance d. Therefore, we adopt a DNN model that predicts an estimate $\hat{y}$ of the true latent error y = Δd as a non-linear function of the input CIR vector X, measured by the UWB sensor. We denote with K the number of temporal samples of the CIR. We develop our methodology so that the dimension K can be changed to analyze its effects on the computational efficiency and the accuracy of the algorithm.
2.1 Network Design
The original design of the Range Error Mitigation Network (REMNet) presented in [1] is adapted to be executed in real-time on a low-power microcontroller. The neural network should be fully quantized to perform all the operations with 8-bit integers and meet the real-time constraints. So, in order to overcome software and hardware limitations of standard low-power microcontroller solutions, we modify the original REMNet architecture, removing all self-attention blocks that boost the accuracy performance but compromise compatibility.
Fig. 2. Range Error Mitigation Network (REMNet) architecture modified to ensure compatibility with the target embedded board. The input of the model is the K × 1 tensor representing the CIR of the measurement. Subsequent N residual reduction modules progressively reduce the original dimension K. Finally, a fully connected layer composes the high-level extracted features of dimension K/2^N · F and outputs the range error estimation.
The overall modified architecture of the REMNet model is shown in Fig. 2. Starting from the input tensor CIR X of size K × 1, we extract low-level features with a first 1D convolution operation with a kernel of dimension k0. The core of REMNet is the residual reduction module (RRM). Firstly, the residual is computed with respect to a 1D convolution of kernel kn; then, a reduction block decreases the temporal dimension K with a 1D convolution with a stride of 2. The reduction block again has a residual connection characterized by a 1D convolution with a kernel k of dimension 1 and stride 2 to match the temporal dimension. Overall, each RRM block computes the following non-linear mapping function:
$$\mathrm{RRM}(X) = \mathrm{Red}(\mathrm{Conv1D}(X) + X) \tag{2}$$
where
$$\mathrm{Red}(X) = \mathrm{Conv1D}_{s=2}(X) + \mathrm{Conv1D}_{k=1;\,s=2}(X) \tag{3}$$
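As an illustration of Eqs. (2) and (3), the following is a minimal pure-Python sketch of one residual reduction module, not the authors' implementation. The helper names (`conv1d`, `init_conv`, `rrm`), the random weights, the toy sizes, and the exact placement of the ReLU activations are assumptions made for this example only.

```python
import random

def conv1d(x, weights, bias, stride=1):
    """1D convolution with zero padding.

    x:       list of K time steps, each a list of C_in channel values
    weights: weights[i][c_in][c_out] for a kernel of odd size
    bias:    list of C_out values
    Returns ceil(K / stride) time steps with C_out channels each.
    """
    k, c_in, c_out = len(weights), len(weights[0]), len(bias)
    pad = k // 2
    zeros = [0.0] * c_in
    padded = [zeros] * pad + x + [zeros] * pad
    out = []
    for start in range(0, len(x), stride):
        step = list(bias)
        for i in range(k):
            for ci in range(c_in):
                v = padded[start + i][ci]
                for co in range(c_out):
                    step[co] += v * weights[i][ci][co]
        out.append(step)
    return out

def relu(x):
    return [[max(0.0, v) for v in step] for step in x]

def add(a, b):
    return [[p + q for p, q in zip(sa, sb)] for sa, sb in zip(a, b)]

def init_conv(k, c_in, c_out, rng):
    """Random kernel and zero bias (illustrative initialisation only)."""
    w = [[[rng.uniform(-0.1, 0.1) for _ in range(c_out)]
          for _ in range(c_in)] for _ in range(k)]
    return w, [0.0] * c_out

def rrm(x, p):
    """Residual reduction module: Red(Conv1D(X) + X), Eqs. (2)-(3)."""
    h = add(relu(conv1d(x, *p["conv"])), x)        # Conv1D(X) + X
    main = relu(conv1d(h, *p["red"], stride=2))    # Conv1D with stride 2
    skip = conv1d(h, *p["skip"], stride=2)         # kernel 1, stride 2
    return add(main, skip)

rng = random.Random(0)
K, F, N = 16, 4, 3  # toy sizes, smaller than those used in the paper
x = [[rng.uniform(-1, 1) for _ in range(F)] for _ in range(K)]
blocks = [{"conv": init_conv(3, F, F, rng),
           "red": init_conv(3, F, F, rng),
           "skip": init_conv(1, F, F, rng)} for _ in range(N)]
for p in blocks:
    x = rrm(x, p)
out_shape = (len(x), len(x[0]))
print(out_shape)
```

Each module halves the temporal dimension, so three modules map the toy input of 16 steps to 2 steps while keeping the F channels unchanged.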
The network is characterized by a stack of N RRMs, all with ReLU as the non-linear activation function [7]. After N RRM blocks, we obtain a tensor with shape K/2^N × F. We perform a flattening operation to feed a regression head composed of dropout and a fully connected layer that predicts the final estimate of the compensation value Δd. We denote with F the number of features of each convolutional operation. We always use zero padding and the same value kn for each convolutional kernel.

Fig. 3. LoS and NLoS samples from the DeepUWB dataset, with normalized amplitude. In the NLoS case, the signal travels along many routes until it reaches the antenna. That makes the ToA estimation ambiguous, introducing a positive bias in the ranging measurement.

2.2 Network Optimization and Quantization Techniques
To achieve the goal of a real-time implementation, the range error mitigation technique must respect constraints on memory, power, and onboard latency. We study different graph optimization and quantization methods to reduce computational cost without compromising performance. Several techniques have been developed in the past few years to increase model efficiency [4], from which the following methods are chosen. First, network pruning and layer fusing are applied to remove nodes and operations that give almost no contribution to the output. Moreover, the number of bits used to represent network parameters and activation functions is reduced by quantizing the float32 values to int8 ones. Combining these strategies strongly increases efficiency with minimal impact on performance. Graph optimization is first applied to the model trained in plain float32 without quantization to investigate its effects on accuracy and dimension. Finally, a third version of the network is obtained by quantizing weights, activations, and math operations through scale and zero-point parameters. We follow the methodology presented by Jacob et al. [4], in which each weight and activation is quantized according to the following equation:
$$r = S(q - Z) \tag{4}$$
where r is the original floating-point value, q the quantized integer value, and S and Z are the quantization parameters (scale and zero point, respectively).
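To make Eq. (4) concrete, the sketch below quantizes real values to int8 and maps them back. It is only an illustration under assumptions: the function names, the way S and Z are derived from a value range, and the example range are hypothetical and not taken from the paper or from any specific library.

```python
def quant_params(r_min, r_max, q_min=-128, q_max=127):
    """Scale S and zero point Z that map [r_min, r_max] onto the int8 range."""
    # Widen the range so that 0.0 is exactly representable (useful for zero padding).
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = int(round(q_min - r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    """Nearest integer q such that r is approximately S * (q - Z), clipped to int8."""
    q = int(round(r / scale)) + zero_point
    return max(q_min, min(q_max, q))

def dequantize(q, scale, zero_point):
    """Eq. (4): r = S * (q - Z)."""
    return scale * (q - zero_point)

# Hypothetical weight range, chosen only for illustration.
S, Z = quant_params(-0.5, 1.5)
samples = [-0.5, -0.123, 0.0, 0.4, 1.5]
recovered = [dequantize(quantize(r, S, Z), S, Z) for r in samples]
max_error = max(abs(a - b) for a, b in zip(samples, recovered))
print(S, Z, max_error)
```

For values inside the calibrated range, the round-trip error stays within half a quantization step, S/2, which is why the precision loss reported after int8 quantization can remain small.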
Fig. 4. Principal component analysis representation of DeepUWB benchmark dataset, showing different spatial configurations for different rooms and materials. The original CIR dimensions are projected into a three-dimensional space. The first row shows the data point projection divided into the five considered environments. On the other hand, the second row highlights the effects of materials on signal propagation. It is clear how different molecular structures affect the signal in different ways.
A fixed-point multiplication approach is adopted to cope with the non-integer scale of S. This strategy drastically reduces memory and computational demands due to the high efficiency of integer computations on microcontrollers. The final step is to convert and import the quantized model into the embedded application system. As most microcontrollers do not have the resources to run a filesystem, we provide the network in a C source file that can be included in the program binary and loaded directly into memory, as suggested by [14]. All the results obtained with the models at different quantization steps are presented in Sect. 3.
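The fixed-point handling of a non-integer scale can be sketched as follows. In the scheme of Jacob et al. [4], a real multiplier smaller than one is represented by a 32-bit integer mantissa and a right shift; the code below is a simplified illustration of that idea, and the function names and example multiplier are assumptions rather than the code deployed on the board.

```python
import math

def quantize_multiplier(m):
    """Split a real multiplier 0 < m < 1 into (m0, shift) with m ~ m0 * 2**-(31 + shift)."""
    mantissa, exponent = math.frexp(m)   # m = mantissa * 2**exponent, mantissa in [0.5, 1)
    m0 = int(round(mantissa * (1 << 31)))
    if m0 == (1 << 31):                  # rounding pushed the mantissa up to 1.0
        m0 //= 2
        exponent += 1
    return m0, -exponent

def fixed_point_multiply(acc, m0, shift):
    """Integer-only approximation of acc * m, rounded to the nearest integer."""
    total_shift = 31 + shift
    rounding = 1 << (total_shift - 1)
    return (acc * m0 + rounding) >> total_shift

# Hypothetical rescaling factor, e.g. a ratio of input, weight, and output scales.
m0, shift = quantize_multiplier(0.0072)
rescaled = [fixed_point_multiply(acc, m0, shift) for acc in (-20000, 0, 999, 12345)]
print(m0, shift, rescaled)
```

Only integer multiplications, additions, and shifts are needed at inference time, which matches the integer arithmetic that microcontrollers execute efficiently.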
3 Experiments and Results
In this section, we perform an experimental evaluation of different optimized versions of REMNet. Moreover, we test the accuracy and performance of the network on a low-cost microcontroller-based development board, reporting inference speed and power consumption.
3.1 The DeepUWB Dataset
In the following experiments, we employ the indoor samples of the DeepUWB dataset presented in [1] and publicly available on Zenodo¹. The data is obtained using DecaWave EVB1000 transmitters and taking several LOS and NLOS measurements in different indoor and outdoor environments in the presence of various types of obstacles. Figure 3 presents a comparison between LoS and NLoS samples from the dataset. Range estimates taken in NLoS conditions are typically positively biased [8]. For each of the 55,000 measures, both the ground-truth distance and the one given by the UWB boards are included, as well as the environment scenario, the obstacle materials, and the CIR vector used as input for REMNet. Three differently sized rooms are selected for indoor measurements to cover various office-like situations: a large one (10 m × 5 m), a medium one (5 m × 5 m), and a small one (5 m × 3.5 m). Regarding obstacles, various typical objects for an indoor scenario are used to cover a wide range of materials, including plastic, glass, metal, and wood. Figure 4 shows the result of Principal Component Analysis (PCA) on DeepUWB: it is noticeable that the three indoor scenarios occupy close areas in the 3D space, very distinct from outdoor and through-the-wall measurements. That is due to the presence of strong multipath components. Measurements taken in the presence of different materials tend to occupy different regions, with heavier and more screening ones, such as aluminum, being highly concentrated and distant from materials like plastic and wood.
¹ http://doi.org/10.5281/zenodo.4290069.
3.2 Experimental Setting
We set aside the medium-sized room measurements for testing, as our primary goal is to evaluate the effectiveness of REMNet in compensating the error for general indoor scenarios. In total, 36023 and 13210 training and testing data points are used, respectively. LoS and NLoS samples are kept together in the sets to evaluate the performance of the network in both cases. Real-time range mitigation with the whole CIR vector could be very computationally intensive [17]. For this reason, a study is conducted on the number of samples necessary to have an acceptable error correction. Then, the TensorFlow Lite² framework is used to perform graph optimization and to quantize weights, activations, and math operations. The final test measures the inference frequency of the model deployed on an Arduino Nano 33 BLE Sense³, alongside its power usage. In order to select the optimal number of input features, we conduct a grid search study on the number of CIR temporal samples. We progressively reduce the input dimension K from 157, suggested in [3], to 8. The Mean Absolute Error (MAE) is used as the loss function and metric. A box plot of model performance with the different CIR input sizes is shown in Fig. 5. The network hyperparameters are obtained with an initial random search followed by a grid search exploration to fine-tune them and balance accuracy and efficiency. We use N = 3 residual reduction modules with kernel dimensions k0 = 5 and kn = 3, and F = 16 filters.
² https://www.tensorflow.org/lite.
³ https://store.arduino.cc/arduino-nano-33-ble-sense.
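As a small worked example of the metric, the sketch below computes the MAE of the range error before and after subtracting a predicted correction, following Eq. (1). The numbers are hypothetical toy values in metres and are not taken from DeepUWB; the function name is likewise an assumption.

```python
def mean_absolute_error(y_true, y_pred):
    """MAE = (1/n) * sum(|y_i - y_hat_i|)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy values in metres (not from the DeepUWB dataset).
true_error = [0.05, 0.30, -0.02, 0.18]       # measured range minus true distance
predicted_error = [0.03, 0.22, 0.01, 0.15]   # network estimates of the error

# Without mitigation the correction is zero, so the MAE is the mean |error|.
raw_mae = mean_absolute_error(true_error, [0.0] * len(true_error))
# With mitigation the residual error is the true error minus the prediction.
mitigated_mae = mean_absolute_error(true_error, predicted_error)
print(raw_mae, mitigated_mae)
```

The same quantity serves as both training loss and evaluation metric, so a lower mitigated MAE directly means a smaller residual ranging error.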
Fig. 5. REMNet performance (mitigated ranging error) with different CIR input size dimensions. For each test, we report LoS and NLoS MAE as well as overall standard deviation (σ). It is clear how a reduced number of input features degrades the performance of the model. Moreover, an input with eight dimensions appears to be the minimum amount of information required to obtain an acceptable range error estimation.

Table 1. Model performance for different CIR lengths before and after applying optimizations. The results for a multilayer perceptron are included as a reference.

| Model  | CIR | Params | MAE [m] | MAE_GO [m] | MAE_INT8 [m] |
|--------|-----|--------|---------|------------|--------------|
| REMNet | 157 | 5905   | 0.0687  | 0.0687     | 0.0690       |
| REMNet | 128 | 5841   | 0.0702  | 0.0702     | 0.0698       |
| REMNet | 64  | 5713   | 0.0704  | 0.0704     | 0.0701       |
| REMNet | 32  | 5649   | 0.0710  | 0.0710     | 0.0713       |
| REMNet | 16  | 5617   | 0.0712  | 0.0712     | 0.0714       |
| MLP    | 157 | 54401  | 0.0769  | 0.0777     | 0.0775       |
Finally, all experiments adopt Adam as the optimization algorithm [6] with momentum parameters β1 = 0.9, β2 = 0.999, and ε = 10^−8. The optimal learning rate λ = 3e−4 is experimentally derived using the methodology described in [11] and kept constant for 30 epochs, with a batch size of 32. We employ the TensorFlow framework⁴ to train the network on a PC with 32 GB RAM and an Nvidia 2080 Super GP-GPU. The overall training process can be performed in less than 5 min.
⁴ https://www.tensorflow.org/.
3.3 Quantitative Results
The medium room data samples, used as the test set, have a starting MAE of 0.1242 m and a standard deviation of σ = 0.1642 m. The results obtained by the trained reference architectures and their degradation, as optimizations are applied, are shown in Table 1. Each model has been tested five times with different random seeds to obtain statistically significant results. Performances prove the effectiveness of the model, as the MAE of the medium room samples is reduced by 45.7% using the reference model. Consequently, the final error of 0.0687 m is comparable to the actual LoS precision of EVB1000 boards [5]. Lastly, REMNet outperforms a Multilayer Perceptron (MLP) while using around 10% of its parameters.

As regards model optimization, columns CIR and Params report the number of input samples used for the mitigation and the total model parameters, respectively. Moreover, the resulting MAE is reported for all three model configurations: reference, graph optimization (MAE_GO), and full 8-bit integer quantization (MAE_INT8). The results show that the effect of graph optimization is null for REMNet, while the MLP performance slightly deteriorates. Integer quantization, instead, minimally increases the resulting MAE for all the models. Finally, our experimentation confirms that input dimensions below 128 tend to degrade the network's accuracy almost linearly.

Despite the insignificant effect of graph optimization on performance, memory occupancy greatly shrinks. Our results show a reduction of about 90% for REMNet, while the MLP only halves its memory footprint due to its higher number of parameters. In addition, quantization allows a further reduction of REMNet memory requirements of an additional 30%, confirming the great benefit of using both optimization and quantization techniques provided by the TensorFlow Lite converter. The MLP reduces another 50% of its memory footprint with full integer quantization, reaching a final size of about three times REMNet. Therefore, the proposed model can outperform the baseline both in error mitigation capability and memory requirements. All the results on the memory footprint of the models under examination are presented in Table 2.

Table 2. Model dimensions for different CIR lengths before and after applying optimizations. The results for a Multilayer Perceptron (MLP) are included as a reference.

| Model  | CIR | Params | Dim [kB] | Dim_GO [kB] | Dim_INT8 [kB] |
|--------|-----|--------|----------|-------------|---------------|
| REMNet | 157 | 5905   | 317.321  | 32.988      | 23.088        |
| REMNet | 128 | 5841   | 317.321  | 32.732      | 23.024        |
| REMNet | 64  | 5713   | 317.157  | 32.220      | 22.896        |
| REMNet | 32  | 5649   | 317.052  | 31.964      | 22.832        |
| REMNet | 16  | 5617   | 317.499  | 31.836      | 22.800        |
| MLP    | 157 | 54401  | 216.047  | 125.885     | 60.320        |

Finally, the inference speed and the power consumed by the Arduino board for each considered model are analyzed and presented in Table 3. The frequencies, denoted as fm, have been measured as the reciprocal of the maximum inference time over a series of tests. In all cases, they can be considered compatible with real-time applications. In particular, the MLP requires less computational effort despite its larger number of parameters because it involves simpler math operations than REMNet. Moreover, reducing CIR length results in an almost linear increase in inference speed. To assess power consumption, instead, we measured the absorbed current Iabs with a voltage supply Vcc = 3.3 V. Results show that, as the microcontroller is constantly processing data, the power usage can be considered constant for all the cases. However, since inference speed significantly changes with model complexity, we computed the energy required for a single inference step Einf by dividing the consumed power Pabs by the frequency fm. It is noticeable that very few mJ are sufficient to run range mitigation, reaching values under 1 mJ. That proves that the already efficient design of the proposed model, in conjunction with 8-bit weight precision and graph optimization techniques, makes deep learning a feasible solution for effective ultra-low-power UWB range error mitigation.

Table 3. Real-time performance for different optimized models, including inference frequency, consumed power, and network energy usage.

| Model  | CIR | fm [Hz] | Vcc [V] | Iabs [mA] | Pabs [mW] | Einf [mJ] |
|--------|-----|---------|---------|-----------|-----------|-----------|
| REMNet | 157 | 17.2    | 3.3     | 16.2      | 53.4      | 3.1       |
| REMNet | 128 | 21.2    | 3.3     | 16.0      | 52.8      | 2.5       |
| REMNet | 64  | 41.0    | 3.3     | 15.8      | 52.2      | 1.3       |
| REMNet | 32  | 77.8    | 3.3     | 15.6      | 51.6      | 0.66      |
| REMNet | 16  | 140.0   | 3.3     | 15.6      | 51.6      | 0.37      |
| MLP    | 157 | 184.1   | 3.3     | 16.2      | 53.4      | 0.29      |
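The energy figures can be reproduced from the reported power and frequency values. The short sketch below recomputes the energy per inference from the REMNet rows of Table 3; the variable and function names are chosen for this example only.

```python
# (CIR length, f_m [Hz], P_abs [mW]) for REMNet, copied from Table 3.
table3 = [
    (157, 17.2, 53.4),
    (128, 21.2, 52.8),
    (64, 41.0, 52.2),
    (32, 77.8, 51.6),
    (16, 140.0, 51.6),
]

def energy_per_inference_mj(power_mw, frequency_hz):
    """E_inf = P_abs / f_m; milliwatts divided by hertz gives millijoules."""
    return power_mw / frequency_hz

energies = {cir: energy_per_inference_mj(p, f) for cir, f, p in table3}
for cir, e in energies.items():
    print(f"CIR {cir:3d}: {e:.2f} mJ per inference")
```

Because the absorbed power is nearly constant, the energy per inference scales almost inversely with the inference frequency, which is why shorter CIR inputs fall below 1 mJ.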
4 Conclusions
In this paper, we introduced UWB range error mitigation for ultra-low-power microcontrollers. Our effective and power-efficient methodology builds on the latest advances in deep learning and neural network optimization techniques to provide precise localization in NLOS and LOS conditions with a few mW of power. We proposed a modified version of REMNet, a lightweight model designed explicitly for range error mitigation on ultra-low-power AI edge devices. Our extensive experimentation shows that the proposed system successfully runs at a high frequency on microcontrollers and provides enhanced localization in indoor environments. Future works will integrate the proposed methodology on a compact custom board designed around the DWM1001C DecaWave module to provide a compact solution for precise localization.
Acknowledgments. This work is partially supported by the Italian government via the NG-UWB project (MIUR PRIN 2017) and developed with the contribution of the Politecnico di Torino Interdepartmental Center for Service Robotics PIC4SeR (https://pic4ser.polito.it) and SmartData@Polito (https://smartdata.polito.it).
References
1. Angarano, S., Mazzia, V., Salvetti, F., Fantin, G., Chiaberge, M.: Robust ultra-wideband range error mitigation with deep learning at the edge. Eng. Appl. Artif. Intell. 102, 104278 (2021)
2. Barral, V., Escudero, C.J., García-Naya, J.A.: NLOS classification based on RSS and ranging statistics obtained from low-cost UWB devices. In: 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5. IEEE (2019)
3. Bregar, K., Mohorčič, M.: Improving indoor localization using convolutional neural networks on computationally restricted devices. IEEE Access 6, 17429–17441 (2018)
4. Jacob, B., et al.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713 (2018)
5. Jiménez, A.R., Seco, F.: Comparing Decawave and Bespoon UWB location systems: indoor/outdoor performance analysis. In: 2016 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pp. 1–8. IEEE (2016)
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
7. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML (2010)
8. Otim, T., Díez, L.E., Bahillo, A., Lopez-Iturri, P., Falcone, F.: Effects of the body wearable sensor position on the UWB localization accuracy. Electronics 8(11), 1351 (2019)
9. Schmid, L., Salido-Monzú, D., Wieser, A.: Accuracy assessment and learned error mitigation of UWB ToF ranging. In: 2019 International Conference on Indoor Positioning and Indoor Navigation (IPIN), pp. 1–8. IEEE (2019)
10. Silva, B., Hancke, G.P.: IR-UWB-based non-line-of-sight identification in harsh environments: principles and challenges. IEEE Trans. Ind. Inf. 12(3), 1188–1195 (2016)
11. Smith, L.N.: Cyclical learning rates for training neural networks. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 464–472. IEEE (2017)
12. Stahlke, M., Kram, S., Mutschler, C., Mahr, T.: NLOS detection using UWB channel impulse responses and convolutional neural networks. In: 2020 International Conference on Localization and GNSS (ICL-GNSS), pp. 1–6. IEEE (2020)
13. Tiwari, P., Malik, P.K.: Design of UWB antenna for the 5G mobile communication applications: a review. In: 2020 International Conference on Computation, Automation and Knowledge Management (ICCAKM), pp. 24–30. IEEE (2020)
14. Warden, P., Situnayake, D.: TinyML. O'Reilly Media Inc. (2019)
15. Xiao, Z., Wen, H., Markham, A., Trigoni, N., Blunsom, P., Frolik, J.: Non-line-of-sight identification and mitigation using received signal strength. IEEE Trans. Wirel. Commun. 14(3), 1689–1702 (2014)
16. Ying, R., Jiang, T., Xing, Z.: Classification of transmission environment in UWB communication using a support vector machine. In: 2012 IEEE Globecom Workshops, pp. 1389–1393. IEEE (2012)
17. Zeng, Z., Liu, S., Wang, L.: NLOS identification for UWB based on channel impulse response. In: 2018 12th International Conference on Signal Processing and Communication Systems (ICSPCS), pp. 1–6. IEEE (2018)
The Pareto-Frontier-Based Stiffness of a Controller: Trade-off Between Trajectory Plan and Controller Design
Zhe Shen(B) and Takeshi Tsuchiya
The University of Tokyo, Tokyo, Japan
[email protected]
Abstract. Steering a UAV to a target comprises a trajectory plan and a controller design (a control-after-plan problem). Typically, the optimal trajectory (reference) is calculated first and then tracked with a suitable controller. The trajectory planning process usually assumes that the quadrotor will follow the designed trajectory. However, the dynamic state error, typically caused by a mismatched feed-forward signal, spoils this assumption and leads to an unwanted sacrifice in the objective function defined during trajectory planning. To solve this problem, the unavoidable dynamic state error is taken into account in the trajectory planning process, assuming that an LQR without feed-forward is applied in the subsequent control-after-plan problem; the Copenhagen Limit provides an analytical estimate of the dynamic state error during trajectory planning. The trade-off results are provided as multiobjective Pareto front solutions and the mapped 'pseudo' Pareto fronts. We also explore the relationship between the controller and the corresponding 'pseudo' Pareto fronts.
Keywords: Tracking control · Multiobjective optimization · Pareto optimality
1 Introduction Higher demands on UAV flying tasks inspire researchers to solve the multiobjective optimization trajectory design problems. Typical objectives comprise least fuel consumption [1], shortest flying time [2], safety concern [3], etc. The number of the objectives in multiobjective trajectory/path optimization can be double [2, 4], triple [5, 6], or even more [7]. A typical multiple-objective control problem with the analytical solution is the tradeoff between dynamic state error and the control effort [8]. ∞ x(t)T R1 x(t) + u(t)T R2 u(t) dt J = 0
Here, the dynamic state error can be expressed analytically because the reference is a setpoint; together with the infinite time horizon, this means that an analytical solution exists. Unfortunately, even a slight change to this problem leaves only numerical results available [9].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 825–844, 2022. https://doi.org/10.1007/978-3-031-10464-0_57
For a moving reference, the dynamic state error in a general control problem loses its analytical form when no feed-forward signal is used. The feed-forward signal is prohibited here because several numerical solvers introduce a mismatch between the desired feed-forward signal and the desired trajectory. Representative efforts to reduce this mismatch can be found in [10, 11]. Since using this feed-forward introduces dynamic state error, we discard it from the result of the trajectory design problem; the only output retained from the trajectory design process is the time-specified trajectory itself.

A formula for calculating or estimating the resulting dynamic state error in the subsequent tracking control problem is therefore needed. The author of [12] proposed an analytical solution for the dynamic state error in limit form. It estimates the dynamic state error of certain LQR-controlled systems [20–22]. In this research, the estimation of the dynamic state error in trajectory design is based on [12].

To solve the multiobjective optimization problem, the weighted sum method [13] is used to generate the Pareto frontier. The compromise optimum usually lies near the elbow region [14]. Our resulting ‘pseudo Pareto frontier’ shows an atypical property similar to the one in Schaffer’s function [15]. We describe this result with a physical model and analyze its relationship with the chosen controller parameters.

Quadrotor dynamics have been widely analyzed and simulated in both two and three dimensions [23, 24]. We carry out the simulation in two dimensions for convenience.

This paper is organized into seven sections. Section 2 describes the system and the task allocation. Section 3 explains the trajectory trade-off scheme. The simulation setup is detailed in Sect. 4. The results are presented in Sect. 5 and modeled in Sect. 6. Finally, Sect. 7 gives conclusions and further discussion.
2 Dynamics and Task Allocation

2.1 Planar UAV Dynamics

Sketches of a quadrotor [24] and of the planar UAV are shown in Figs. 1 and 2, respectively.
Fig. 1. The quadrotor model [24].
Fig. 2. The planar drone model.
A planar quadrotor is a 2-D quadrotor whose motion is restricted to a vertical plane. Gravity acts along the negative y-direction. Air resistance and noise are neglected. The parameters in Eq. (1)–(3) are the same as those in [12].

$$M = 0.54\ \text{kg} \tag{1}$$

$$d = 0.12164\ \text{m} \tag{2}$$

$$g = 9.81\ \text{m/s}^2 \tag{3}$$
M is the quadrotor’s mass, which is attributed to the two rotors. d is the length of each arm. g is the gravitational acceleration. In the following, $u_1$ and $u_2$ denote the thrusts provided by the left rotor and the right rotor, respectively. The position vector of the quadrotor’s geometric center, [x y], is expressed in the earth frame. The attitude is represented by the angle q, measured from the earth frame to the UAV’s body-fixed frame. Based on these, the state vector is defined in Eq. (4).

$$r = \begin{bmatrix} x \\ y \\ q \end{bmatrix} \tag{4}$$

The dynamics follow from the Newton–Euler equations, Eq. (5)–(6).

$$F = m \cdot a \tag{5}$$

$$\tau = I \cdot \alpha \tag{6}$$

The resulting dynamic equations are Eq. (7)–(9).

$$\ddot{x} = -\frac{1}{M} \cdot \sin(q) \cdot (u_1 + u_2) \tag{7}$$

$$\ddot{y} = \frac{1}{M} \cdot (u_1 + u_2) \cdot \cos(q) - g \tag{8}$$

$$\ddot{q} = \frac{1}{M \cdot d} \cdot (u_2 - u_1) \tag{9}$$

Linearizing the UAV model about the hovering state simplifies both the dynamics and the control. The region in which a linearization-based controller guarantees stability is analyzed in [16]. Linearizing Eq. (8) yields Eq. (10).

$$\ddot{y} = \frac{1}{M} \cdot (u_1 + u_2) - g \tag{10}$$
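As a quick check of Eq. (7)–(10), the following minimal Python sketch evaluates the planar dynamics for given rotor thrusts. The function names and the hover test value are illustrative assumptions and are not part of the original simulator, which is built in SIMULINK.

```python
import math

M = 0.54      # mass in kg, Eq. (1)
D = 0.12164   # arm length in m, Eq. (2)
G = 9.81      # gravitational acceleration in m/s^2, Eq. (3)


def planar_accelerations(q, u1, u2):
    """Return (x_ddot, y_ddot, q_ddot) from Eqs. (7)-(9)."""
    total = u1 + u2
    x_ddot = -math.sin(q) * total / M
    y_ddot = math.cos(q) * total / M - G
    q_ddot = (u2 - u1) / (M * D)
    return x_ddot, y_ddot, q_ddot


def linearized_y_ddot(u1, u2):
    """Hover linearization of Eq. (8), i.e. Eq. (10)."""
    return (u1 + u2) / M - G


# Hypothetical test input: at hover each rotor carries half the weight.
hover = M * G / 2.0
print(planar_accelerations(0.0, hover, hover))
```

With level attitude and hover thrust, all three accelerations vanish up to rounding, which is the operating point around which Eq. (10) is obtained.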
2.2 Task Allocation

The quadrotor performs a vertical take-off (VTO) subject to several requirements. Such tasks are typically treated as a control-after-plan problem: a trajectory is designed first and then tracked using a ‘well-designed’ controller. A well-designed controller usually includes the feed-forward signal from the trajectory design, since a matched feed-forward signal helps tracking. However, for the reasons given in Sect. 1, the feed-forward signal is prohibited here. Instead, a linear quadratic regulator (LQR) without feed-forward is used for altitude tracking.

In the trajectory design process, the trajectory must meet the following requirements:

1. The initial altitude, the initial velocity, and the initial acceleration are all 0.
2. The final altitude is 5 m. The final velocity and the final acceleration are not restricted.
3. The entire flight completes within 1 s.
4. The objective function in Eq. (11) is minimized.

$$\min \int_0^1 \left( \ddot{x}^2 + \ddot{y}^2 + \ddot{q}^2 \right) dt \tag{11}$$

5. Because the LQR used later for tracking is fixed in advance, the dynamic state error can be estimated during trajectory planning. This process is detailed in Sect. 3.

In the control process, the LQR tracks the designed trajectory. It is designed to the following specifications:

1. There is no steady-state error; that is, the quadrotor reaches the desired setpoint if sufficient time is given. However, the actual position at 1 s may differ from the desired setpoint, since the control process ends at 1 s rather than at infinite time.
2. The desired time-specified trajectory is the only result that may be used as the reference. The feed-forward signal is not allowed in the altitude control.
3. The controller parameters are fixed before the trajectory design process.
Note that, without the feed-forward signal, the actual trajectory cannot exactly follow the desired trajectory from the trajectory design process; the dynamic state error is unavoidable because it is the source of the control signal. Our ultimate goal is to find the actual trajectory with the lowest value of the cost function in Eq. (11). The actual cost is computed from the actual acceleration during tracking.
3 Controller and Trajectory Trade-off Scheme

3.1 Altitude Controller

Altitude tracking uses an LQR similar to the one in [12], detailed in Eq. (12) and (13).

$$\begin{bmatrix} \dot{y} \\ \ddot{y} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} y \\ \dot{y} \end{bmatrix} + \begin{bmatrix} 0 \\ \frac{1}{M} \end{bmatrix} (u_1 + u_2) + \begin{bmatrix} 0 \\ -g \end{bmatrix} \tag{12}$$

$$u_1 + u_2 = -\begin{bmatrix} k_1 & k_2 \end{bmatrix} \begin{bmatrix} y \\ \dot{y} \end{bmatrix} + \begin{bmatrix} N_1 & N_2 \end{bmatrix} \begin{bmatrix} \text{reference} \\ 0 \end{bmatrix} + Mg \tag{13}$$

$k_1$ and $k_2$ are the control constants; $N_1$ and $N_2$ are the offset constants; reference is the position reference from the trajectory design process.

Theorem: For a stable LQR-controlled system, if $k_1$, $k_2$, $N_1$, and $N_2$ in Eq. (13) satisfy

$$k_1 = N_1 \tag{14}$$

then the steady-state error is 0.

Proof: See [12].

Once the eigenvalue pairs are chosen, we design the LQR based on Eq. (14) to guarantee zero steady-state error. The eigenvalue pairs in the four experiments are [−10 −100], [−20 −200], [−30 −300], and [−50 −500], respectively. Note that they are fixed before the trajectory generation process.

3.2 Attitude and Position Controller

The attitude and position controllers are two PD controllers with feed-forward. The feed-forward is not allowed in the altitude controller, because this research presumes that no feed-forward is used in altitude tracking. In contrast, the attitude and position controllers do use the feed-forward information, since we expect no dynamic state error in attitude. Equations (15) and (16) give the settings of the PD controllers.

$$\ddot{x} = \ddot{r}_x + 10 \cdot (\dot{r}_x - \dot{x}) + 100 \cdot (r_x - x) \tag{15}$$

$$\ddot{q} = \ddot{r}_q + 80 \cdot (\dot{r}_q - \dot{q}) + 100 \cdot (r_q - q) \tag{16}$$
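The paper specifies the altitude controller through its closed-loop eigenvalue pairs but does not list the gains or the LQR weight matrices. As a hedged illustration, the sketch below recovers gains directly by pole placement, which is an assumption made here and not necessarily how the authors tuned the LQR. Substituting Eq. (13) into Eq. (12) gives the characteristic polynomial $s^2 + (k_2/M)\,s + k_1/M$, so the gains follow from the desired eigenvalues; the function names are hypothetical.

```python
import cmath

M = 0.54  # mass in kg, Eq. (1)


def gains_from_eigenvalues(lam1, lam2, mass=M):
    """Gains that place the closed-loop altitude poles at lam1 and lam2.

    Substituting Eq. (13) into Eq. (12) gives
        y_ddot = -(k1/M) y - (k2/M) y_dot + (N1/M) reference,
    whose characteristic polynomial is s^2 + (k2/M) s + (k1/M).
    """
    k1 = mass * lam1 * lam2
    k2 = -mass * (lam1 + lam2)
    n1 = k1  # Eq. (14): zero steady-state error
    return k1, k2, n1


def closed_loop_eigenvalues(k1, k2, mass=M):
    """Roots of s^2 + (k2/M) s + (k1/M) = 0."""
    b, c = k2 / mass, k1 / mass
    disc = cmath.sqrt(b * b - 4.0 * c)
    return (-b + disc) / 2.0, (-b - disc) / 2.0


for pair in [(-10, -100), (-20, -200), (-30, -300), (-50, -500)]:
    k1, k2, n1 = gains_from_eigenvalues(*pair)
    print(pair, round(k1, 3), round(k2, 3), closed_loop_eigenvalues(k1, k2))
```

Setting $N_1 = k_1$ makes the steady-state altitude equal to the reference, in line with the theorem above.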
3.3 Trajectory Design

A trajectory differs from a path in that a trajectory is a time-specified function [17]. Although the path (a 5-m vertical climb completed in 1 s) was already defined in Task Allocation, it carries no time assignment. The trajectory could simply be designed with a constant velocity. However, a better solution, i.e., a lower value of the cost function in Eq. (11), may be found by assigning time along the path properly. Our ultimate goal is the lowest cost in Eq. (11) for the actual tracking result. The conventional approach is to pose the following optimal trajectory design problem:

$$\min \int_0^1 \ddot{y}^2 \, dt \tag{17}$$

subject to

$$y(t = 0) = 0 \tag{18}$$

$$\dot{y}(t = 0) = 0 \tag{19}$$

$$y(t = 1) = 5 \tag{20}$$

$$\ddot{y} = \frac{1}{M} \cdot u - g \tag{21}$$

$$0 \le y \le 5 \tag{22}$$
The analytical solution does not exist for this problem. Moreover, tracking the resulting trajectory with the designed controller, Eq. (12) and (13), introduces the dynamic state error, which is the source of the feedback. Because the vehicle still broadly follows the planned trajectory, the dynamic state error may seem acceptable during tracking. However, its existence means that the UAV does not exactly follow the designed time-specified trajectory: unless the dynamic state error is zero, i.e., the actual trajectory coincides with the desired one, the actual motion of the UAV does not attain the optimum of Eq. (17)–(22). Exact following is in fact impossible without feed-forward, as prescribed here, since the dynamic state error is the source of the feedback-control signal.

Several measures could reduce this dynamic state error. The one used in this research is to include the dynamic state error in the objective function. Instead of the conventional objective in Eq. (17), we consider the modified objective in Eq. (23).

$$\min \int_0^1 \left( \ddot{y}^2 + \mu \cdot e(t)^2 \right) dt \tag{23}$$

Here, e(t) is the estimate of the dynamic state error given by Eq. (24) and (25), which is discussed thoroughly in [12].

$$e(t, n) = \sum_{i=1}^{n} v_{ref}\!\left( \frac{t}{n} \cdot i \right) \cdot \frac{t}{n} \cdot (1 - p)^{n+1-i} \tag{24}$$

$$e(t) = \lim_{n \to \infty} e(t, n) \tag{25}$$
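To make Eq. (24) and its limit in Eq. (25) concrete, the following sketch evaluates the finite sum for increasing n. It takes $p = 1 - e^{\lambda t / n}$, which is Eq. (28) in Sect. 4, and uses a constant-velocity reference of 5 m/s purely as an illustrative assumption. For constant $v_{ref} = v$, the limit evaluates in closed form to $v\,(e^{\lambda t} - 1)/\lambda$. The function names are hypothetical.

```python
import math


def error_estimate(v_ref, t, lam, n):
    """Finite-n estimate e(t, n) of Eq. (24).

    v_ref : callable giving the reference velocity at a time instant
    lam   : equivalent first-order eigenvalue (negative for a stable loop)
    p is taken as 1 - exp(lam * t / n), which is Eq. (28) of the paper.
    """
    dt = t / n
    one_minus_p = math.exp(lam * dt)
    total = 0.0
    for i in range(1, n + 1):
        total += v_ref(dt * i) * dt * one_minus_p ** (n + 1 - i)
    return total


def constant_velocity_limit(v, t, lam):
    """Closed-form limit of Eq. (25) for a constant reference velocity v."""
    return v * (math.exp(lam * t) - 1.0) / lam


v = 5.0  # 5 m in 1 s at constant speed (illustrative reference)
for n in (10, 100, 1000, 10000):
    print(n, error_estimate(lambda s: v, 1.0, -20.0, n))
print("limit", constant_velocity_limit(v, 1.0, -20.0))
```

The printed values approach the limit as n grows, which illustrates why the finite-n form and the limit form can give different estimates for a coarse discretization.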
We use Eq. (25) to estimate the dynamic state error in Eq. (23) because both the system and the controller are continuous. In addition, the eigenvalue pairs of the tracking controller, [−10 −100], [−20 −200], [−30 −300], and [−50 −500], are reduced to the single eigenvalues −10, −20, −30, and −50, respectively, to fit the dynamic state error estimation in Eq. (24) and (25).

The coefficient μ in Eq. (23) is the weight of the dynamic state error. Equation (23) provides the trade-off between the trajectory design and the controller’s ability (the dynamic state error). A large weight on the dynamic state error tends to produce a smaller dynamic state error in tracking. The drawback is that optimality may suffer: the planned trajectory will no longer be the one with the lowest acceleration. The larger the weight, the greater the compromise in optimality, and vice versa.

Note that this trade-off appears only in the trajectory plan process; for each dynamic state error weight, a particular trajectory is designed for the pre-defined controller. The corresponding actual trajectory in tracking may not inherit a similar trade-off unless it coincides with the designed trajectory. For this reason, the actual dynamic state error and the actual acceleration are not necessarily traded off in the same way as in the trajectory design process. This research analyzes the trade-off between the dynamic state error and the objective (the acceleration) both in the trajectory design process and in the tracking control result.

3.4 Experiments and Testbed

In general, each experiment consists of two parts: the optimal trajectory design and the tracking control of the designed trajectory. Unlike conventional studies, the controller is defined before the optimal trajectory design process.
There are four experiments in total. In each, the altitude controller is tuned with a different eigenvalue pair: [−10 −100], [−20 −200], [−30 −300], or [−50 −500]. We take [−20 −200] as an example to demonstrate the experimental procedure. The first step is to determine the dynamic state error function, Eq. (24) and (25). This step requires a first-order eigenvalue. The equivalent dominant eigenvalue for the eigenvalue pair [−20 −200] is −20.
Once the dynamic state error function is obtained, the next step is to solve a multiobjective optimization problem. The objectives of the trade-off scheme, Eq. (23), are the acceleration and the dynamic state error. We use the weighted sum method [13] to convert the multiobjective problem into a single-objective problem with a varying coefficient, μ in Eq. (23). The constraints are Eq. (18)–(22). The problem is solved by direct collocation (trapezoidal); the first step in a direct method is discretization. A direct method is chosen because the dynamic state error estimator, Eq. (24) and (25), is also discrete in nature. The trajectory design task is solved in MATLAB. Further details are given in Sect. 4; for a deeper treatment of this solver, see [18].

A Pareto frontier for trajectory design is obtained by repeating the optimal trajectory design problem with different dynamic state error weights. This Pareto frontier exposes the trade-off between the predicted dynamic state error and the cost (designed acceleration). In addition, each trajectory design problem with a given dynamic state error weight in Eq. (23) yields a trajectory, which then serves as the reference in the subsequent tracking control. We simulate the tracking control in a 2-D UAV simulator in SIMULINK (Fig. 3). It comprises two sections, the Controller Section and the Dynamics Section.
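The weighted-sum step described above can be illustrated on a toy problem. The sketch below is not the paper's trajectory problem: it sweeps the weight μ over the two objectives of Schaffer's function, $f_1(x) = x^2$ and $f_2(x) = (x - 2)^2$, for which the minimizer of $f_1 + \mu f_2$ is available in closed form. In the paper, $f_1$ would be the acceleration cost and $f_2$ the integrated squared error, and each solve would be a collocation problem handed to fmincon. All names here are illustrative.

```python
def f1(x):
    """First toy objective (stands in for the acceleration cost)."""
    return x * x


def f2(x):
    """Second toy objective (stands in for the error term)."""
    return (x - 2.0) ** 2


def weighted_sum_minimizer(mu):
    """Minimizer of f1(x) + mu * f2(x), from d/dx = 2x + 2*mu*(x - 2) = 0."""
    return 2.0 * mu / (1.0 + mu)


def pareto_front(weights):
    """One (f1, f2) point per weight, mirroring one solve per mu."""
    front = []
    for mu in weights:
        x = weighted_sum_minimizer(mu)
        front.append((f1(x), f2(x)))
    return front


weights = [0.0, 0.1, 0.5, 1.0, 2.0, 10.0, 100.0]
for mu, (a, b) in zip(weights, pareto_front(weights)):
    print(f"mu={mu:7.1f}  f1={a:.4f}  f2={b:.4f}")
```

Increasing μ lowers the second objective at the expense of the first, which is the same mechanism that traces the Pareto frontier of the trajectory design.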
Fig. 3. The dynamic system simulator is written in SIMULINK.
We implement the altitude controller and the attitude controller in the Controller Section in SIMULINK. The altitude controller with the eigenvalue pair [−20 −200] is specified in Eq. (12)–(14), and the attitude controller is set according to Eq. (15)–(16). The Dynamics Section of the simulator is based on the UAV’s dynamic equations, Eq. (7)–(9). Given a designed trajectory (reference) as input, the simulator outputs the tracking result (the actual trajectory in tracking). Note that the trajectory design process yields a set of designed trajectories (references), one for each dynamic state error weight.
Likewise, we obtain as many actual tracking trajectories, since each is the tracking result of the corresponding designed trajectory. We map these actual tracking results onto a ‘pseudo-Pareto frontier’, a plot that summarizes the actual tracking trajectories. It is generated as follows. For each actual tracking trajectory, we record the actual dynamic state error and the actual acceleration. From these two quantities, we plot a dot on a cost-error plane whose x-axis and y-axis represent the cost function and the dynamic state error, respectively. Different tracking trajectories yield a collection of dots. These dots show a pattern similar to a typical Pareto frontier; we therefore call this plot the pseudo-Pareto frontier. Note that every dot on the pseudo-Pareto frontier corresponds to a dot on the Pareto frontier for trajectory design.
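The procedure for producing one dot of the pseudo-Pareto frontier can be sketched as follows. This is a simplified stand-in for the SIMULINK simulator, under several assumptions: only the altitude loop of Eq. (10), (12), and (13) is modeled, the gains come from pole placement on the eigenvalue pair, integration uses forward Euler with a hypothetical step of 1e-4 s, and the reference is a constant-velocity climb chosen only for illustration.

```python
M = 0.54   # mass in kg, Eq. (1)
G = 9.81   # gravity in m/s^2, Eq. (3)


def track_altitude(reference, lam1, lam2, t_end=1.0, dt=1e-4):
    """Simulate Eq. (12) under the control law of Eq. (13) with forward Euler.

    reference : callable t -> desired altitude (no feed-forward is used)
    Returns (cost, error_integral, final_error), where
      cost           approximates the integral of y_ddot^2,
      error_integral approximates the integral of (reference - y)^2.
    """
    k1 = M * lam1 * lam2        # pole-placement gains (assumption)
    k2 = -M * (lam1 + lam2)
    n1 = k1                     # Eq. (14)
    y, y_dot = 0.0, 0.0
    cost = 0.0
    error_integral = 0.0
    steps = int(round(t_end / dt))
    for k in range(steps):
        t = k * dt
        ref = reference(t)
        thrust = -k1 * y - k2 * y_dot + n1 * ref + M * G   # u1 + u2, Eq. (13)
        y_ddot = thrust / M - G                            # Eq. (10)
        err = ref - y
        cost += y_ddot ** 2 * dt
        error_integral += err ** 2 * dt
        y += y_dot * dt
        y_dot += y_ddot * dt
    return cost, error_integral, reference(t_end) - y


cost, err_int, final_err = track_altitude(lambda t: 5.0 * t, -20.0, -200.0)
print(cost, err_int, final_err)
```

Each call returns one (cost, error) pair, i.e., one dot on the cost-error plane. For this ramp reference the steady tracking lag is $5 \cdot (1/20 + 1/200) = 0.275$ m, slightly above the 0.25 m predicted by the first-order approximation with λ = −20.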
4 Solver Setup

This section details the mathematical basis of the solver. We solve the trajectory optimization problem using direct collocation. The first step in a direct method is to discretize the dynamics, the objective functions, and the dynamic state error estimation (Eq. (24)–(25)). After discretization, the optimization problem becomes a nonlinear programming (NLP) problem. The MATLAB built-in function fmincon is then used to solve this NLP problem, followed by several further modifications, e.g., smoothing.

To discretize the dynamic constraint, Eq. (21), we divide the time horizon into 60 segments, each lasting 1/60 s. Only the knot points enter the calculation. The dynamic constraint, written as the differential equation in Eq. (26), is approximated at the knot points so that it is preserved under discretization. Equation (27) gives the trapezoidal rule used to discretize the dynamic constraints [19].

$$\dot{x} = f(t) \tag{26}$$

$$x(t_k) - x(t_{k-1}) = \frac{1}{2} \cdot \Delta t \cdot \left( f(t_k) + f(t_{k-1}) \right) \tag{27}$$
Here, Δt is the time segment, equal to the difference between $t_k$ and $t_{k-1}$. As mentioned above, it is 1/60 s in our solver.

As for the dynamic state error estimation, Eq. (24) and (25), several derivations are needed before it can be applied. First, the control constant p in Eq. (24) is determined by the controller defined in Eq. (12)–(14).

$$p = 1 - e^{\lambda \cdot \frac{t}{n}} \tag{28}$$

λ is the equivalent first-order eigenvalue. For example, the controller with the closed-loop eigenvalue pair [−20 −200] is approximated with the first-order eigenvalue λ = −20. Substituting Eq. (28) into Eq. (24)–(25) gives the dynamic state error estimate in integral form, Eq. (29).

$$e(t) = e^{\lambda \cdot t} \cdot \int_0^t v_{ref}(\tau) \cdot e^{-\lambda \cdot \tau} \, d\tau \tag{29}$$
Alternatively, Eq. (29) can be obtained by solving an equivalent differential equation directly. However, for a discrete system this route can lead to an unreasonable dynamic state error estimate that differs from Eq. (24). For example, Eq. (29) may be discretized into Eq. (30), but Eq. (30) is not identical to Eq. (24) unless n is infinite.

$$e(t, n) = e^{\lambda \cdot t} \cdot \sum_{i=1}^{n} v_{ref}\!\left( \frac{t}{n} \cdot i \right) \cdot e^{-\lambda \cdot \left( \frac{t}{n} \cdot i \right)} \cdot \frac{t}{n} \tag{30}$$

Choosing among Eq. (24), (29), and (30) can be controversial. The basic idea is that Eq. (29) suits a continuous system, whereas Eq. (24) suits a discrete system or a discrete controller. Admittedly, the controller and the system are both continuous in this problem, but direct collocation discretizes them in the trajectory optimization process. In this research, we use Eq. (29) to estimate the dynamic state error. A further comparison of these formulas is beyond the scope of this research.

Even so, the integral in Eq. (29) cannot be evaluated directly because our calculation relies on data at the knot points. The integral is therefore approximated by Eq. (31).

$$\int_0^{N \cdot \Delta t} h(t) \, dt = \Delta t \cdot \begin{bmatrix} \frac{1}{2} & 1 & 1 & \cdots & 1 & 1 & \frac{1}{2} \end{bmatrix} \cdot \begin{bmatrix} h(0) \\ h(\Delta t) \\ h(2 \cdot \Delta t) \\ \vdots \\ h((N-2) \cdot \Delta t) \\ h((N-1) \cdot \Delta t) \\ h(N \cdot \Delta t) \end{bmatrix} \tag{31}$$
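The discretization steps of this section can be sketched in a few lines of Python. The sketch shows the trapezoidal defect of Eq. (27), the quadrature weights of Eq. (31), and the evaluation of Eq. (29) at the knot points using that quadrature. It is an illustration under assumptions rather than the authors' MATLAB implementation: the function names are hypothetical, and the constant 5 m/s reference velocity is chosen only to make the result easy to check.

```python
import math


def trapezoid_weights(n_segments, dt):
    """Row vector of Eq. (31): dt * [1/2, 1, ..., 1, 1/2]."""
    w = [dt] * (n_segments + 1)
    w[0] = w[-1] = dt / 2.0
    return w


def integrate_on_knots(samples, dt):
    """Approximate the integral of h from its knot-point samples, Eq. (31)."""
    w = trapezoid_weights(len(samples) - 1, dt)
    return sum(wi * hi for wi, hi in zip(w, samples))


def collocation_defect(x_prev, x_next, f_prev, f_next, dt):
    """Residual of Eq. (27); the NLP drives this to zero on every segment."""
    return x_next - x_prev - 0.5 * dt * (f_next + f_prev)


def error_on_knots(v_ref_samples, lam, dt):
    """Eq. (29) evaluated at every knot, with the integral done by Eq. (31)."""
    errors = []
    for k in range(len(v_ref_samples)):
        h = [v_ref_samples[j] * math.exp(-lam * j * dt) for j in range(k + 1)]
        integral = integrate_on_knots(h, dt) if k > 0 else 0.0
        errors.append(math.exp(lam * k * dt) * integral)
    return errors


N = 60            # number of time segments, as in the paper
dt = 1.0 / N
e = error_on_knots([5.0] * (N + 1), -20.0, dt)
print(e[-1])
```

With 60 segments the knot-point estimate at t = 1 s lands close to the continuous value of Eq. (29) for this reference, about 0.25, with a small quadrature error that shrinks as the grid is refined.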
5 Results

The results of the optimal trajectory design for the eigenvalue pair [−20, −200] are shown in Fig. 4.
Fig. 4. Trajectory design with different μ
Fig. 5. Actual trajectories (tracking results)
As the coefficient μ in Eq. (23) increases, the designed trajectory changes (Fig. 4). The actual tracking trajectory changes accordingly (Fig. 5) for the eigenvalue pair [−20, −200]. The blue curve in Fig. 6 is the Pareto frontier of the trajectory design for the eigenvalue pair [−20 −200]. The x axis represents the cost function defined in Eq. (17). The y axis represents the integral of the squared dynamic state error, Eq. (32).

$$\int_0^1 e(t)^2 \, dt \tag{32}$$
The blue curve shows a trade-off between the trajectory plan (cost function) and the controller design (dynamic state error).
Fig. 6. Blue curve: Pareto frontier in the trajectory design process. Red curve: ‘pseudo Pareto frontier’ in the tracking control process.
Note that each data point on the blue curve in Fig. 6 corresponds to a designed trajectory in Fig. 4 and thus to its tracking result in Fig. 5. From each tracking result we know the exact values of the cost function and the dynamic state error, so we can plot one data point in Fig. 6 for each actual tracking trajectory. The result is the red curve in Fig. 6. Each dot on the trajectory Pareto frontier (blue curve) therefore corresponds to the dot of its tracking result on the red curve. We call this relationship ‘mapping’: the Pareto frontier (blue) is mapped to a ‘pseudo Pareto frontier’ (red), which represents the actual result of the tracking process.

Zooming in on the initial part of the pseudo Pareto frontier reveals a shape that is atypical for a Pareto frontier (Fig. 7). Both the cost and the dynamic state error decrease near the neck of the ‘pseudo Pareto frontier’. This atypical near-neck shape also occurs when the system is equipped with our other controllers with different eigenvalues. Our ultimate goal is to find the actual trajectory with the lowest cost. Thus, the best compromise lies at the dot with the lowest cost on the pseudo Pareto frontier (red curve in Fig. 6 or Fig. 7).
Fig. 7. The ‘neck’ part of the ‘pseudo Pareto frontier’ zoomed from Fig. 6.
Note that the actual tracking result is not available during the trajectory design process. The dot on the Pareto frontier (blue curve in Fig. 6) that maps to the lowest-cost dot on the pseudo Pareto frontier is indicated in Fig. 8.
Fig. 8. The best compromised choice in the Pareto frontier in the trajectories design process.
Similar results are found with different LQR controller settings (eigenvalues). The relationship between the ‘pseudo Pareto frontier’ and the controller is detailed in the next section.
6 Stiffness of the Controller

The Pareto frontiers (blue) and the mapped pseudo Pareto frontiers (red) for the UAV with the remaining LQR parameter settings (eigenvalue pairs) are plotted in Figs. 9, 10 and 11. They correspond to the LQR eigenvalue pairs [−10, −100], [−30, −300], and [−50, −500], respectively.
Fig. 9. Eigenvalue pair [−10, −100]
Fig. 10. Eigenvalue pair [−30, −300]
All the blue curves in Figs. 6, 9, 10 and 11 are typical Pareto frontiers of a multiobjective optimization (the trajectory design problem), while the mapped red pseudo Pareto frontiers represent the cost and the error in the tracking results.
Fig. 11. Eigenvalue pair [−50, −500]
In addition, each of these pseudo Pareto frontiers contains the atypical shape shown in Fig. 7. As in Fig. 8, we find the data dot with the lowest cost on the pseudo Pareto frontier and identify the corresponding data dot on the Pareto frontier. The corresponding data dots on the Pareto frontiers for the remaining LQR parameter settings are marked in Figs. 12, 13 and 14 (eigenvalue pairs [−10, −100], [−30, −300], and [−50, −500], respectively). Each of these dots corresponds to the trajectory with the lowest cost on its pseudo Pareto frontier. All of them are near the neck of the Pareto frontiers.
Fig. 12. Eigenvalue pair [−10, −100]
Fig. 13. Eigenvalue pair [−30, −300]
Fig. 14. Eigenvalue pair [−50, −500]
We model the atypical shape of the pseudo Pareto frontier (e.g., Fig. 7) with the physical model shown in Figs. 15 and 16. A rope obeys Hooke’s law, Eq. (33), and is fixed to the walls at both ends.

$$F = -k \cdot x \tag{33}$$
We apply a force of 1 newton at the middle of the rope (Fig. 15). As a result, the rope stretches and leaves its initial level position (Fig. 16). We assume that the initial length of the rope is 2b and that the middle of the rope is displaced by a along the direction of the force. Under these assumptions, the spring constant k in Eq. (33) is given by Eq. (34).

$$k = \frac{1}{4 \cdot a \cdot \left[ 1 - \left( 1 + \left( \frac{a}{b} \right)^2 \right)^{-\frac{1}{2}} \right]} \tag{34}$$
Fig. 15. A force of 1 newton on a rope (length: 2b) fixed to the walls at both ends.

Fig. 16. The rope obeys Hooke’s law and changes its shape.
As indicated in Fig. 17, the parameter a in Eq. (34) is taken as the largest decrease in the cost function from the initial cost, and 2b in Eq. (34) is taken as the initial cost.
Fig. 17. 2b is the initial cost. a is the largest decrease from initial cost in the cost function. (Eigenvalues: [−20 −200])
Consequently, we can calculate a spring constant from Eq. (34) for the eigenvalue pair [−20 −200]. The spring constants for the other eigenvalue pairs, namely [−10, −100], [−30, −300], and [−50, −500], are obtained in the same way. The spring constants are plotted against the corresponding eigenvalues in Fig. 18.
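Equation (34) follows from a force balance on the stretched rope: each half has length $\sqrt{a^2 + b^2}$, the total extension is $2(\sqrt{a^2 + b^2} - b)$, and the two halves together pull back on the mid-point with twice the tension times $a / \sqrt{a^2 + b^2}$. The sketch below computes k from Eq. (34) and checks that this balance returns the applied 1 N. The values a = 0.3 and b = 1.0 are hypothetical and are not taken from the paper's figures.

```python
import math


def spring_constant(a, b, force=1.0):
    """Spring constant of Eq. (34) for mid-point displacement a, half-length b."""
    return force / (4.0 * a * (1.0 - (1.0 + (a / b) ** 2) ** -0.5))


def midpoint_force(k, a, b):
    """Force needed to hold the mid-point at displacement a (force balance)."""
    half = math.hypot(a, b)                 # stretched length of each half
    tension = k * 2.0 * (half - b)          # Hooke's law on the total extension
    return 2.0 * tension * a / half         # two halves pull back along a


k = spring_constant(a=0.3, b=1.0)           # illustrative values only
print(k, midpoint_force(k, 0.3, 1.0))
```

For a fixed b, a smaller displacement a gives a larger k, which is how a less pronounced dip in the pseudo Pareto frontier translates into a stiffer equivalent spring.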
Fig. 18. The spring model for the pseudo Pareto frontiers.
7 Conclusions and Discussions

When the LQR controller is more aggressive (i.e., has an equivalent eigenvalue of larger magnitude), the following results are observed:

1. The designed cost function becomes closer to the real cost function. This can be seen in Figs. 6, 9, 10 and 11; the designed cost function (blue curve) is closer to the true tracking result (red curve) for a larger eigenvalue. An aggressive controller brings the tracking result closer to the designed trajectory.

2. The best compromise with the lowest actual cost moves closer to the head region of the Pareto frontier of the trajectory design. The data dots in Figs. 8, 12, 13 and 14 are the best compromises, which correspond to the actual tracking results with the lowest cost on the pseudo Pareto frontier. These data points are closer to the head of the Pareto frontier of the trajectory design when the LQR equivalent eigenvalue is larger. The actual tracking result with the lowest cost does not occur at μ = 0, which corresponds to the initial (top) data dot on the Pareto frontier. Instead, better tracking is rewarded by sacrificing cost in the trajectory plan process. Although the designed trajectory has the lowest cost when μ = 0, the tracking process introduces the dynamic state error, so the actual cost is not the lowest compared with a compromised trajectory that has better tracking properties. An aggressive controller weakens the effect of this compromise, since such a strong controller produces less dynamic state error. That is why the best compromised trajectory, with the lowest cost in the tracking process, lies closer to the initial data dot (head region) of the Pareto frontier when the controller is more aggressive.

3. The equivalent spring model becomes more stubborn, i.e., stiffer. The spring coefficient k of the physical model in Figs. 15 and 16 represents the stubbornness of the rope. A rope with a larger k is more stubborn and requires more force to change its shape. Likewise, a large spring coefficient k for our pseudo Pareto frontier, e.g., Figs. 17 and 18, indicates that the corresponding pseudo Pareto frontier is more stubborn; the atypical shape of the pseudo Pareto frontier shown in Fig. 17 becomes less pronounced.
The nature of the LQR parameters explains this outcome. Because an aggressive controller makes the actual trajectory follow the designed trajectory, and the predicted dynamic state error, more closely, the pseudo Pareto frontier moves closer to its corresponding Pareto frontier. The Pareto frontier itself has no atypical shape (this special shape exists only in the pseudo Pareto frontier). Thus, an aggressive controller removes this atypical shape to a larger degree. That is the underlying reason for the trend in Fig. 18.

The pseudo Pareto frontier in this paper is useful for analyzing the dynamic state error. The method of analyzing the controller through a physical model can be extended to other controllers, but its limitations are clear. Firstly, plotting pseudo Pareto frontiers relies heavily on the estimation of the dynamic state error. The analytical solution for the dynamic state error used in this research works well only for low-order systems; an analytical estimate of the dynamic state error for a high-order system may not exist. Secondly, the solver used to compute the pseudo Pareto frontiers can also affect the results. Improper solver settings (e.g., an improper number of grid points) may lead to unwanted results. Thirdly, this method is so far compatible only with systems equipped with LQR or PD controllers. The pseudo Pareto frontier for nonlinear controllers is beyond the scope of this research, mainly because no estimate of the dynamic state error is available for them.

Several further steps can be explored to improve this approach. Firstly, different formulas for estimating the dynamic state error remain to be compared. Secondly, an estimate of the dynamic state error for advanced nonlinear controllers is desirable and would help in studying the corresponding pseudo Pareto frontiers. Applying our controller to more complicated dynamics is also a future step.
References

1. Cowling, I.D., Yakimenko, O.A., Whidborne, J.F., Cooke, A.K.: Direct method based control system for an autonomous quadrotor. J. Intell. Robot. Syst. 60(2), 285–316 (2010)
2. Yoon, J., Lee, J.: Altitude and roll control of a hovering quad-rotor air vehicle using the multi-objective approximate optimization of proportional–integral–differential control. Eng. Optim. 49(10), 1704–1718 (2017)
3. Hung, K.T., Liu, J.S., Chang, Y.Z.: A comparative study of smooth path planning for a mobile robot by evolutionary multi-objective optimization. In: 2007 International Symposium on Computational Intelligence in Robotics and Automation, pp. 254–259. IEEE (June 2007)
4. Zhou, X., Zhang, X.: Multi-objective-optimization-based control parameters auto-tuning for aerial manipulators. Int. J. Adv. Robot. Syst. 16(1), 172988141982807 (2019)
5. Davoodi, M., Panahi, F., Mohades, A., Hashemi, S.N.: Multi-objective path planning in discrete space. Appl. Soft Comput. 13(1), 709–720 (2013)
6. Ahmed, F., Deb, K.: Multi-objective optimal path planning using elitist non-dominated sorting genetic algorithms. Soft. Comput. 17(7), 1283–1299 (2013)
7. Coelho, B.N., et al.: A multi-objective green UAV routing problem. Comput. Oper. Res. 88, 306–315 (2017)
8. Hendricks, E., Jannerup, O., Sørensen, P.H.: Linear Systems Control: Deterministic and Stochastic Methods, pp. 522–529. Springer, Berlin (2008). https://doi.org/10.1007/978-3-540-78486-9
9. Bryson, A.E.: Applied Optimal Control: Optimization, Estimation and Control. CRC Press (1975)
10. Patterson, M.A., Hager, W.W., Rao, A.V.: A ph mesh refinement method for optimal control. Optim. Control Appl. Meth. 36(4), 398–421 (2015)
11. Hou, H., Hager, W., Rao, A.: Convergence of a Gauss pseudospectral method for optimal control. In: AIAA Guidance, Navigation, and Control Conference, p. 4452 (August 2012)
12. Shen, Z., Tsuchiya, T.: A novel formula calculating the dynamic state error and its application in UAV tracking control problem. arXiv preprint arXiv:2108.07968 (2021)
13. Marler, R.T., Arora, J.S.: Survey of multi-objective optimization methods for engineering. Struct. Multidiscip. Optim. 26(6), 369–395 (2004)
14. Yoo, S., Harman, M.: Using hybrid algorithm for Pareto efficient multi-objective test suite minimisation. J. Syst. Softw. 83(4), 689–701 (2010)
15. Zitzler, E., Thiele, L.: Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach. IEEE Trans. Evol. Comput. 3(4), 257–271 (1999)
16. Cowling, I.D., Yakimenko, O.A., Whidborne, J.F., Cooke, A.K.: Direct method based control system for an autonomous quadrotor. J. Intell. Rob. Syst. 60(2), 285–316 (2010)
17. Aguiar, A.P., Dačić, D.B., Hespanha, J.P., Kokotović, P.: Path-following or reference tracking?: an answer relaxing the limits to performance. IFAC Proc. Vol. 37(8), 167–172 (2004)
18. Kelly, M.: An introduction to trajectory optimization: How to do your own direct collocation. SIAM Rev. 59(4), 849–904 (2017)
19. Betts, J.T.: Practical Methods for Optimal Control and Estimation using Nonlinear Programming. Society for Industrial and Applied Mathematics (2010)
20. Reyes-Valeria, E., Enriquez-Caldera, R., Camacho-Lara, S., Guichard, J.: LQR control for a quadrotor using unit quaternions: modeling and simulation. In: 23rd International Conference on Electronics, Communications and Computing, CONIELECOMP 2013, pp. 172–178. IEEE (March 2013)
21. Khatoon, S., Gupta, D., Das, L.K.: PID & LQR control for a quadrotor: modeling and simulation. In: 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 796–802. IEEE (September 2014)
22. Cowling, I.D., Whidborne, J.F., Cooke, A.K.: Optimal trajectory planning and LQR control for a quadrotor UAV. In: International Conference on Control (August 2006)
23. Lee, T., Leok, M., McClamroch, N.H.: Geometric tracking control of a quadrotor UAV on SE(3). In: 49th IEEE Conference on Decision and Control (CDC), pp. 5420–5425. IEEE (December 2010)
24. Luukkonen, T.: Modelling and control of quadcopter. Independent research project in applied mathematics, Espoo, 22 August, p. 22 (2011)
Remote Manipulation of a Robotic Arm with 6 DOF via IBSV Using a Raspberry Pi and Machine Vision
Sandro Balarezo1(B), Xavier Arias2, and Kevin Espín2
1 Universidad Nacional de Colombia, Medellín, Colombia
[email protected]
2 Universidad de Las Fuerzas Armadas ESPE, Sangolquí, Ecuador
{axarias,kaespin2}@espe.edu.ec
Abstract. Robotic systems are used worldwide in social, academic and even medical settings. The company WLKATA has developed a 6 DOF manipulator arm with a movement precision of 0.2 mm, called Mirobot, which is the arm used in this research project. In this document, an image processing system is implemented using OpenCV on a Raspberry Pi to determine the Cartesian coordinates of a cube and of the end effector of the robotic arm, both located on a work area 15 × 20 cm wide and tall, respectively. The movement of the robotic arm is governed by a servo vision (VS) control system that combines position-based servo vision (PBVS) and image-based servo vision (IBSV). This system generated displacement trajectories between the end effector and the cube with a minimum error of 1%, and demonstrated that the Mirobot can be controlled from a Raspberry Pi board. Finally, it can be concluded that the movement, precision, repeatability and speed of this robotic arm are of industrial quality, which favors the development of new control systems based on algorithms running on free hardware and software, reducing the cost of implementation.
Keywords: Robotic arm · Servo vision · Mirobot · IBSV
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 845–854, 2022. https://doi.org/10.1007/978-3-031-10464-0_58
1 Introduction
Robotics is found in all areas of human development, from robotic arms used in industrial assembly processes for automobiles to robotic arms that facilitate work in hospitals. In September 2001, the first transatlantic surgery was performed in Strasbourg, where the gallbladder of a 68-year-old patient was successfully removed using a robotic arm remotely manipulated by a surgeon in New York. Currently, health services are affected by staff shortages, especially of nurses, with a considerable reduction in the arrival of new personnel. As a result, many health centers have limited staff to care for their patients and to assist doctors during surgery. The main activity of a nurse during surgery is to provide the instruments when the
doctor requires them. Without enough staff to perform these jobs, patients are not treated quickly, and long waiting times increase the risk of complications. To allow the surgeon to work at any time, with or without a full staff of nurses, a robotic arm is needed that can manipulate the surgical instruments [1, 2].
The solution to this problem consists of implementing a control system for a robotic arm; in this case the WLKATA Mirobot was used. An artificial vision system capable of differentiating the colors of different objects was also developed. Operation begins when a user enters a keyboard command naming the color or object required; the vision system recognizes the object's initial position and the robotic arm moves it to a fixed final position where another user can pick it up without difficulty. In this way the project aims to speed up surgical interventions and the work of nurses in hospitals and clinics.
This document is divided into the following sections: Sect. 1 reviews basic concepts, the main parts of a manipulator robot and the techniques based on servo vision (VS), detailing PBVS and IBSV. Section 2 describes the materials used (the robotic arm, the control card and the work area) and the control technique developed with image processing. Finally, Sect. 3 shows the results obtained and future work.
1.1 Manipulator Robot
The robots with the greatest presence in industry are articulated robotic arms. According to the RIA (Robot Institute of America), an industrial robot can be defined as “a multifunctional programmable manipulator designed to move materials, parts, tools or special devices through varied movements, programmed for the execution of different tasks”. The most important elements of a robot are the mechanical system, actuators, sensors and control systems.
Mechanical System: This is the physical structure, composed of several joints. The terminal element, or end effector, is distinguished from the arm because it can be anything from tweezers to special devices for different applications or tasks. Adding joints can improve maneuverability, but makes control significantly harder.
Actuators: These are the elements responsible for generating the force needed to move the mechanical structure of the robot. Hydraulic elements can be used to obtain high power, as can pneumatic systems. Electric motors, especially direct current motors, are increasingly common in manipulator robots. New forms of drive based on biomechanical concepts have also been sought.
Sensors and Control Systems: Hierarchically, the servo control and joint monitoring systems are at a lower level, below the mechanical systems and actuators. The first level of robotic arm control requires speed and position feedback to control the joints. The second level of control generates trajectories, so that the position of the end effector is known at all times as it moves from one place to another. The higher
levels of control handle communication with the user, interpretation of programs, task planning and, finally, sensory perception [3].
1.2 Servo Vision (VS)
Servo vision, also known as vision-based robot control (VS), consists of obtaining feedback information through a vision sensor or camera and using it to control the movements of the end effector of a robot. There are two ways to set up the end effector and camera:
• Closed loop endpoint, or eye-in-hand control, consists of attaching the camera to the end effector to observe the position of the target.
• Open loop endpoint consists of placing the camera at a fixed point in the world and observing both the target and the movement of the robot's end effector [4].
Vision-based robot control techniques can be classified as follows:
Position-Based Visual Servo (PBVS): This technique is based on models developed with a single camera. The pose of the object of interest is first estimated with respect to the camera, and the resulting commands are then sent to the robot's control system, which in turn moves the end effector. The type of image used for this control system allows the 3D position of the target to be estimated on a Cartesian plane, unlike IBSV, which uses 2D information [5].
Image-Based Visual Servo (IBSV): The image-based robot control system was proposed by Weiss and Sanderson. The control law is based on the error between the current and desired features in the image plane; no estimate of the target pose is made. The features can include the coordinates of the target, lines or regional moments [6].
2 Materials and Methods
2.1 Materials
Robotic Manipulator Arm: The WLKATA Mirobot is a mini-industrial manipulator robot with 6 degrees of freedom (6 DOF). This robotic arm was developed especially for educational institutions working under the STEAM concept. STEAM refers to an education system that integrates several disciplines (science, technology, engineering, arts and mathematics) to generate new knowledge. The Mirobot is known as a desktop robot, and users can program it with a remote control, graphical programming, teaching mode and game mode [7].
Control System: A Raspberry Pi was used as the system controller, a Logitech C525 HD webcam captures the images, and a 24-volt source supplies power, as shown in Fig. 1.
Fig. 1. Control system with Raspberry [8].
Work Area: A stable structure was designed with the help of CAD/CAE software and built with aluminum supports. Its function is to hold the robotic arm, the camera and the Raspberry Pi-based control system in fixed positions. Several support elements were made of PLA, a polylactic acid used in rapid prototyping with 3D printers (Fig. 2).
Fig. 2. Work area of the robotic arm [7].
2.2 Method
Coordinate System: The Mirobot has two coordinate systems, one based on its six joints and a Cartesian spatial coordinate system. Figure 3 shows the axes of the Cartesian coordinate system of the base (frame0) and the end effector (frame6).
Kinematic Analysis: An inverse kinematics analysis of the robotic arm yields the coordinates of each of the joints, from which the position and orientation of the end effector can be determined. Figure 4 shows the numbering of the links that make up the Mirobot.
Fig. 3. Mirobot’s Cartesian spatial coordinate system [7].
Fig. 4. Kinematic Analysis of the Mirobot.
Technical Aspects: Table 1 shows the most important parameters of the operation of the Mirobot.
Table 1. Specification of Mirobot parameters.
Specific parameter | Value
Number of axes | 6
Load | 150 g
Repeated positioning accuracy | 0.2 mm
Communication interface | USB/Wifi/Bluetooth
Voltage source | 100 V–240 V, 50/60 Hz
Input power | 12 V/5 A DC
Power | Max 60 W
Working temperature | −10 °C–60 °C
Control System. The main objective of the artificial vision system is to transform the image obtained by the camera into the coordinates of all the elements in the work area of the robot. The image processing was carried out with OpenCV in Python and implemented on a Raspberry Pi 4. Figure 5 shows a schematic diagram of the control system that places the end effector on a given object.
Fig. 5. Block diagram of the control system with artificial vision.
2.3 Machine Vision System
The control system has several stages of operation: image acquisition, image processing to recognize each of the elements involved in the project, and finally mathematical calculations to identify the position of the cube and of the end effector of the robotic arm.
Image Acquisition. The images are acquired with the following elements: camera, lighting and digitizer. Lighting is crucial for correct recognition of object colors, and the physical lighting conditions should be improved before resorting to complex algorithms in software. In this case a 12 Vdc LED strip was placed at the top of the work area. Figure 6 shows the cube and the robotic arm on the work area.
Fig. 6. Implementation of the artificial vision system.
Image Processing. Once an image is acquired, it is converted from BGR to grayscale and thresholded at an intensity value of 150, and noise is removed with a 2D filter. At the end of this stage, contour detection is applied to determine the exact position of the cube. Figure 7 shows the stages of image processing.
Fig. 7. Processing steps to detect a blue cube.
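The processing chain described above (grayscale conversion, thresholding at 150, locating the object) can be illustrated with a minimal, dependency-free sketch. It is not the authors' code: the paper used OpenCV, where the corresponding calls would be `cv2.cvtColor`, `cv2.threshold`, `cv2.findContours` and `cv2.boundingRect`. The function names, the tiny synthetic frame and the grayscale weights are assumptions made for illustration; noise filtering is omitted, and the bounding box is taken over all foreground pixels rather than over individual contours.

```python
def bgr_to_gray(image):
    """Convert a BGR image (rows of (b, g, r) tuples) to grayscale intensities."""
    return [[0.114 * b + 0.587 * g + 0.299 * r for (b, g, r) in row]
            for row in image]

def threshold(gray, level=150):
    """Binary threshold: 1 where the intensity exceeds `level`, else 0."""
    return [[1 if px > level else 0 for px in row] for row in gray]

def bounding_box(mask):
    """Return (x, y, w, h) enclosing all foreground pixels, or None if empty."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    x0, y0 = min(xs), min(ys)
    return (x0, y0, max(xs) - x0 + 1, max(ys) - y0 + 1)

# Synthetic 8x6 frame (hypothetical data): dark background, bright 3x2 "cube".
dark, bright = (20, 20, 20), (255, 255, 255)
frame = [[dark] * 8 for _ in range(6)]
for y in range(2, 4):
    for x in range(3, 6):
        frame[y][x] = bright

box = bounding_box(threshold(bgr_to_gray(frame)))
print(box)
```

The returned tuple plays the role of the position X1 and size W1 used later for the distance calculation.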
The same image processing was applied to detect cubes of different colors on the work area, in order to verify that the vision techniques work properly. In each of these cases, correct lighting and thresholding are required. The results of this process can be seen in Fig. 8.
Fig. 8. Processing steps to detect a yellow cube.
Coordinate Detection. To determine the exact position of the cube and the end effector of the robotic arm, one starts from the contours previously detected during image processing. First, the measurements of the work area are converted by applying Eq. (1), which gives the aspect ratio of the work area measured in centimeters. In this case, a work area of approximately 20 × 15 cm was used.
Aspect ratio = width/height (1)
Once the aspect ratio of 0.75 has been calculated, the formula is applied again to determine the dimensions the image must have in pixels; in this case the frame is 720 × 540 pixels. To determine the distance between the two points, formula (2) was applied.
X2 − (X1 + W1) (2)
where X2 is the position of the end effector, X1 is the position of the cube and W1 is the size of the detected cube. Figure 9 shows the representation of the coordinates on the cube and on the end effector of the robotic arm.
Fig. 9. Representation of the coordinates of the cube and the final effector.
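As a worked illustration of the coordinate conversion and of formula (2), the following sketch computes the pixel-per-centimeter scale implied by a 720 × 540 pixel frame and evaluates the gap between cube and end effector. The orientation of the work area (20 cm along the 720-pixel side, 15 cm along the 540-pixel side) and the sample coordinates are assumptions, not values taken from the experiments.

```python
FRAME_W_PX, FRAME_H_PX = 720, 540    # frame size stated in the text
AREA_W_CM, AREA_H_CM = 20.0, 15.0    # assumed orientation of the work area

def px_per_cm():
    """Scale factors (horizontal, vertical) between pixels and centimeters."""
    return FRAME_W_PX / AREA_W_CM, FRAME_H_PX / AREA_H_CM

def gap_px(x2, x1, w1):
    """Formula (2): distance from the far edge of the cube to the end effector."""
    return x2 - (x1 + w1)

sx, sy = px_per_cm()
gap = gap_px(x2=500, x1=200, w1=72)  # hypothetical pixel coordinates
print(sx, sy, gap, gap / sx)         # scale in px/cm, gap in px and in cm
```

Under these assumptions both axes share the same scale, so a distance in pixels can be divided by a single factor to obtain centimeters.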
3 Results and Discussion
The results of the operation of the cube detection system and the movement of the end effector of the robotic arm were obtained by applying the following test scheme:
• Detection of the position of the cube above the work area.
• Detection of the position of the end effector of the robotic arm.
• Calculation of the distance between the two elements.
• Movement of the end effector from its initial position to the position of the cube.
• Real-time feedback of the position of the end effector.
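The test scheme above amounts to a feedback loop in the image plane. The sketch below shows one possible form of such a loop, assuming a simple proportional correction and an idealised arm that moves exactly as commanded. The paper does not state its control law, so the function `servo_to_target`, the gain and the tolerance are hypothetical.

```python
def servo_to_target(effector, target, gain=0.5, tol=0.5, max_steps=100):
    """Iteratively reduce the image-plane error between effector and target.

    effector, target: (x, y) positions measured in the image.
    Returns the list of effector positions visited (the trajectory).
    """
    x, y = effector
    tx, ty = target
    trajectory = [(x, y)]
    for _ in range(max_steps):
        ex, ey = tx - x, ty - y               # error fed back from the camera
        if (ex * ex + ey * ey) ** 0.5 <= tol:
            break                             # close enough: stop moving
        # Idealised arm (assumption): it moves exactly by the commanded step.
        x, y = x + gain * ex, y + gain * ey
        trajectory.append((x, y))
    return trajectory

path = servo_to_target(effector=(0.0, 0.0), target=(220.0, 150.0))
print(len(path), path[-1])
```

Because each step moves along the current error vector, the simulated trajectory is a straight line from the start point to the target.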
Table 2 shows the (x, y) positions of the cube and the positions reached by the end effector, together with the error generated by each movement. Displacement tests were performed from the same starting point (0, 0) for the end effector while varying the position of the cubes over the work area. Figure 10 shows the trajectories followed for each of the cube positions.
Table 2. Generating error in the displacement of the robotic arm.
Cube position X | Cube position Y | Final effector X | Final effector Y | Error %
220 | 150 | 222.2 | 151.5 | 1
230 | 140 | 234.6 | 142.8 | 2
150 | 100 | 152.25 | 101.5 | 1.5
80 | 120 | 82.4 | 123.6 | 3
60 | 220 | 66 | 242 | 1
45 | 110 | 49.5 | 121 | 1
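The paper does not give the formula behind the error column. A minimal sketch, assuming the error is the relative deviation of the reached coordinate from the cube coordinate expressed in percent, is shown below; the function name is hypothetical.

```python
def percent_error(target, reached):
    """Relative deviation of `reached` from `target`, in percent (assumed definition)."""
    return abs(reached - target) / abs(target) * 100.0

# X coordinates of the first and fourth rows of Table 2.
print(round(percent_error(220, 222.2), 2))
print(round(percent_error(80, 82.4), 2))
```

Under this assumed definition, the first and fourth rows reproduce the tabulated values of 1% and 3%.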
[Figure 10: four panels (P 01, P 02, P 03, P 04), each plotting X against Y.]
Fig. 10. Displacement trajectories of the robotic arm.
4 Conclusions
It was determined that the WLKATA Mirobot F1 robotic arm can perform Cartesian motion operations and manipulate objects using a plastic suction cup with relative ease, because it uses free software and hardware for its manipulation and programming.
On the other hand, the artificial vision system made it possible to determine the coordinates of the cubes located in the work area of 20 × 15 cm wide and high, respectively. When the end effector moved from its initial position (0, 0) to the position of the cubes, the maximum error was 3% and the minimum error was 1%. Applying a combination of IBSV and PBVS control techniques makes it possible to determine with relative accuracy the position of all the elements on the work area. It should be noted that these servo vision control techniques reduce the number of sensors needed to control the movement of the robotic arm. Additionally, it was found that the robotic arm can move along a relatively straight trajectory, which means that the displacement time is minimal. As future work, this research will continue by applying artificial vision techniques and evolutionary algorithms to generate robotic movement bioinspired by animal behavior, building on previous work in [9–11].
References
1. Monteagudo, L., Serrano, L., Hernández, C.: La telemedicina: ¿ciencia o ficción? (2005). http://scielo.isciii.es/scielo.php?pid=S1137-66272005000500002&script=sci_arttext&tlng=pt. Accessed 24 Apr 2020
2. Ishak, M.K., Kit, N.M.: Design and Implementation of robot assisted surgery based on Internet of Things (IoT). In: Proceedings - 2017 International Conference on Advanced Computing and Applications, ACOMP 2017, pp. 65–70, June 2018. https://doi.org/10.1109/ACOMP.2017.20
3. Ollero, A.: Robótica: Manipuladores y Robots Móviles. Barcelona: MARCOMBO (2001). https://books.google.es/books?hl=es&lr=&id=TtMfuy6FNCcC&oi=fnd&pg=PR17&dq=definicion+robots+manipuladores&ots=33HYKYsa5M&sig=0W5Br8oZXX6_wMEsqLDtP9F8XPQ#v=onepage&q=definicionrobots manipuladores&f=false. Accessed 14 Mar 2021
4. Corke, P.: Vision-based control. In: Springer Tracts in Advanced Robotics, vol. 73. Springer, pp. 455–479 (2011). https://doi.org/10.1007/978-3-642-20144-8_15
5. Chaumette, F.: Potential Problems of Stability and Convergence in Image-Based and Position-Based Visual Servoing. Springer, London (1998). https://doi.org/10.1007/bfb0109663
6. Sanderson, A.C., Weiss, L.E.: Adaptive visual servo control of robots. In: Pugh, A. (eds.) Robot Vision. International Trends in Manufacturing Technology. Springer, Heidelberg. https://doi.org/10.1007/978-3-662-09771-7_7
7. Beijing Tsinew Technologies Co. Ltd.: WLKATA Mirobot F1. User Manual. Beijing (2019). https://docplayer.net/185259624-Wlkata-mirobot-f1-user-manual.html. Accessed 14 Mar 2021
8. Viera-Maza, G.: PROCESAMIENTO DE IMÁGENES USANDO OPENCV APLICADO EN RASPBERRY PI PARA LA CLASIFICACIÓN DEL CACAO. Piura (2017)
9. Balarezo-Gallardo, S.-F., Hernández-Riveros, J.A.: Evolutionary parameter estimation of coupled non-linear oscillators. In: Solano, A., Ordoñez, H. (eds.) CCC 2017. CCIS, vol. 735, pp. 457–471. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-66562-7_33
10. Balarezo, S.F.: MODELO HEURÍSTICO PARA REPRODUCCIÓN DE TRAYECTORIAS DE DESPLAZAMIENTO CON UN ROBOT CUADRÚPEDO ARTICULADO (2019)
11. Balarezo, S., Arias, X., Espín, K., Aquino, M., Novillo, G.: Simulation system of a tomato sorting process using artificial vision. In: Botto-Tobar, M., Cruz, H., Díaz Cadena, A., Durakovic, B. (eds.) Emerging Research in Intelligent Systems. CIT 2021. LNNS, vol. 405. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-96043-8_11
Dynamic Analysis and Modeling of DNN-Based Visual Servoing Systems
Petar Durdevic(B) and Daniel Ortiz-Arroyo
Aalborg University, Esbjerg 6700, Denmark
{pdl,doa}@energy.aau.dk
https://vbn.aau.dk/en/persons/125324, https://vbn.aau.dk/en/persons/104617
Abstract. The integration of deep learning and control techniques has created robotic systems capable of implementing visual servoing and navigating autonomously in unknown environments. However, the effect of timing interactions between controllers and deep neural networks has received little attention in the literature. In this paper we describe a novel model that includes the effects that detection loss and inference latency have on the controller of a visual servoing system. To test our model we created a target tracking system consisting of a video camera mounted on a moving platform that tracks objects using deep neural networks.
Keywords: DNN · Visual-servoing · Modeling · Control · Dynamics · Networked control systems
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): SAI 2022, LNNS 507, pp. 855–867, 2022. https://doi.org/10.1007/978-3-031-10464-0_59
In [10] visual servoing is defined as a closed-loop feedback system that controls a robot's movements by processing visual features extracted from camera images. Position-based visual servoing (PBVS) uses 2D image features together with the geometry of the 3D space to control a robot. In contrast, image-based visual servoing (IBVS) uses only the 2D image features. A hybrid system combines the two approaches. The survey on visual servoing in [12] describes a variety of approaches for grasping and planning in robotic manipulators. A more recent survey describes a variety of robotic systems that use classical and deep learning-based techniques to perform visual servoing [18]. One form of visual servoing is the visual tracking of objects with a robot equipped with a video camera. Visual tracking of objects can be done with or without a controller. When the target object is moving, the controller moves the camera mounted on a robot to track the object. In other tracking systems the target object is static but the camera is on a moving robot, which requires a controller to keep the object within its field of view (FOV). Lastly, in other tracking systems a controller is not required since the camera is static at a fixed position and the
target object is moving. In this last case, objects will be tracked only as long as they remain within the static FOV of the camera. State-of-the-art computer vision systems rely on deep learning techniques to detect objects with high accuracy. Deep learning is a supervised machine learning technique whose goal is to minimize a loss function, as expressed by the following equation:
$$\text{minimize} \quad \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f_1(f_2(\ldots f_m(w, t)))\big) \qquad (1)$$
where $L$ is the loss function, $y_i$ are the labeled test data of $n$ images and $f_1(f_2(\ldots f_m(w, t)))$ is a hierarchical deep neural network (DNN) of $m$ layers. The DNN is trained on a dataset of size $|t|$, with the goal of optimizing the internal weights $w$ of the neurons and minimizing the training error with the backpropagation algorithm [6,14,15]. The loss function used in a deep neural network depends on the task at hand and has a significant effect on performance. Some of the loss functions commonly used for regression tasks are the mean square error, cross-entropy, and Huber loss. For classification tasks, the categorical cross-entropy, hinge loss or Kullback-Leibler divergence are commonly used. In the case of object detection, where bounding box regression is required, the Intersection over Union (IoU) metric is commonly used as the loss function. The widespread use of deep learning techniques in visual servoing and target tracking systems is due to their significantly better performance compared to classical machine learning techniques. This is demonstrated by the fact that Deep Neural Network (DNN) models have consistently won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in the categories of object detection and recognition [22] since 2012. This paper presents the modeling and analysis of a visual servoing system for tracking objects that includes a DNN sensor in its feedback control loop. We call our sensor the Deep Neural Network Visual Servoing (DVS) sensor. Our experimental setup is an object tracking system that uses a servo mechanism to move the camera and keep the object within the field of vision of the camera. The effect of inference latency and detection loss on the system performance is studied and a dynamic model is defined. In addition, we discuss some potential control strategies. To our knowledge, no other work has performed this type of analysis on DNN-based visual servoing or tracking systems.
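As a concrete illustration of the Intersection over Union (IoU) metric mentioned above for bounding-box regression, the following sketch computes IoU for two axis-aligned boxes. The box layout (xmin, ymin, xmax, ymax) and the function name are choices made for this example, and the usual loss form 1 − IoU is only one of several variants in use.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)       # hypothetical predicted box
ground_truth = (1, 1, 3, 3)    # hypothetical labeled box
score = iou(predicted, ground_truth)
print(score, 1.0 - score)      # IoU and the corresponding 1 - IoU loss
```

A perfect prediction gives an IoU of 1 (loss 0), while disjoint boxes give an IoU of 0 (loss 1).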
The paper is organized as follows: Sect. 2 describes relevant related work on the topic. Section 3 describes the two main factors that affect the timing interaction between the DNN and the controller in our system. Section 4 describes the experimental setup used, Sect. 5 describes the model, Sect. 6 discusses some potential control solutions and Sect. 7 concludes the paper.
2 Related Work
In recent years, many visual servoing and tracking systems based on deep neural networks and other machine learning techniques have been proposed.
An example is [11], in which the authors use the Interacting Multiple Models (IMM) algorithm to fuse several linear dynamic models of the target object. Additionally, the gain of an Extended Kalman Filter is automatically adjusted to minimize tracking errors. In a similar work [21], a Kalman filter is also used for object tracking, but in this case its noise covariance matrices are fine-tuned with the Particle Swarm Optimization algorithm. Lastly, a face tracking algorithm is described in [29], combining a Kalman filter used as a visual state estimator with an echo state network-based self-tuning algorithm to create a robust face tracking system. In all the previous visual tracking systems the camera is static. In contrast, visual servoing tracking systems include a controller that moves the robotic platform on which the camera is mounted, to follow or reach an object. Examples of visual servoing tracking systems are drones equipped with DNNs that track and follow forest trails autonomously, as described in [5]. Other examples of DNN-based visual servoing systems are drones capable of detecting and tracking wind turbines in order to navigate towards them, such as the system in [20]. Another recent example is [4], where moving objects are tracked and followed by a ground robot. In this last case, the tracking system employs MobileNet [9], a DNN that detects objects, based on the Single Shot Multibox Detector (SSD) [16] architecture, and a PI controller that controls the robot's movements to keep the target object within the FOV of the camera and at a safe distance. Lastly, using a completely different approach, end-to-end systems replace the controller entirely with a DNN that has learned, through supervised machine learning, to map image pixels directly to steering wheel commands, as described in [2]. A similar system, Dronet, described in [17], is capable of learning the steering angle and a collision probability by using training data collected from driving cars and bicycles.
3 Effect of DNNs on Controllers
In all the applications of visual servoing and visual tracking described in the previous section, the DNNs continuously perform inferences in real time. However, inference is a computationally expensive task, and DNNs may therefore introduce delays in the feedback control loop that can make the system unstable. The delay can be substantially reduced by optimizing the DNN's architecture or by using faster parallel processing hardware. For instance, optimized DNNs such as MobileNet V3-Small have a latency of 43 ms with an accuracy of 16.1 mAP [8] on mobile devices. TPUs and Jetson Nano Graphical Processing Unit (GPU) boards can provide latencies from 2.9 to 18 ms when performing inferences with MobileNet V2 [7]. In spite of these low delays, the latency due to inferences performed in real time may still be significant. This is because of the limited computing power of the host processor which, in addition to running the DNN, must manage the multiple tasks in the kernel's scheduler and the communication used to send and receive the commands that move and localize the robot at a certain position.
The effect of latencies in a DNN-based visual servoing system is similar to the effect that random delays have on networked control systems. In the latter, delays occur when the sensors send their data to the controller through a communication network of limited bandwidth, as described in [26] and [30]. One possible solution to the time delay problem in networked control systems is to use buffers; however, when the delay is large the control system may be responding to past events, causing instability [19]. Latencies are not the only problem in DNN-based visual servoing systems. DNNs may also occasionally fail to detect objects. This may happen under poor lighting conditions or when the robot or target object is moving fast. In these conditions, the DNN produces no detection output. We call this effect detection loss. One way to alleviate this problem is to use Kalman filters to estimate the most likely position of the tracked object when a detection loss occurs. However, the limitation of this approach is that if detection loss is significant, the Kalman filter may be unable to track the object. The detection loss effect in DNNs is similar to random packet loss in networked control systems. In these systems, data packets sent by the sensors or the controller may be lost in the network due to noise or traffic congestion in the routers. The network protocols used, either TCP or UDP, have different effects on the control system's behavior, since lost packets may or may not be sent again. Including the effect of the DNN's latency and detection loss in the model of a visual servoing system is a challenging problem, since it makes the system no longer time-invariant.
4 Experimental Setup
We created an experimental visual servoing system that includes our DVS sensor and a control system. Its function is to detect and keep track of an object, so that when the servo motor that has the camera mounted moves, the camera rotates to keep track of an object. The object we have chosen for our experiments is a human face. Figure 1 shows the main components of our experimental setup.
[Figure 1 diagram: Camera, Servo, NCS, RPI, Teensy and Compass blocks, with links labelled PWM, USB3.1, USB, I2C and UART.]
Fig. 1. Experimental setup of the visual servoing system
As the figure shows, an Intel Neural Compute Stick 2 (NCS) is attached to a Raspberry Pi v3 (RPI) board through a USB 3.1 port. The Logitech 920 video camera sends images to the RPI, which are processed by the DNN running on the NCS to detect objects. The Teensy v3.6 board is connected through the I2C bus to a servo motor (HD-1800A - analog servo v.2) with the video camera attached. As the servo and the camera move, the NCS detects changes in the position of the target object within the image and sends the updated position to the controller running on the RPI. The controller in turn sends a PWM signal to the servo motor attached to the Teensy board, which rotates the camera in the x coordinate by an angle ψ to keep the target object within the field of vision of the camera. The IMU BDO055 compass measures the three-axis orientation of the servo and camera. This data is used to compute the rotation angle of the camera and compare it with the angle calculated from the DVS sensor's output data, as described in the next section.
4.1 DNN and Communication Protocol
In our experiments we used a pretrained DNN that has MobileNet V1 [9] as its backbone. MobileNet was designed to have a small size and low latency. Its low latency is mainly achieved by using more efficient depthwise convolutions rather than standard convolutions in the first layers of the network. In contrast to standard convolutions, which apply the convolution filters and combine the inputs in a single layer, depthwise convolutions perform the filtering and the combination in two layers. First, a single filter is applied to each input channel, and then a pointwise convolution combines the outputs. MobileNet was pretrained to detect faces in images, producing as its output the position of the bounding box around a face and a confidence value. The image detection output of the DVS has the following data format:
$$DVS_{out} = [i_{id}, l, c, x_{min}, y_{min}, x_{max}, y_{max}]$$
where $i_{id}$ is the image id, $l$ is the label, $c$ is the confidence value, $x_{min}, y_{min}$ are the coordinates of the top-left corner of the bounding box, and $x_{max}, y_{max}$ are the coordinates of its bottom-right corner. To communicate data from the MobileNet DNN, running on the Intel NCS attached to the RPI board, to the Teensy board, we created a simple communication protocol. It sends data packets containing the bounding box data and the detection latency of the network trained to detect faces, for detections that achieve a confidence value larger than a threshold set arbitrarily to 0.7. The data packets have the following format:
$$[\Delta t_{dp}, DVS_{out}]$$
where $\Delta t_{dp}$ is the difference in arrival time of the detection data at the controller node and $DVS_{out}$ is the data output of the network. The data packets are sent to the Teensy board through the UART interface when a face is detected in an image.
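The packet format described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the comma-separated ASCII encoding and the sample detection values are hypothetical, since the text specifies only the field order and the 0.7 confidence threshold.

```python
CONF_THRESHOLD = 0.7  # confidence threshold stated in the text

def make_packet(dt_dp, detection):
    """Build [dt_dp, i_id, l, c, xmin, ymin, xmax, ymax], or None if rejected.

    detection: (image_id, label, confidence, xmin, ymin, xmax, ymax).
    """
    confidence = detection[2]
    if confidence <= CONF_THRESHOLD:
        return None               # treated as "no detection": nothing is sent
    return [dt_dp] + list(detection)

def encode(packet):
    """Serialise a packet as one comma-separated ASCII line (assumed wire format)."""
    return (",".join(str(v) for v in packet) + "\n").encode("ascii")

det = (0, 1, 0.93, 0.31, 0.22, 0.58, 0.71)   # hypothetical detection
pkt = make_packet(0.045, det)
print(encode(pkt))
```

Detections at or below the threshold produce no packet at all, which is what the controller later observes as a detection loss.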
P. Durdevic and D. Ortiz-Arroyo
With the $DVS_{out}$ values we compute the width and height $(w, h)$ and the center coordinates $[u, v]^T$ of the bounding box:
$$ w = x_{max} - x_{min}, \qquad h = y_{max} - y_{min} \tag{2} $$
$$ u = x_{min} + \frac{w}{2}, \qquad v = y_{min} + \frac{h}{2} \tag{3} $$
Using the center of the bounding box, $u$, from Eq. 3, the rotation angle for the camera in the x direction is calculated in Eq. 4. This angle is used to keep track of an object located within a bounding box.
$$ \psi_{DVS} = \operatorname{atan}\left(\frac{u - u_c}{f}\right) \tag{4} $$
where $u_c$ is the center of the image frame and $f$ is the focal length. To calculate the rotation angle we need the camera's intrinsic focal length parameter. The Logitech 920 video camera used in our experiments has the intrinsic parameters shown in Table 1.

Table 1. Camera's Intrinsic Parameters

Intrinsic parameter   Value        Unit
Resolution            [672, 384]   pixels
Focal length          3.67         mm
Pixel size            3.98         µm
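The bounding-box geometry and the rotation angle can be sketched as follows, taking the width as $x_{max} - x_{min}$ so that it is positive. The code is only an illustration under two assumptions that are not stated in the text: that the focal length is converted to pixels by dividing it by the pixel size from Table 1, and that $u_c$ is half the image width. The function names are hypothetical.

```python
import math

FOCAL_LENGTH_MM = 3.67   # Table 1
PIXEL_SIZE_UM = 3.98     # Table 1
IMAGE_WIDTH_PX = 672     # Table 1


def focal_length_px() -> float:
    """Focal length in pixels (assumed conversion: f [um] / pixel size [um])."""
    return FOCAL_LENGTH_MM * 1000.0 / PIXEL_SIZE_UM


def bbox_center(x_min: float, y_min: float, x_max: float, y_max: float):
    """Center [u, v] of the bounding box from its two corners."""
    w = x_max - x_min
    h = y_max - y_min
    return x_min + w / 2.0, y_min + h / 2.0


def rotation_angle(u: float, u_c: float = IMAGE_WIDTH_PX / 2.0,
                   f: float = None) -> float:
    """Rotation angle psi_DVS = atan((u - u_c) / f), in radians."""
    if f is None:
        f = focal_length_px()
    return math.atan((u - u_c) / f)


if __name__ == "__main__":
    u, v = bbox_center(100, 50, 200, 150)
    print(u, v, math.degrees(rotation_angle(u)))
```

A target centered in the image gives a zero angle; a target to the left of the center gives a negative angle under this sign convention.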
5 Model
A visual servo system can be represented by the block diagram in Fig. 2, where the output is generated by a DVS sensor. The sensor is perturbed by noise $v_k$, which is added to the DVS signal $\tilde{y}_k$ that is used as the feedback signal to a controller. This signal may suffer from detection losses, an effect similar to the packet dropout $\gamma_k$ in networked control systems. Additionally, the signal may arrive with a delay $\tau_k$. The controller in this case can be of any type. In the next sections, we discuss how to model the detection loss and the delay.
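Dynamic Analysis and Modeling of DNN-Based Visual Servoing Systems
[Figure: closed-loop block diagram. Blocks: Controller, Actuator, Plant, DVS, Delay $\tau_k$. Signals: $y_k^{sp}$, $e_k$, $u_k^c$, $u_k^a$, $w_k$, $v_k$, $y_k$ (output), $\hat{y}_k$, $\tilde{y}_k$. A switch selects between $\gamma_k = 0$ and $\gamma_k = 1$.]
Fig. 2. Block diagram of the considered system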
5.1 Definition of the System with Detection Loss
A stochastic plant can be described as a discrete-time linear dynamic system by Eq. 5 [26]:
$$ x_{k+1} = A x_k + B u_k + w_k, \qquad y_k = \gamma_k (C x_k + v_k) \tag{5} $$
and the controller's output in discrete time is
$$ u_k = -K x_k \tag{6} $$
where $x_k \in \mathbb{R}^n$ is the state vector, $u_k \in \mathbb{R}^m$ is the control input, $w_k$ is the white noise process, $v_k$ is the measurement error, and $K$ is the control gain. $\gamma_k$ is formally known as the packet loss, as described in [24–28], and is defined as a Bernoulli random variable. In our case it represents the detection loss, and it is defined in Eq. 7:
$$ \gamma_k = \begin{cases} 1 & \text{if detection data arrives before or at time } t,\ t \le k, \\ 0 & \text{otherwise} \end{cases} \tag{7} $$
where the arrival time is defined as $t_k$ and $k$ is the sample number with sampling period $t_s$. Thus, for every detection the DVS sensor sends a packet $y_k$. That is:
$$ y_k = \begin{cases} C x_k + v_k & \text{if the object is detected} \\ 0 & \text{otherwise} \end{cases} \tag{8} $$
This is the main difference from the formulation in [26], where:
$$ y_k = \begin{cases} C x_k + v_k & \text{if the packet is delivered} \\ v_k & \text{otherwise} \end{cases} \tag{9} $$
i.e. if the packet is not delivered the output is pure noise.
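To make the difference between the two measurement models concrete, the following sketch simulates a scalar instance of the plant with a Bernoulli detection variable and records the output under both formulations: zero on a lost detection (DVS case) and pure noise on a lost packet (formulation of [26]). All numerical values (`a`, `b`, `c`, `K`, the noise levels and the detection probability) are illustrative assumptions, not values from the experiments.

```python
import random


def simulate(n_steps: int, p_detect: float, a: float = 1.0, b: float = 1.0,
             c: float = 1.0, K: float = 0.5, sigma_w: float = 0.01,
             sigma_v: float = 0.05, x0: float = 1.0, seed: int = 0):
    """Simulate x[k+1] = a x[k] + b u[k] + w[k] with u[k] = -K x[k].

    Returns a list of tuples (k, gamma, y_dvs, y_net), where
    y_dvs = gamma * (c x + v)  is the DVS measurement (zero on loss) and
    y_net = c x + v or v       is the networked-control measurement of [26].
    """
    rng = random.Random(seed)
    x = x0
    history = []
    for k in range(n_steps):
        gamma = 1 if rng.random() < p_detect else 0   # Bernoulli detection
        v = rng.gauss(0.0, sigma_v)                   # measurement noise
        y_dvs = gamma * (c * x + v)
        y_net = c * x + v if gamma == 1 else v
        u = -K * x                                    # state feedback
        w = rng.gauss(0.0, sigma_w)                   # process noise
        x = a * x + b * u + w
        history.append((k, gamma, y_dvs, y_net))
    return history


if __name__ == "__main__":
    for k, gamma, y_dvs, y_net in simulate(10, p_detect=0.7):
        print(k, gamma, round(y_dvs, 4), round(y_net, 4))
```

With `p_detect=1.0` the two outputs coincide at every step; they differ only on the samples where `gamma` is zero.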
The detection $D$ is defined as:
$$ D = \begin{cases} 1 & \text{if } p_d > \zeta \\ 0 & \text{otherwise} \end{cases} \tag{10} $$
where $p_d$ is the probability of detecting an object and $\zeta$ is a threshold. The detection success, i.e. $\gamma_k = 1$, depends on the DNN's inference process, which is difficult to model as it depends on a multitude of identifiable and unidentifiable sources. Some of these are: the type of the DNN, the amount of data used for training, the size of the object within an image, the current lighting conditions, the use of acceleration hardware, and the amount of memory available in the computing device, among others.
Assumption 1. If the object cannot be detected for $k = \{t, \infty\}$ by the DNN, we will have the following condition: $\gamma_k = 0$, for $k = \{t, \infty\}$.
Typically this is considered a sensor failure, but in the case of the DVS it represents the inability of the DNN to detect an object even though the system continues working. The DVS's stochastic and dynamic properties were investigated using our experimental set-up, described in Sect. 4. The experimental results are shown in Fig. 3. In this figure, the top plot shows the reference signal for the servo [u], the angle of the camera measured with the IMU [ψ], and the object location in the image plane measured by the DNN [DNN]; the data are normalized. The middle plot shows the computed translational velocity of the object relative to the camera. The bottom plot shows the detection data arrival time difference at the controller node [$\Delta t_{dp}$]. From the results in Fig. 3, we can observe that detection loss occurs more frequently as the object's relative velocity increases and the object thus becomes more difficult to detect. The unknown properties of $\gamma$ represent a great challenge to the feedback control structure, as the loss of the feedback signal might bring the system to instability.
5.2 Delay
In addition to the detection loss, the system shown in Fig. 2 suffers from a detection delay. The detection delay at time $k$, $\tau_k$, represents the delay or latency of the DVS. $\tau_k$ comprises multiple delay sources, the main ones being:
$$ \tau_k = \tau_{os} + \tau_{if} + \tau_{cp} \tag{11} $$
where $\tau_{os}$ is the delay caused by operating system scheduling, $\tau_{if}$ is the delay caused by the inference process within the DNN, and $\tau_{cp}$ is the delay caused by the communication protocol. Thus $\tau_k$ represents the time from when the object is observed to the time when its detection is output by the DVS, and consequently it is analogous to the bandwidth of the DVS. In our experimental set-up we have observed that the following relationship holds:
$$ \tau_{if} > \tau_{os} + \tau_{cp} \tag{12} $$
meaning that the inference time is the largest latency. This relationship is highly dependent on the network type and size, and on the hardware and communication protocol. A more powerful tensor processing unit (TPU) could potentially invert this relationship.

Fig. 3. Experimental results, analysis of the DVS sensor.

Another important aspect is that the total delay $\tau_k$ depends on the detection loss $\gamma_k$, similarly to what was stated in [24], through the following relationship:
$$ \tau_k = \begin{cases} \infty & \text{if } \gamma_k = 0 \\ \tau_k & \text{otherwise} \end{cases} \tag{13} $$
That is, if a detection is lost at time $t$, we do not expect to receive that detection data in the future. Thus the measurement model is defined as:
$$ y_k = \gamma_k C_k x_k + v_k \tag{14} $$
where $v_k \sim \mathcal{N}(0, Q_k)$ is the measurement error with covariance $Q_k$. Now we can formulate the delayed output of Eq. 5, following [13], as:
$$ y_k^* = \gamma_k C_s^* x_s + v_k^* \tag{15} $$
where $s = k - M$ and $M$ is the number of delayed samples. In comparison to a typical sensor used in robot feedback control, such as a gyroscope, which has a bandwidth of >500 Hz [3], the bandwidth of a DVS sensor depends on several factors, as mentioned earlier. In the current work we analyzed the bandwidth of our experimental set-up by subjecting the sensor to a pseudo-random amplitude step input (see Fig. 4), in order to analyze the response and check whether the delay is time varying. In this case we measured an average delay of $\bar{\tau}_k = -0.71776$ s, which equates to a bandwidth of 1.3931 Hz. This delay can be a significant factor in typical robotic systems, as the bandwidth of the sensor is relatively low in comparison to the closed-loop system's bandwidth.
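One way such a delay could be estimated from a recorded input and output is by searching for the lag that maximizes their normalized cross-correlation, in the spirit of the approach of [1]. The sketch below is a plain-Python illustration and not the procedure used in the experiments; the function names are hypothetical, and the bandwidth is taken as the reciprocal of the delay magnitude, which is an assumption consistent with the figures quoted above.

```python
import math
from typing import Sequence


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Zero-mean normalized correlation of two equally long sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = math.sqrt(sum((p - mean_a) ** 2 for p in a)
                    * sum((q - mean_b) ** 2 for q in b))
    return num / den if den > 0.0 else 0.0


def estimate_delay_samples(x: Sequence[float], z: Sequence[float],
                           max_lag: int) -> int:
    """Lag M (in samples) at which z best matches a delayed copy of x."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        n = min(len(x), len(z) - lag)      # overlapping part at this lag
        if n < 2:
            break
        score = normalized_correlation(x[:n], z[lag:lag + n])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def bandwidth_hz(delay_s: float) -> float:
    """Bandwidth taken as 1 / |delay| (assumed relationship)."""
    return 1.0 / abs(delay_s)


if __name__ == "__main__":
    x = [0, 0, 1, 1, 1, 3, 3, 2, 2, 2, 0, 0, 4, 4, 4, 1, 1, 1, 1, 2]
    z = [0, 0, 0] + x[:-3]                 # x delayed by three samples
    print(estimate_delay_samples(x, z, max_lag=8))
    print(bandwidth_hz(-0.71776))
```

Multiplying the estimated lag by the sampling period $t_s$ gives the delay in seconds, and the lag itself can serve as the number of delayed samples $M$.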
6 Discussion
It is well known that the performance and robust stability of a control system change significantly when the plant includes a time delay [23]. As shown earlier, detection loss adds a significant additional delay to the system, which can be even larger than $\tau_k$ if no object is detected for long periods of time. To ensure that the system behaves desirably, the following assumption can be made:
Fig. 4. Experimental data, the DVS sensor subjected to a pseudo-random amplitude step input.
Assumption 2. The closed-loop system bandwidth $\omega_{CL}$ satisfies the inequality
$$ \omega_{CL} < \frac{1}{\max(\tau)} $$
Due to Eq. 13, Assumption 2 can be difficult to fulfill, as the system might experience very long delays due to detection loss. This can be addressed in several ways. One possible solution is to use a low pass filter (LPF), as is commonly done to handle disturbances [23]; another is to change the physical properties of the system, e.g. the inertia. However, changing the inertia in real time is in most cases not practically feasible. On the other hand, a dynamic LPF could easily be implemented, where $\tau$ could be estimated using normalized cross-correlation between $z$ and $x$ [1].

Switching Controller Rule

One workaround to manage the detection losses could be to introduce a hybrid controller structure. According to Assumption 1, we could under certain circumstances have $\gamma_k = 0$ for $k = \{t, \infty\}$; thus we define $\beta$ as
$$ \beta = \begin{cases} 1 & \text{if } \gamma_k < d_{max} \\ 0 & \text{otherwise} \end{cases} \tag{16} $$
where
$$ u = \beta K_1 x + (1 - \beta) K_2 x \tag{17} $$
where $d_{max} \in \mathbb{Z} = \{0, 1, 2, 3, \dots\}$ is the maximal number of detection losses and $\beta$ is the switching rule. If $\beta = 1$, one controller design option could be to have $K_1 = 0$, i.e. to stay still until the DVS sensor has regained tracking of the object. This evidently results in the system reducing its speed, and a reduced speed increases the probability of tracking the object.
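A scalar sketch of such a switching rule is given below. It rests on an interpretation that the text does not state explicitly: the quantity compared with $d_{max}$ is taken to be the number of consecutive detection losses, counted from the sequence of $\gamma_k$ values. The class name, the gain values, and the choice of which gain is set to zero are illustrative assumptions only.

```python
class SwitchingController:
    """Hybrid controller u = beta*K1*x + (1 - beta)*K2*x (scalar sketch).

    Assumption: beta = 1 while the number of consecutive detection losses
    is below d_max, and beta = 0 otherwise.
    """

    def __init__(self, K1: float, K2: float, d_max: int):
        self.K1 = K1
        self.K2 = K2
        self.d_max = d_max
        self.consecutive_losses = 0

    def update(self, x: float, gamma: int) -> float:
        """Return the control signal for state x and detection flag gamma."""
        if gamma == 1:
            self.consecutive_losses = 0          # detection received
        else:
            self.consecutive_losses += 1         # detection lost
        beta = 1 if self.consecutive_losses < self.d_max else 0
        return beta * self.K1 * x + (1 - beta) * self.K2 * x


if __name__ == "__main__":
    # Illustrative choice: one gain tracks, the other holds still (gain 0).
    controller = SwitchingController(K1=-0.5, K2=0.0, d_max=2)
    for gamma in [1, 0, 0, 0, 1]:
        print(gamma, controller.update(1.0, gamma))
```

Resetting the loss counter on every successful detection lets the controller return to its nominal gain as soon as tracking is regained.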
7 Conclusion
In this work we analysed, through a series of experiments, the effect that the dynamic behaviour of a DVS may have on a control system. We then formulated a model for this type of visual servoing system, based on the principles of networked control systems, which models the random delay and detection loss associated with the DVS. One of the key differences between networked control systems and the DVS is that detection loss is different in nature from packet loss: in the case of a DVS, if a detection loss occurs, the data will be lost forever, whereas in a networked control system a packet drop will in most cases result in the data arriving later in the buffer. In order to use the DVS signal as feedback to a controller, an estimator would be needed to predict the true state during loss of data. This, and how to implement it in our prototype system, will be part of our future work. Lastly, we discussed some possible solutions to this problem and some potential control system issues.
References
1. Finddelay: Estimate delay(s) between signals. https://se.mathworks.com/help/signal/ref/finddelay.htm. Accessed 24 Nov 2021
2. Bojarski, M., et al.: End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
3. Geiger, W., et al.: MEMS IMU for AHRS applications. In: 2008 IEEE/ION Position, Location and Navigation Symposium, pp. 225–231. IEEE (2008)
4. Gemerek, J., Ferrari, S., Wang, B.H., Campbell, M.E.: Video-guided camera control for target tracking and following. IFAC-PapersOnLine 51(34), 176–183 (2019). 2nd IFAC Conference on Cyber-Physical and Human Systems CPHS 2018
5. Giusti, A., et al.: A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robot. Autom. Lett. 1(2), 661–667 (2016)
6. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
7. Hadidi, R., Cao, J., Xie, Y., Asgari, B., Krishna, T., Kim, H.: Characterizing the deployment of deep neural networks on commercial edge devices. In: 2019 IEEE International Symposium on Workload Characterization (IISWC), pp. 35–48. IEEE (2019)
8. Howard, A., et al.: Searching for MobileNetV3. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324 (2019)
9. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017)
10. Hutchinson, S., Hager, G.D., Corke, P.I.: A tutorial on visual servo control. IEEE Trans. Robot. Autom. 12(5), 651–670 (1996)
11. Jia, Z., Balasuriya, A., Challa, S.: Vision based data fusion for autonomous vehicles target tracking using interacting multiple dynamic models. Comput. Vis. Image Underst. 109(1), 1–21 (2008)
12. Kragic, D., Christensen, H.I., et al.: Survey on visual servoing for manipulation. Computational Vision and Active Perception Laboratory, Fiskartorpsv, 15:2002 (2002)
13. Larsen, T.D., Andersen, N.A., Ravn, O., Poulsen, N.K.: Incorporation of time delayed measurements in a discrete-time Kalman filter. In: Proceedings of the 37th IEEE Conference on Decision and Control (Cat. No. 98CH36171), vol. 4, pp. 3972–3977. IEEE (1998)
14. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
15. LeCun, Y., et al.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989). https://doi.org/10.1162/neco.1989.1.4.541
16. Liu, W., et al.: SSD: single shot multibox detector. CoRR, abs/1512.02325 (2015)
17. Loquercio, A., Maqueda, A.I., del Blanco, C.R., Scaramuzza, D.: DroNet: learning to fly by driving. IEEE Robot. Autom. Lett. 3(2), 1088–1095 (2018)
18. Machkour, Z., Ortiz-Arroyo, D., Durdevic, P.: Classical and deep learning based visual servoing systems: a survey on state of the art. J. Intell. Robot. Syst. 104(1), 1–27 (2022). https://doi.org/10.1007/s10846-021-01540-w
19. Nilsson, J., et al.: Real-Time Control Systems with Delays (1998)
20. Petar, D., Ortiz-Arroyo, D., Li, S., Yang, Z.: Vision aided navigation of a quadrotor for autonomous wind-farm inspection. IFAC-PapersOnLine 52, 61–66 (2019)
21. Ramakoti, N., Vinay, A., Jatoth, R.K.: Particle swarm optimization aided kalman filter for object tracking. In: 2009 International Conference on Advances in Computing, Control, and Telecommunication Technologies, pp. 531–533 (2009)
22. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y
23. Sariyildiz, E., Oboe, R., Ohnishi, K.: Disturbance observer-based robust control and its applications: 35th anniversary overview. IEEE Trans. Industr. Electron. 67(3), 2042–2053 (2019)
24. Schenato, L.: Kalman filtering for networked control systems with random delay and packet loss. In: Conference of Mathematical Theory of Networks and Systems. MTNS 2006. Citeseer (2006)
25. Schenato, L.: Optimal estimation in networked control systems subject to random delay and packet drop. IEEE Trans. Autom. Control 53(5), 1311–1317 (2008)
26. Schenato, L., Sinopoli, B., Franceschetti, M., Poolla, K., Sastry, S.S.: Foundations of control and estimation over lossy networks. Proc. IEEE 95(1), 163–187 (2007)
27. Shi, L., Epstein, M., Murray, R.M.: Kalman filtering over a packet-dropping network: a probabilistic perspective. IEEE Trans. Autom. Control 55(3), 594–604 (2010)
28. Sinopoli, B., Schenato, L., Franceschetti, M., Poolla, K., Jordan, M.I., Sastry, S.S.: Kalman filtering with intermittent observations. IEEE Trans. Autom. Control 49(9), 1453–1464 (2004). https://doi.org/10.1109/TAC.2004.834121
29. Tsai, C.-Y., Dutoit, X., Song, K.-T., Van Brussel, H., Nuttin, M.: Robust face tracking control of a mobile robot using self-tuning kalman filter and echo state network. Asian J. Control 12(4), 488–509 (2010)
30. Zhang, W., Branicky, M.S., Phillips, S.M.: Stability of networked control systems. IEEE Control. Syst. 21(1), 84–99 (2001). https://doi.org/10.1109/37.898794
Implementation of a Balanced and Fluid Movement Six-Legged Spider Robot
Mustafa Ayad, Kwabena Boating, and Waley Zhang
State University of New York at Oswego, Oswego, NY 13126, USA
[email protected]
Abstract. Natural disasters, attributed to climate change, are becoming more prevalent in this rapidly developing era. With the dramatic increase in natural disasters brought on by human innovation, there will be a correlated increase in casualties in human society. With the rapid development of technology, we believe that constructing an autonomous or software-controlled robotic spider, whose functionalities include avoiding obstacles and navigating rough terrain, can help address this issue. Hexapod spider robots also benefit from agile movement in natural surroundings and a low impact on the terrain. A mechanical spider can navigate and reach places that humans cannot, essentially acting as a rescue robot. For example, if an earthquake leaves a tiny space where a human cannot fit, the robot will be able to enter it. The same applies to other natural disasters, such as hurricanes and volcanic eruptions, and to exploring war zones. Most hexapod legged robots have been used to explore hostile and remote environments such as the seabed, nuclear power stations, and even other planets. This paper's primary goal is to implement a six-legged spider robot that moves in a fluid, balanced manner and can navigate effectively.
Keywords: Fluid movement · Leg control servo
1 Introduction and Implementation Description
With today's technological advancements, implementing a robot to perform tasks that humans are incapable of accomplishing, such as search and rescue, is essential for safety reasons. A hexapod is helpful in such scenarios because the structure and geometry of a six-legged robotic spider make it comparably more stable than a wheeled robot when navigating rough terrain. Therefore, our team's primary task was to implement a hexapod spider capable of maneuvering fluidly through rough and challenging terrain. By fluidity, we mean that our robotic spider can move and navigate effortlessly and efficiently. We also mean that the legs do not collide with one another and that the walking movement is clean, closely mimicking the gait of an actual spider [1]. We achieved this fluid movement by considering the robot's kinematics, gait characteristics, mechanical assembly and leg formation, drive and actuating mechanism, motion conditions, and control system.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 868–881, 2022. https://doi.org/10.1007/978-3-031-10464-0_60
Fig. 1. Step by step representation of the alternating tripod gait.
Figure 1 shows the gait pattern, which is essential for the robot's movement. In this pattern, called the tripod gait, three legs are in the swing phase while the other three legs are in the stance phase. The motion condition issue is addressed by placing the six legs at points on each side with the center point in the middle. The joint rotation angle is controlled by sending Pulse Width Modulation (PWM) signals to the servo motor and adjusting their duty cycle to set the desired position. Moreover, we considered the proportionality between the velocity of the movement and the stride frequency using the Froude equation. Therefore, to achieve a higher movement velocity, we had to increase the stride frequency, with the primary goal of keeping balance and moving the spider robot fluidly [1]. We developed an algorithm based on all these specifications to achieve fluidity.
The gait walking characteristics are essential to achieving balance. There are various types of gait walking patterns, but we focused on the tripod gait, as research has shown that this walking pattern offers better stability and fluidity in rough terrain. In the walking pattern we employed, three legs are in the swing phase while the other three are in the stance phase. In Fig. 1, legs two, three, and six are in the swing phase while the other legs are in the stance phase. Like the human gait cycle, which has two sets of motion to move forward, our spider robot has two sets. For humans, one motion consists of swinging the right leg and left arm forward while the left leg and right arm are behind; the other motion is the opposite, with the left leg moving forward. The movement of the arms mainly serves to keep the center of gravity centered.
This principle is applied to our robotic spider, where we mimicked the locomotion of a real spider: the front and rear right legs and the middle left leg move forward or backward together to create motion. The middle leg acts as a stabilizer, like the human arms. This concept greatly aided us in the development of the tripod gait. Viewed from above, the tripod gait creates a triangular movement pattern in every completed motion. Figure 2 shows the block diagram of the hexapod spider robot.

Fig. 2. Block diagram of the hexapod robotic spider.

The center of mass of any symmetrical object lies at the centroid of the object. Figure 3 shows the triangular support polygon made by the tripod gait. As our design incorporates a hexagonal chassis and the legs are located symmetrically from end to end, the centroid is located at the very center of the robot (1). For our hexapod to be statically and dynamically stable, the center of mass must be located within the support polygon shown in red in Fig. 3. Otherwise, the robot would topple over. If legs one, three, and five are in motion and legs two, four, and six are static, the static legs create a support polygon, allowing the hexapod to remain stable while the other three legs are in motion. The tripod gait was chosen as the main gait because the triangular support polygon made by the static legs is inherently stable; when pressure is applied on one of the vertices of the triangle, the force is distributed equally between the other two vertices.
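The static stability condition described above can be checked with a simple point-in-triangle test, sketched below. The leg numbering (legs 1 to 6 placed counter-clockwise on a regular hexagon), the unit radius, and the function names are assumptions made for illustration and are not taken from the robot's design files.

```python
import math

TRIPOD_A = (1, 3, 5)   # one alternating tripod (assumed numbering)
TRIPOD_B = (2, 4, 6)   # the other tripod


def hexagon_feet(radius: float = 1.0) -> dict:
    """Foot positions of legs 1..6 on a regular hexagon around the origin."""
    return {leg: (radius * math.cos(math.radians(60 * (leg - 1))),
                  radius * math.sin(math.radians(60 * (leg - 1))))
            for leg in range(1, 7)}


def _cross(o, a, b) -> float:
    """z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def inside_triangle(p, a, b, c) -> bool:
    """True if point p lies inside or on the edge of triangle abc."""
    d1, d2, d3 = _cross(a, b, p), _cross(b, c, p), _cross(c, a, p)
    has_negative = d1 < 0 or d2 < 0 or d3 < 0
    has_positive = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_negative and has_positive)


def statically_stable(feet: dict, stance_legs, com=(0.0, 0.0)) -> bool:
    """Check that the center of mass projects inside the support triangle."""
    a, b, c = (feet[leg] for leg in stance_legs)
    return inside_triangle(com, a, b, c)


if __name__ == "__main__":
    feet = hexagon_feet()
    print(statically_stable(feet, TRIPOD_A), statically_stable(feet, TRIPOD_B))
```

For a symmetric body with the center of mass at the centroid, both tripods pass the test, which matches the argument for choosing the tripod gait.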
Fig. 3. Representation of the triangular support polygon made by the tripod gait.
Fig. 4. The high-level design of the hexapod robotic spider.
The high-level design of the hexapod spider robot is shown in Fig. 4. Our implementation is an Arduino-based robot that uses servo motors connected to a connector board; the Arduino sends serial commands that instruct each leg according to the program. The Arduino microcontroller is helpful for this implementation since Arduino comes equipped with a servo library for driving large numbers of servo motors with PWM signals. The servo library is carefully written to use a single timer to produce servo PWM on about 40 pins at once. Since our implementation requires 18 servo motors, three in each leg connecting the coxa, femur, and tibia, this servo library is sufficient. In addition, we used PWM signals transmitted over the connector board to control the servo motors' positions (Fig. 5).
Fig. 5. Servo joint movement diagram on each leg.
2 Related Work
There has been much research and experimentation relating to hexapod spiders, mainly focused on the locomotion algorithm, gait walking pattern, and spider weight. The interest in hexapods stems from the fact that they can easily achieve static stability during walking compared to other legged robots. During the early stages of hexapod spider design, these robots' motion was usually human-operated [3]. Around 1977, Ohio State University (OSU) developed a six-legged robot able to walk over short distances [4]. The most important development came in 1983, when Carnegie-Mellon University developed the "Six-Legged Hydraulic Walker." This design was the first to mimic and utilize different gait walking patterns to navigate rough terrain. It combined computer control, hydraulic feedback, and human control, and was powered by a 13 kW gasoline engine [5]. More recently, a series of bio-inspired robots was developed by Case Western Reserve University. One such robot had about 24 degrees of freedom, with three degrees of freedom in each back leg. The design had great speed and agility, and each leg resembled the human anatomy, consisting of the coxa, femur, and tibia. It is similar to our design, in which the spider legs resemble the human anatomy, each leg having three degrees of freedom for 18 degrees of freedom in total. Research shows that a single degree of freedom per leg is ineffective, as the robot then has only six motors driving its locomotion. Most hexapod spiders are controlled through 12 or more servo motors driven by an Arduino, with the microcontroller commanding each leg. Usually, commands are entered on a computer that is connected to a Bluetooth module; they are transmitted to the microcontroller, and the hexapod then moves. The work in [6] utilized four different gaits: explorer, tripod, biped, and wave. In the wave gait, only one leg is in its swing phase and the other five are in the stance phase. In the tripod gait, three legs are in the swing phase and the other three are in the stance phase.
The biped gait is like the tripod gait; the only difference is that the legs move serially, while in the explorer gait each leg is located after every motion [6]. Ramdya, Thandiackal, and associates studied alternative, faster insect gaits [7]. Their study found that on flat ground the bipod gait is much quicker than the alternating tripod gait, but in rougher conditions the tripod gait offers more speed and stability than the bipod gait. Our design uses the tripod gait because it can be used on both rough and smooth terrain. In [7], the researchers found that, depending on specific environmental conditions, the hexagonal hexapod model has a better turning ability, better stability margins, and a greater stride length than the traditional rectangular model. They used the following equation to derive the reachable area of a single leg within the hexagonal model:
$$ r_{max}^2 = (r_{min} + Q)^2 + \left(\tfrac{1}{2} P\right)^2 \tag{1} $$
where $r_{max}$ is the maximum reach, $r_{min}$ is the minimum reach, and $Q$ and $P$ bound the area in which the legs do not collide. Using this formula, as stated in [7], they found the stride length relationship between the two models: if the $Q/P$ ratio is greater than 0.866, the hexagonal model has the greater stride length. Thus, the hexagonal body allows the robot to move and turn faster than its rectangular counterpart, with only a 60-degree range of movement for each leg. The hexagonal model also benefits from smoother turns due to its shape and the placement of each leg. This placement of the legs also improves the stability margin considerably, as shown in Section D of [7]. In our design, we implemented a hexagonal body for our hexapod based on Chu and Pang's conclusions and will further tweak their design based on our specifications.
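As a small worked example of the reachable-area relation and the $Q/P$ criterion quoted above, the sketch below evaluates $r_{max}$ for given $r_{min}$, $Q$ and $P$ and checks the 0.866 threshold. The numerical values are arbitrary and the function names are hypothetical; the formula is used as read here, $r_{max}^2 = (r_{min} + Q)^2 + (P/2)^2$.

```python
import math

STRIDE_RATIO_THRESHOLD = 0.866  # Q/P threshold quoted in the text


def r_max(r_min: float, Q: float, P: float) -> float:
    """Maximum reach from r_max^2 = (r_min + Q)^2 + (P / 2)^2."""
    return math.sqrt((r_min + Q) ** 2 + (P / 2.0) ** 2)


def hexagonal_has_longer_stride(Q: float, P: float) -> bool:
    """True when Q/P exceeds the threshold favoring the hexagonal model."""
    return Q / P > STRIDE_RATIO_THRESHOLD


if __name__ == "__main__":
    # Arbitrary illustrative dimensions (same length unit throughout).
    print(r_max(r_min=1.0, Q=2.0, P=8.0))
    print(hexagonal_has_longer_stride(Q=9.0, P=10.0))
```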
3 Analysis and Design Details
To complete the robot implementation successfully, we first need to determine the masses, shapes, and sizes of our parts and how they affect the movement and fluidity of the robot. We look at the duty factor, Froude number, specific resistance, and stability margin. The duty factor determines how fast the robot is moving, i.e. the threshold between walking and running. In [8], the duty factor is calculated by
$$ \beta = \frac{\text{supporting period}}{\text{cycle time}} \tag{2} $$
The duty factor lets us distinguish between walking and running: the robot is running if $\beta < 0.5$, whereas the hexapod walks if $0.5 \le \beta$. In [9], Froude's number is calculated by
$$ Fr^2 = \frac{V^2}{g h} \tag{3} $$
Froude's number is essential: we can manipulate Eq. (3) to compute the requirements for our hexapod's stride and frequency. Here $V$ is the velocity, $g$ is the gravitational acceleration, and $h$ is the height of the tibia above the ground. We can estimate $V$ as $h$ times $f$, where $h$ is the height of the leg and $f$ is the stride frequency [10]. Replacing $V$ with $hf$ gives
$$ Fr^2 = \frac{h f^2}{g} \tag{4} $$
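The relations above can be evaluated numerically as in the following sketch, which computes the duty factor and the squared Froude number from either the speed or the stride frequency, and inverts the latter to find the stride frequency needed for a target value. The leg height and frequency in the example are arbitrary, and the function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def duty_factor(supporting_period: float, cycle_time: float) -> float:
    """beta = supporting period / cycle time."""
    return supporting_period / cycle_time


def froude_squared_from_speed(v: float, h: float, g: float = G) -> float:
    """Fr^2 = V^2 / (g h)."""
    return v * v / (g * h)


def froude_squared_from_stride(h: float, f: float, g: float = G) -> float:
    """Fr^2 = h f^2 / g, obtained by substituting V = h f."""
    return h * f * f / g


def stride_frequency_for(fr2: float, h: float, g: float = G) -> float:
    """Stride frequency that yields a given Fr^2 for leg height h."""
    return math.sqrt(fr2 * g / h)


if __name__ == "__main__":
    h, f = 0.10, 2.0            # arbitrary leg height [m] and frequency [Hz]
    print(duty_factor(0.3, 0.5))
    print(froude_squared_from_stride(h, f))
    print(stride_frequency_for(froude_squared_from_stride(h, f), h))
```

The two Froude expressions agree whenever the speed is taken as $V = hf$, which is the substitution made in the text.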
We considered the leg architecture, servos, actuators, power supply, control architecture, the mechanical structure of the robot body, walking gaits, and speed. We picked a hexagonal robot for various reasons. Rectangular robots mostly require a special gait for turning [11], and sometimes need about four steps to complete a turning action. In contrast, hexagonal robots have shown better performance in terms of gait and direction changes. In addition, they can easily steer and maintain a longer stability margin, which is vital because our end goal is a hexapod able to move fluidly. Research shows that hexagonal hexapod robots have better turning ability than rectangular robots [12]. For the kinematics implementation, there are various leg types: bio-inspired legs, usually modelled on animals such as mammals and reptiles, and non-zoomorphic legs. We want legs consisting of the coxa, femur, and tibia, set at an angle around a hexagonal frame with the knee pointing outward. The legs are placed radially, allowing movement in all directions. We estimated that the spider robot requires a 6 V 2800 mA battery to power the whole circuit, and we calculated the exact power that will drive the robot.
Eighteen servo motors in total are needed, three in each leg for three degrees of freedom (DOF). Compared to other actuators, such as linear actuators, the servo actuator is cheaper and easy to control. Linear actuators are not used much because their range is limited. Pneumatic actuators are stiff and have slow or inaccurate responses because they need an on-board air supply. Hydraulic actuators are decent for hexapod robots but must carry an engine to drive the pump [13]. There are several ways of arranging actuators for the synchronization of the legs. We used the three-degrees-of-freedom solution, in which the design consists of links connected as knee joints and the feet are positioned by commanding the angles of these joints using PWM. Compared to other designs, this design is helpful because of the flexibility of its three degrees of freedom. The pinion-belt arrangement uses a pulley and belt for control, which is inefficient for a hexapod spider robot because powerful motors are required to move each leg, adding mass to the robot. It is not ideal for what we want to achieve: fluid and efficient movement of each leg. The design of each leg is like that of human anatomy: it contains the coxa, femur, and tibia, with servo motors connecting the components, as seen in Fig. 6. Each servo motor moves in a different direction, as shown in the figure, to allow for adjustment and give us the flexibility to control the robot. There is a connection between the femur joint and the tibia of the spider leg, and we plan to treat the joints of one leg as a mass point in a mass-point system to achieve equilibrium.
Therefore, the moment applied at each mass end of a single leg about a fixed axis will equal the moment applied about the same axis by all the forces, which keeps the robot balanced once it starts moving.
Fig. 6. Belt actuator.
First, for the design of each leg, the microcontroller must know each leg's orientation so that the servos can be centered. The coxa part of the leg directly faces the robot's body. The femur is flat with respect to the ground, while the tibia is connected to the femur at about 80° (Fig. 7). To achieve this, we had to ensure that the servos were centered before mobilizing the legs. We attached the femurs to the joint where the servo and coxa connect. Since one side of the 3D-printed femur is flat and the other is curved, the flat side faces down near the coxa joint. We then attached the tibia to the femur so that it is roughly perpendicular, directly vertical to the femur. This means the coxa and femur make an angle of 90°, while the tibia and femur connection makes an angle of 80°.
Fig. 7. Degree of each joint.
We utilized high-torque servo motors in a closed-loop system with feedback to control their position (angle). These servos can rotate from 0 to 180°. To achieve the desired angle for each leg, we adjusted the pulse width by changing the duty cycle. For example, a pulse width of 1.5 ms sets the servo motor to 90°. Reducing the pulse width to 1.0 ms moves the servo motor to 0°. This is beneficial in our implementation because each leg has three servo motors, and each servo must be aligned at a different angle. It gives us flexibility in adjusting the legs. The control unit is the Arduino microcontroller together with other parts, for example, CDM1, CDM2, PCA9685, etc. As stated above, PWM signals are sent from the microcontroller to the servo motors to achieve the desired angle. A servo motor is used in our design instead of a stepper or DC motor for several reasons. First, servo motors are much cheaper and more lightweight. Second, stepper and DC motors are less efficient and more difficult to control because these types of motors usually need a regulator to control them [14]. A servo motor, on the other hand, is a precision positioning device that rotates up to 180° according to the duty cycle or PWM command it receives. This is helpful because we plan to hold the three servos in each leg at certain angles to establish the degrees of freedom and fluidity [15]. We also used the Arduino microcontroller compared
M. Ayad et al.
to other microcontrollers because it is particularly suitable for driving the servo motors, for several reasons. The Arduino microcontroller comes with a servo library for large numbers of servo motors and PWM signals. The servo library is carefully written to use a single timer to produce servo PWM on about 40 pins at once, which is more than enough for our design and implementation. The above figure is the robot control we implemented. The user sets the desired position and walking gait, and the built-in Arduino servo library accepts those parameters to calculate the angle and control the servo motors [2]. Also, an Arduino-based microcontroller is easier to implement and much cheaper than other microcontrollers. For our gait planning, we implemented a walking gait pattern as briefly described above. Figure 8 shows the flow chart of the hexapod robotic spider movement algorithm.
Fig. 8. Flow chart of hexapod robotic spider movement algorithm.
We created different functions for the servo motors that take positional parameters. The Arduino microcontroller is used because it comes with a carefully written servo library capable of driving large numbers of motors, which makes the implementation more straightforward and quicker. Research also allowed us to understand which walking gait characteristics are more efficient. Our study concluded that the tripod movement pattern is beneficial for fluid movement and for creating balance.
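The alternation behind a tripod gait can be illustrated with a minimal sketch. It assumes a hypothetical leg naming (L1 to L3 on the left, R1 to R3 on the right, front to rear) and the usual grouping into two tripods; it is not the authors' firmware, only an illustration of why three legs always remain on the ground.

```python
# Hypothetical leg labels: L = left, R = right, 1 = front, 2 = middle, 3 = rear.
TRIPOD_A = ("L1", "R2", "L3")  # front-left, middle-right, rear-left
TRIPOD_B = ("R1", "L2", "R3")  # front-right, middle-left, rear-right


def tripod_phase(step):
    """Return (swing_legs, stance_legs) for a given step index.

    On even steps tripod A swings while tripod B supports the body;
    on odd steps the roles are exchanged.
    """
    if step % 2 == 0:
        return TRIPOD_A, TRIPOD_B
    return TRIPOD_B, TRIPOD_A


if __name__ == "__main__":
    for step in range(4):
        swing, stance = tripod_phase(step)
        print(f"step {step}: swing={swing} stance={stance}")
```

Because each tripod places one leg on one side and two on the other, the three stance legs always form a support triangle, which is the property the text associates with balance.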
4 Design Component
In our design, we utilized a ping sensor, an SSC-32U controller, and a BotBoarduino.
Fig. 9. Ping sensor.
4.1 Ping Sensor
Figure 9 shows the ping sensor. It is placed at the front of the robot to detect obstacles so that the robot can react to them. A ping sensor provides precise distance measurements from about two centimeters to three meters. The sensor emits an ultrasonic pulse that reflects off a surface and returns to the sensor. The trigger in the ping sensor sends out the ultrasound signal, and the echo receives it, much like an output and an input. By measuring the echo pulse width, the distance can easily be calculated. We used only one ping sensor, at the front of the robot, in our design.
4.2 Hi-Tec Servo
Figure 10 shows one of the eighteen Hi-Tec servo motors used in the project. The servo travel was limited to −90° to +90°. An RC servo usually receives a timed pulse every 20 ms. The Hi-Tec servo accepts a repeated 5 V pulse of about 500 µs to 2500 µs, so the duty cycle ranges from about 2.5% to 12.5%. A 500 µs pulse rotates the servo to −90°, and a 2500 µs pulse rotates it to +90°. Thus, a 1500 µs pulse centers the servo.
4.3 SSC-32U Servo Controller
The SSC-32U servo controller shown in Fig. 11 controls up to 32 servos with GPIO pins. Pins 0 to 15 correspond to VS1, and pins 16 to 31 correspond to VS2. VS1 provides power directly to pins 0 to 15; the voltage provided was approximately 6 V. VS2 provides power to pins 16 to 31. The RX and TX pins are serial pins on the board that allow communication: TX on the BotBoarduino connects to RX on the SSC-32U, RX on the BotBoarduino connects to TX on the SSC-32U, and GND on the BotBoarduino connects to GND on the SSC-32U.
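The two conversions described in Sects. 4.1 and 4.2 can be sketched as follows. The pulse limits (500 µs and 2500 µs) and the 20 ms period come from the text; the function names and the speed of sound (about 343 m/s, i.e. 0.0343 cm/µs at room temperature) are assumptions made for this illustration.

```python
SPEED_OF_SOUND_CM_PER_US = 0.0343  # assumed value, roughly 343 m/s in air


def angle_to_pulse_us(angle_deg, min_us=500.0, max_us=2500.0):
    """Map a servo angle in [-90, 90] degrees linearly to a pulse width in microseconds."""
    if not -90.0 <= angle_deg <= 90.0:
        raise ValueError("angle must be within -90..90 degrees")
    return min_us + (angle_deg + 90.0) / 180.0 * (max_us - min_us)


def duty_cycle_percent(pulse_us, period_ms=20.0):
    """Duty cycle of a pulse repeated every period_ms milliseconds."""
    return pulse_us / (period_ms * 1000.0) * 100.0


def echo_to_distance_cm(echo_us):
    """Distance to the obstacle from the echo pulse width.

    The sound travels to the obstacle and back, hence the division by two.
    """
    return echo_us * SPEED_OF_SOUND_CM_PER_US / 2.0


if __name__ == "__main__":
    for angle in (-90, 0, 90):
        pulse = angle_to_pulse_us(angle)
        print(angle, pulse, duty_cycle_percent(pulse))
    print(echo_to_distance_cm(1000.0))
```

A linear map is the simplest choice consistent with the three calibration points given in the text; a real servo may need per-unit offsets, which the Discussion section mentions setting during initial setup.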
Fig. 10. Hi-Tec servo motor.
Fig. 11. SSC32-U controller.
4.4 Botboarduino
Figure 12 shows the BotBoarduino board. Label #10 marks the I/O part connected to the RX and TX pins of the SSC-32U. The SSC-32U acts as a servo extender because the BotBoarduino does not have enough input/output pins for the 18 servo motors. Finally, label #15 marks the VS and VL terminals. VS is used to power the servos, which is the configuration used in our case.
Fig. 12. BotBoarduino.
5 Discussion
Throughout our design and implementation, we were able to apply engineering, science, and mathematics principles. On the mathematics side, we calculated the inverse kinematics and stored the results in an array. On the engineering side, we learned how the SSC-32U servo controller interfaces with the Arduino. The purpose of the hexapod was to navigate efficiently through flat and rough terrain where humans cannot physically go. The design kept public health and safety in mind, and the low-impact hexagonal frame made the robot's movement and navigation efficient. As engineers, our job is to debug and make informed decisions based on the project's progress. We usually debugged our code by running it manually. Determining the weight of the robot was crucial for working out efficient movement. Every team member knew their tasks and carried them out as planned. We stayed confident, established goals, planned tasks, and met our objectives. Our main objective was to move the robot and avoid obstacles. So first, we tested each servo to make sure it worked, and then we set the servo offsets to 0°. This was helpful in the initial setup, where the robot is placed at a reference point, our baseline. Next, the arrangement of the servos allowed us to send specific servo positions to move the robot. Completing this project required learning the Arduino libraries and the SSC-32U. In addition, we needed to learn how the SSC-32U and the Arduino communicate over the TX and RX pins and to pick the appropriate baud rate, which in our case is 9600. We achieved our objectives and ran different experiments to test the spider robot. Figure 13 shows our spider robot moving and avoiding obstacles in the experimental field in the lab. In terms of control, we operated the robot from an Android app through a Bluetooth module. It worked in excellent condition.
It moved fluidly, stayed balanced on three, four, and six legs, and rotated quickly. We tested the robot with different obstacles, including our hands, to make it stop and resume. It worked as intended. The design is promising for use in an actual rescue field.
Fig. 13. The spider robot in the experimental field.
6 Conclusion
In this paper, our primary focus was the leg movement and ensuring its fluidity. Although we still had some issues with slipping and sliding because the feet of the robotic spider have no grip, the project was a success, as the legs moved according to our specifications. The spider robot has future potential because it can be applied as a rescue robot in dangerous environments. Future work can further improve the hexapod's movement patterns, for example with jumping or hopping. In addition, another gait pattern could be developed. The project has considerable potential, as there are many possible applications. One extension is the addition of sensors to the hexapod spider for maneuvering and navigation when an obstacle is present. Once a sensor detects a heartbeat, it can send a signal to one's phone as an alert. A camera can also be mounted on top of the spider robot for image processing while it is navigating. The user can access the camera through an app or other software to view what the robotic spider sees.
References
1. Saenz, A., Santibañez, V., Bugarin, E., Dzul, A., Ríos, H., Villalobos-Chin, J.: Velocity control of an omnidirectional wheeled mobile robot using computed voltage control with visual feedback: experimental results. Int. J. Control Autom. Syst. 19(2), 1089–1102 (2020). https://doi.org/10.1007/s12555-019-1057-6
2. Mohamad Nor, M.H., Ismail, Z.H., Ahmad, M.A.: Broadcast control of multi-robot systems with norm-limited update vector. Int. J. Adv. Robot. Syst. 17(4), 1729881420945958 (2020)
3. Schneider, A., Schmucker, U.: Force sensing for multi-legged walking robots. In: Theory and Experiments Part 1: Overview and force sensing. In Mobile Robotics, Moving Intelligence, Germany, Austria, pp. 125–174 (2006)
4. McGhee, R.: Control of legged locomotion systems. In: Proceedings of the 18th Automatic Control Conference, San Francisco, CA (1977)
5. Robert, M.: Legged Robots that Balance, Cambridge, London (1986)
6. Karakurt, T., Durdu, A., Yilmaz, N.: Design of six-legged spider robot and evolving walking. Int. J. Mach. Learn. Comput. 5(2), 96–100 (2015)
7. Ramsay, P., Thandiackal, R., Cherney, R.: Climbing favors the tripod gait over alternative faster insect gaits. Nat. Commun. 8(17), 1–14 (2017)
8. Alexander, R.: Exploring Biomechanics-Animals in Motion, New York, NY (1992)
9. Alexander, R.M.N.: The gait of bipedal and quadrupedal animals. Int. J. Robot. Res. 3, 49–59 (1984)
10. Ding, X., Wang, Z., Rovetta, A., Zhu, J.: Locomotion analysis of hexapod robot. In: Climbing and Walking Robots, Rijeka, pp. 291–310 (2010)
11. Chu, S., Pang, G.: Comparison between different models of hexapod robot in fault-tolerant gait. IEEE Trans. Syst. Man Cybern. 32, 752–756 (2002)
12. Helen: SeedStudio (2020). https://www.seeedstudio.com/blog/2019/04/01/choosing-the-right-motor-for-your-project-dc-vs-stepper-vs-servo-motors/
13. Nonami, K., Barai, R., Irawan, A., Daud, M.: Hydraulically Actuated Hexapod Robots, pp. 78–104. Springer, London (2014). https://doi.org/10.1007/978-4-431-54349-7
14. Choi, B., Song, S.: The optimally stable ranges of 2n-legged wave gaits. IEEE Trans. Syst. Man Cybern. 20(4), 888–902 (1990)
15. Cavas, M., Ahmad, M.B.: A review on spider robotic system. Int. J. New Comput. Archit. Appl. 9(1), 19–24 (2019)
The Applying of Low Order Frequency-Dependent Components in Signal Processing of Autonomous Mobile Robotic Platforms
Ivan Afanasyev, Valery Sytnikov, Oleg Strelsov, and Pavel Stupen
Odessa National Polytechnic University, Odesa, Ukraine
[email protected]
Abstract. In this paper, different types of low-order digital filters are reviewed. Models of the transfer function coefficients are introduced. Modelling results for cascades of same-type low-order filters are described on the basis of the obtained coefficient models.
Keywords: Frequency-dependent components · Adaptive filters · Cognitive computing
1 Introduction
Modern smart systems increasingly require components with elements of artificial intelligence, both in robotics and in specialized systems. This line of development of technical systems, computer technologies, and computing hardware follows the Industry 4.0 concept. Within this concept, the requirements placed on internet technologies, their components, and software imply the need to improve these components and to create new ones that meet criteria of mobility, flexibility, and adaptivity to operating conditions [1]. Today, technical equipment and production facilities are fitted with multifunctional sensors, actuators, and controllers. Under these conditions, wireless networks contribute to fast data gathering. After preprocessing, the network sends the data to the relevant enterprise services. Using this cycle, administrators can make correct and informed decisions. The main goal is to reach a level of automation at which the equipment works without staff intervention across all production. The role of staff is then reduced to monitoring the machines and intervening in emergency situations [1–3]. These tasks arise when mobile platforms, special computer systems, and critical-situation resolution systems are used [4–6].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 882–891, 2022. https://doi.org/10.1007/978-3-031-10464-0_61
The task of applying and modernizing artificial intelligence lies at the core of the development of such autonomous mobile systems and related fields. It is easier for humans to study their own kind. Many topics in the humanities are dedicated to models, functions, and patterns of the human brain. Research on these topics amounts to the mind studying cognition itself, a process called metacognition. One problem is that such models are not accurate, which means that transferring them to a technical system is not yet possible or will be incomplete [2, 3]. Cognitive computing deals with the realization and simulation of human brain models in technical systems. Any implementation of such systems can include elementary cognition elements, interactive space-time models, and events related in time. A technical realization of a cognition model consists of one or more cognitive processes and is, in general, nonlinear [7]. A simple analogy is between the human sensory system and the sensor pool of a robotic system. The human system is described by neurophysiology, whereas robotics is based on special algorithms, hardware, and software. The mere presence of sensors does not make a system cognitive; it only brings the system closer to realizing one of the cognitive processes, perception. The topic of cognitive technical systems is complex and intricate, and it requires knowledge from the humanities. When the topic is decomposed, different directions emerge, each deserving individual attention. One of them is perception. When transferring perception to technical systems, the first step can be the realization of a data processing module. Almost any robot has several sensors to obtain information from the world. To obtain the information in a convenient form, it should be digitized. Such a handler can be implemented using adaptive data processing. Adaptivity is one of the functions of perception [8]. Adaptive processing is often necessary when frequency-dependent components are used.
The adaptive process takes place through adjustment of the transfer function coefficients according to input criteria. These modules perform the task of discriminating and digitizing a signal, which yields data in a convenient form. Data processing paths are the best place to use them [8]. A problem with the signal discrimination requirement is that efficient performance calls for much more complex frequency-dependent components. The complexity is determined by the higher order of the component. Such components are harder to manage and develop, and they require considerable computational resources. Considering the operating conditions and limits of autonomous mobile systems, the use of simple elements is advisable. Higher-order components can be built from them, which gives the module flexibility. The resulting structure can be reconfigured partially or rebuilt completely. This paper presents models of low-order frequency-dependent components. These models make it possible to develop data processing modules that serve as inputs to a system realizing cognitive processes. The possibilities of using low-order components through step-by-step chaining are shown. It should be noted that frequency-dependent components can be divided into filters and typical components of automatic control systems [10–12].
2 The Description of Low Order Frequency-Dependent Components
The use of low-order frequency-dependent components in autonomous mobile robotic platforms is reviewed using first-order digital filters as an example. The first-order filters include low-pass filters (LF) and high-pass filters (HF), as well as bandpass filters (BF) and stopband filters (SF). However, the transfer function of each type has unique features. The transfer functions for LF and HF have the following form:

$$H(z) = a_0 \frac{1 \pm \frac{a_1}{a_0} z^{-1}}{1 + b z^{-1}} = a_0 \frac{1 \pm z^{-1}}{1 + b z^{-1}} \tag{1}$$

where the sign «+» in the numerator corresponds to LF and the sign «−» to HF; $a_0$, $a_1$, $b$ are real coefficients of the numerator and denominator, and $a_0 = a_1$. The coefficients of the Butterworth filter coincide with those of all types of Chebyshev filter. The first-order transfer functions for bandpass and stopband filters are more complicated. The BF transfer function has the following form:

$$H(z) = a_0 \frac{1 - z^{-2}}{1 + b_1 z^{-1} + b_2 z^{-2}} \tag{2}$$

The SF transfer function:

$$H(z) = a_0 \frac{1 + \frac{a_1}{a_0} z^{-1} + z^{-2}}{1 + b_1 z^{-1} + b_2 z^{-2}} = a_0 \frac{1 + a z^{-1} + z^{-2}}{1 + b_1 z^{-1} + b_2 z^{-2}} \tag{3}$$

where $a_0$, $a_1$, $b_1$, $b_2$ are real coefficients of the numerator and denominator, and

$$a = \frac{a_1}{a_0} \tag{4}$$
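To make the first-order transfer function (1) concrete, the sketch below implements the corresponding difference equation, y[n] = a0 (x[n] ± x[n−1]) − b y[n−1], in plain Python. The function name and the coefficient values are illustrative assumptions; a0 is chosen here so that the LF gain at zero frequency, 2a0/(1 + b), and the HF gain at the Nyquist frequency, 2a0/(1 − b), both equal one.

```python
def first_order_filter(samples, a0, b, kind="LF"):
    """Apply H(z) = a0 (1 +/- z^-1) / (1 + b z^-1) to a sequence of samples.

    kind="LF" uses the '+' sign (low-pass), kind="HF" the '-' sign (high-pass).
    """
    sign = 1.0 if kind == "LF" else -1.0
    out = []
    x_prev = 0.0  # x[n-1]
    y_prev = 0.0  # y[n-1]
    for x in samples:
        y = a0 * (x + sign * x_prev) - b * y_prev
        out.append(y)
        x_prev, y_prev = x, y
    return out


b = -0.5  # illustrative value; the pole of the filter is at z = -b = 0.5

# Low-pass: a constant input should settle at the input level (unit DC gain).
step_response = first_order_filter([1.0] * 50, a0=(1 + b) / 2, b=b, kind="LF")

# High-pass: an alternating input (the highest digital frequency) passes with unit gain.
alternating = [(-1.0) ** n for n in range(60)]
hf_out = first_order_filter(alternating, a0=(1 - b) / 2, b=b, kind="HF")

if __name__ == "__main__":
    print(step_response[-1], hf_out[-1])
```

Each output sample costs two multiplications and two additions, which illustrates why such sections are attractive for platforms with limited computing resources.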
3 The Models of Low Order Frequency-Dependent Components
Based on the description of these components and an analysis of the frequency response (FR) of LF and HF, a mathematical model can be built. After transformation, the model for LF has the following form:

$$\begin{cases} a_0 = \dfrac{1+b}{2}, \\[4pt] a_1 = a_0, \\[4pt] \dfrac{4a_0^2\cos^2\frac{\omega_c}{2}}{(1+b)^2 - 4b\sin^2\frac{\omega_c}{2}} = c^2, \end{cases} \tag{5}$$

where $a_0$, $a_1$, $b$ are real coefficients of the numerator and denominator, and $c$ is the ripple level that defines the cutoff frequency $\omega_c$.
Accordingly, after transformation, the model for the first-order HF has the following form:

$$\begin{cases} a_0 = \dfrac{1-b}{2}, \\[4pt] a_1 = -a_0, \\[4pt] \dfrac{4a_0^2\sin^2\frac{\omega_c}{2}}{(1+b)^2 - 4b\sin^2\frac{\omega_c}{2}} = c^2. \end{cases} \tag{6}$$

Based on a common analysis of the frequency responses of the bandpass and stopband filters, after transformation, the second-order model for the bandpass filter has the following form:

$$\begin{cases} a_0 = -a_2, \\[2pt] a_1 = 0, \\[2pt] \dfrac{(2a_0\sin\omega_1)^2}{(1-b_2)^2 + b_1^2 + 2b_1(1+b_2)\cos\omega_1 + 4b_2\cos^2\omega_1} = c^2, \\[8pt] \dfrac{(2a_0\sin\omega_p)^2}{(1-b_2)^2 + b_1^2 + 2b_1(1+b_2)\cos\omega_p + 4b_2\cos^2\omega_p} = 1, \\[8pt] \dfrac{(2a_0\sin\omega_2)^2}{(1-b_2)^2 + b_1^2 + 2b_1(1+b_2)\cos\omega_2 + 4b_2\cos^2\omega_2} = c^2, \\[8pt] b_1 = -(b_2+1)\cos\omega_p. \end{cases} \tag{7}$$

Likewise, after transformation, the model for the stopband filter has the following form:

$$\begin{cases} a_0 = a_2, \\[2pt] a_1 \gtrless 0, \\[2pt] \dfrac{(a_1+2a_0)^2}{(1-b_2)^2 + b_1^2 + 2b_1(1+b_2) + 4b_2} = 1, \\[8pt] \dfrac{(a_1+2a_0\cos\omega_1)^2}{(1-b_2)^2 + b_1^2 + 2b_1(1+b_2)\cos\omega_1 + 4b_2\cos^2\omega_1} = c^2, \\[8pt] \dfrac{(a_1+2a_0\cos\omega_2)^2}{(1-b_2)^2 + b_1^2 + 2b_1(1+b_2)\cos\omega_2 + 4b_2\cos^2\omega_2} = c^2, \\[8pt] \dfrac{(a_1-2a_0)^2}{(1-b_2)^2 + b_1^2 - 2b_1(1+b_2) + 4b_2} = 1, \\[8pt] b_1 = -(b_2+1)\cos\omega_0, \end{cases} \tag{8}$$

where $a_0$, $a_1$, $a_2$, $b_1$, $b_2$ are real coefficients of the numerator and denominator. The obtained models of these components make it possible to calculate the coefficients needed to restructure their properties depending on the operating conditions.
4 The Transfer Function Coefficients Calculation
4.1 The First Order Low Filters and First Order High Filters
To determine the coefficients $a_0$ and $b$, the research shows that for LF:

$$b = -\left\{ 1 - \frac{2c^2\sin^2\frac{\omega_c}{2}}{c^2 - \cos^2\frac{\omega_c}{2}} \left( 1 - \frac{\cos\frac{\omega_c}{2}}{\sin\frac{\omega_c}{2}} \sqrt{\frac{1-c^2}{c^2}} \right) \right\} \tag{9}$$

$$a_0 = \frac{1+b}{2} \tag{10}$$

For HF:

$$b = 1 - \frac{2c^2\cos^2\frac{\omega_c}{2}}{c^2 - \sin^2\frac{\omega_c}{2}} \left( 1 - \frac{\sin\frac{\omega_c}{2}}{\cos\frac{\omega_c}{2}} \sqrt{\frac{1-c^2}{c^2}} \right) \tag{11}$$

$$a_0 = \frac{1-b}{2} \tag{12}$$

where $\omega_c$ is the cutoff frequency and $c$ is the ripple level. However, this form of the dependence of the coefficients on the cutoff frequency $\omega_c$ and the ripple level $c$ is inconvenient when computing resources are limited. An auxiliary value $\xi$ is therefore proposed:

$$c = \cos\frac{\xi}{2} \tag{13}$$

This auxiliary value equals $\xi = 2\arccos(c)$, where $c$ lies in the range $0 < c < 1$. Then, after substituting (13) into (9) and (11), a much simpler form can be obtained:

$$b = \frac{\sin\frac{\omega_c - \xi}{2}}{\sin\frac{\omega_c + \xi}{2}} \tag{14}$$

It should be noted that this form matches both LF and HF. Equation (14) reduces the computing cost when the transfer function coefficients have to be calculated and is more convenient to implement in software and microprocessor-based hardware. In addition, sine and cosine values can be stored for a defined range of cutoff frequencies at each step. The auxiliary value $\xi$ introduced in these equations can, based on (13), be expressed through the passband ripple level RP or the stopband ripple level RS:

$$\xi = 2\arccos\left( \frac{1}{\sqrt{10^{0.1\,RP}}} \right) \tag{15}$$

where the value RS for the stopband can be used instead of RP.
4.2 The First Order Bandpass Filters and First Order Stopband Filters
The transfer function coefficients were determined from the models of the bandpass and stopband filters. For the bandpass filters, the following relations were obtained using the auxiliary value (13):

$$\begin{cases} b_2 = \dfrac{\cos\omega_1 - \cos(\omega_2 - \xi)}{\cos\omega_2 - \cos(\omega_1 - \xi)}, \\[8pt] b_1 = -\dfrac{\cos\left(\omega_1 + \frac{\xi}{2}\right) + b_2\cos\left(\omega_1 - \frac{\xi}{2}\right)}{\cos\frac{\xi}{2}}, \\[8pt] a_0 = \dfrac{1-b_2}{2}, \\[4pt] a_2 = -a_0. \end{cases} \tag{16}$$

For stopband filters, these relations have the following form:

$$\begin{cases} b_1 = -(1+b_2)\cos\omega_1 + (1-b_2)\sin\omega_1 \operatorname{ctg}\dfrac{\xi}{2}, \\[8pt] b_2 = \dfrac{\cos\omega_1 + \cos(\omega_2 + \xi)}{\cos\omega_2 + \cos(\omega_1 + \xi)}, \\[8pt] a_0 = \dfrac{1+b_2}{2}, \\[4pt] a_1 = b_1, \\[2pt] a_2 = a_0, \end{cases} \tag{17}$$

where $\omega_1$ and $\omega_2$ are the left and right cutoff frequencies. The obtained relations are much easier to implement and do not contain hyperbolic functions.
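The closed-form relations of this section can be checked numerically with a short sketch. It computes ξ from a ripple level, the LF coefficients from the sine-ratio formula for b, and the BF coefficients from the bandpass relations, and then evaluates the magnitude response at the cutoff frequencies. Several points are assumptions of this sketch rather than statements from the paper: the ripple level is taken in decibels, frequencies are in radians per sample, the function names are hypothetical, and the second cosine in the expression for b1 is taken with the argument ω1 − ξ/2, which is the form that reproduces the level c at both band edges in this check.

```python
import cmath
import math


def xi_from_ripple(rp_db):
    """Auxiliary value xi = 2*arccos(c) with c = 1/sqrt(10**(0.1*RP)); RP assumed in dB."""
    return 2.0 * math.acos(1.0 / math.sqrt(10.0 ** (0.1 * rp_db)))


def lowpass_coeffs(wc, xi):
    """First-order LF: b = sin((wc - xi)/2) / sin((wc + xi)/2), a0 = (1 + b)/2."""
    b = math.sin((wc - xi) / 2.0) / math.sin((wc + xi) / 2.0)
    return (1.0 + b) / 2.0, b


def bandpass_coeffs(w1, w2, xi):
    """BF coefficients (a0, b1, b2) from the left and right cutoff frequencies."""
    b2 = (math.cos(w1) - math.cos(w2 - xi)) / (math.cos(w2) - math.cos(w1 - xi))
    b1 = -(math.cos(w1 + xi / 2.0) + b2 * math.cos(w1 - xi / 2.0)) / math.cos(xi / 2.0)
    return (1.0 - b2) / 2.0, b1, b2


def magnitude(num, den, w):
    """|H(e^{jw})| for polynomials in z^-1 given as coefficient lists."""
    z1 = cmath.exp(-1j * w)
    n = sum(coef * z1 ** k for k, coef in enumerate(num))
    d = sum(coef * z1 ** k for k, coef in enumerate(den))
    return abs(n / d)


xi = xi_from_ripple(3.0)   # roughly the usual 3 dB level
c = math.cos(xi / 2.0)

wc = 0.3 * math.pi
a0_lf, b_lf = lowpass_coeffs(wc, xi)
lf_at_cutoff = magnitude([a0_lf, a0_lf], [1.0, b_lf], wc)

w1, w2 = 1.0, 1.5
a0_bf, b1, b2 = bandpass_coeffs(w1, w2, xi)
bf_num, bf_den = [a0_bf, 0.0, -a0_bf], [1.0, b1, b2]
bf_at_w1 = magnitude(bf_num, bf_den, w1)
bf_at_w2 = magnitude(bf_num, bf_den, w2)
wp = math.acos(-b1 / (1.0 + b2))          # center frequency from b1 = -(b2 + 1) cos(wp)
bf_at_center = magnitude(bf_num, bf_den, wp)

if __name__ == "__main__":
    print(c, lf_at_cutoff, bf_at_w1, bf_at_w2, bf_at_center)
```

Only a handful of sine and cosine evaluations are needed per retuning, which fits the remark that these values can be tabulated for a range of cutoff frequencies.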
5 The Additional Facilities in Using the Low Order Filters
Base components alone cannot solve tasks that require sharper signal discrimination; such tasks call for higher-order components. Because these are complex to develop and to adapt, it is more advisable to use cascades of identical components. Such chains can increase the selectivity of the amplitude response and change the cutoff frequency for robotic platforms [13]. The transfer functions of digital filters connected in cascade are multiplied (Fig. 1):

$$H(p) = \prod_{i=1}^{n} H_i(p) \tag{18}$$

where $H(p)$ and $H_i(p)$ are the resulting and the individual transfer functions. For low-pass filters, the multiplication shifts the cutoff frequency toward lower frequencies; for bandpass filters, the cutoff frequencies shift toward the center of the band. The characteristic becomes more compressed, and the filter selectivity therefore increases [14].
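The effect of the product of transfer functions on the cutoff frequency can be illustrated numerically. The sketch below is an independent illustration rather than the authors' experiment: it cascades N identical first-order low-pass sections with illustrative coefficients (b = −0.5, a0 = 0.25, so the gain at zero frequency is one) and finds, by bisection, the frequency at which the cascade magnitude falls to 1/√2. The function names are assumptions.

```python
import cmath
import math


def lf_magnitude(w, a0, b):
    """Magnitude of a0 (1 + z^-1) / (1 + b z^-1) at frequency w (radians per sample)."""
    z1 = cmath.exp(-1j * w)
    return abs(a0 * (1.0 + z1) / (1.0 + b * z1))


def cascade_cutoff(n_sections, a0, b, level=1.0 / math.sqrt(2.0)):
    """Frequency where the magnitude of n identical cascaded sections drops to `level`.

    The cascade magnitude is the single-section magnitude raised to the power n.
    It decreases monotonically from 1 at w = 0 to 0 at w = pi, so bisection applies.
    """
    lo, hi = 0.0, math.pi
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if lf_magnitude(mid, a0, b) ** n_sections > level:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


a0, b = 0.25, -0.5  # illustrative base section

if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, cascade_cutoff(n, a0, b))
```

The printed cutoff decreases as sections are added, which is the qualitative behavior described in the text for low-pass cascades.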
Fig. 1. The same type frequency-dependent components sequence chain
The features of building an adaptive digital processing path are considered using LF and BF. The research shows that the cutoff frequency of a cascade of up to ten identical base Butterworth filters is well approximated by the following expression:

$$F = F_0 N^{-0.27} \tag{19}$$

where $F_0$ is the cutoff frequency of the base low-order filter (1000 Hz) and $F$ is the cutoff frequency of a cascade of $N$ sections. Based on Eq. (19), a relation can be found that determines the base filter cutoff frequency $F_0$ from the required cutoff frequency $F$ of a cascade of $N$ low-order digital filters:

$$F_0 = F N^{0.27} \tag{20}$$

Based on the obtained result, it can be noted that a cascade of same-type filters requires base filters with a flat frequency response in the passband. Butterworth and Chebyshev filters are of this type. The use of same-type Butterworth base bandpass filters was analyzed with an acoustic sensor with a radiated frequency of 40 kHz [15]. The cutoff frequencies shift toward the center frequency of 40 kHz, the bandwidth contracts, and the selectivity of the frequency response increases. The dependences of the cutoff frequency shift and of the bandwidth contraction are shown in Fig. 2 and Fig. 3. These dependences are well approximated by parametric equations. The left cutoff frequency has the following form:

$$F = F_0 N^{0.081} \tag{21}$$
Fig. 2. The cut frequencies dependency from filter cascade connection count, 1 – right cut frequency, 2 - left cut frequency
The parametric equation for the right cutoff frequency has the following form:

$$F = F_0 N^{-0.06} \tag{22}$$

The bandwidth of a cascade of base bandpass filters has the following approximation:

$$dF = dF_0 N^{-0.55} \tag{23}$$

where $dF$ is the bandwidth of the cascade and $dF_0$ is the bandwidth of the base filter. It should be noted that all the obtained equations have the following form:

$$F_0 = F N^{b} \tag{24}$$

Direct evaluation of this formula is not feasible in mobile and specialized systems with limited computing resources. Therefore, the term $N^b$, where $N$ is an integer and $0 < b < 1$, is written in the following form [16]:

$$N^{b} = e^{b \ln N} \tag{25}$$

The exponential function and the logarithm in this equation are expanded into power series as follows [16]:

$$e^{x} = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \ldots + \frac{x^n}{n!} \tag{26}$$

$$\ln N = 2\left[ \frac{N-1}{N+1} + \frac{1}{3}\left(\frac{N-1}{N+1}\right)^{3} + \frac{1}{5}\left(\frac{N-1}{N+1}\right)^{5} + \ldots + \frac{1}{2n+1}\left(\frac{N-1}{N+1}\right)^{2n+1} \right] \tag{27}$$
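The series evaluation of N^b through the exponential and the logarithm can be sketched as follows, using only additions, multiplications, and divisions. The function names and the numbers of retained terms are assumptions chosen for the illustration; on a real microprocessor the truncation lengths would be tuned to the accuracy required.

```python
def ln_series(n, terms=200):
    """ln(N) = 2 * sum_{k>=0} t^(2k+1) / (2k+1), with t = (N - 1) / (N + 1)."""
    t = (n - 1.0) / (n + 1.0)
    total = 0.0
    power = t            # holds t^(2k+1)
    for k in range(terms):
        total += power / (2 * k + 1)
        power *= t * t
    return 2.0 * total


def exp_series(x, terms=30):
    """e^x = 1 + x/1! + x^2/2! + ... truncated after `terms` terms."""
    total = 1.0
    term = 1.0           # holds x^k / k!
    for k in range(1, terms + 1):
        term *= x / k
        total += term
    return total


def power_series(n, b):
    """N^b evaluated as exp(b * ln N) using the two truncated series."""
    return exp_series(b * ln_series(n))


def cascade_cutoff_hz(f0_hz, n):
    """Cutoff of N cascaded base sections from the reported fit F = F0 * N^(-0.27)."""
    return f0_hz / power_series(n, 0.27)


if __name__ == "__main__":
    print(power_series(10, 0.27), 10 ** 0.27)
    print(cascade_cutoff_hz(1000.0, 10))
```

The logarithm series converges more slowly as N grows, because (N − 1)/(N + 1) approaches one; this is one place where accuracy can be traded for speed, as the conclusions suggest.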
Fig. 3. The graph of normalized bandwidth dependency from cascade connection count for digital bandpass Butterworth filters
6 Conclusions
Models of the coefficients of low-order digital filters have been obtained as a result of this research. These models make it possible to calculate the transfer functions of low-order digital filters. This approach allows modules to be developed on the basis of adaptive frequency-dependent components. Adaptivity addresses the problem that arises in a complex interference environment with limited resources. The foundation of these modules is same-type base filters. The use of simple components does not remove the problem that this feature is computationally expensive. Evaluating the equations in this case requires a capable microprocessor. However, performance can be improved by sacrificing model accuracy. The main use case for the module is data processing paths. The approach requires calculating the coefficients before reconfiguration, replacing them in the device memory, and only then starting the reconfiguration process.
Note: In cascades of same-type digital filters, using Butterworth or Chebyshev base filters is appropriate for providing good selectivity of the amplitude response and a predictable cutoff frequency for LF. For BF, the Butterworth filter is a good choice.
References
1. The Industry 4.0 Standards Landscape from a Semantic Integration Perspective. In: 2017 IEEE 22nd International Conference on Emerging Technologies and Factory Automation (ETFA). https://doi.org/10.1109/ETFA.2017.8247584. https://www.researchgate.net/publication/318208930_The_Industry_40_Standards_Landscape_from_a_Semantic_Integration_Perspective
2. Industry 4.0: an overview. Conference Paper. https://www.researchgate.net/publication/326352993_Industry_40_an_overview
3. Industry 4.0. https://www.cognex.com/ru-ru/what-is/industry-4-0-machine-vision/development
4. Semenov, S., Voloshyn, D., Ahmed, A.N.: Mathematical model of the implementation process of flight task of unmanned aerial vehicle in the conditions of external impact. Int. J. Adv. Trends Comput. Sci. Eng. 8(1), 7–13 (2019)
5. Zhuravska, I., Musiyenko, M.: Heterogeneous Computer Networks of Critical Application: Creation and Functioning of Networks Based on UAVs' Swarms and Flocks: Monograph, p. 367. LAMBERT Academic Publishing (2018). ISBN 978-613-9-86357-0
6. Stankevich, L.A.: Cognitive concepts and their use in technical intelligent systems, St. Petersburg (2017)
7. Kovalchuk, A.V.: The cognitive system as a system architecture. Center for Optical and Neural Technologies of the Scientific Research Institute for System Research of the Russian Academy of Sciences
8. Gaulin, S.J.C., McBurney, D.H.: Evolutionary Psychology, Chapter 4, pp. 81–101. Prentice Hall (2003). ISBN 978-0-13-111529-3
9. Djigan, V.I.: Recursive least squares: an idea whose time has come. In: Proceedings of the 7th International Workshop on Spectral Methods and Multirate Signal Processing (Moscow, 1–2 September 2007), Moscow, pp. 255–260 (2007)
10. Dorf, R.C., Bishop, R.H.: Modern Control Systems, p. 831. Pearson Education Inc. (2017). ISBN-13: 978-0134407623
11. Ukhina, H., Sytnikov, V., Streltsov, O., Stupen, P., Yakovlev, D.: Transfer function coefficients influence on the processing path bandpass frequency-dependent components' amplitude-frequency characteristics properties at the NPP TP ACS. In: Conference Proceedings of 2019 10th International Conference on Dependable Systems, Services and Technologies, DESSERT 2019, Leeds, United Kingdom, 5–7 June 2019, pp. 193–196. https://doi.org/10.1109/DESSERT.2019.8770050
12. Schuster, C., Wiens, A.: Performance analysis of reconfigurable bandpass filters with continuously tunable center frequency and bandwidth. IEEE Trans. Microw. Theory Tech. 65(11), 4572–4583 (2017)
13. Lutovac, M.D., Tosic, D.V., Evans, B.L.: Filter Design for Signal Processing using MATLAB and Mathematica. Prentice Hall, New Jersey, USA (2001). ISBN 0-201-36130-2
14. Mao, J., Choi, W., Tam, K., Che, W., Xue, Q.: Tunable bandpass filter design based on external quality factor tuning and multiple mode resonators for wideband applications. IEEE Trans. Microw. Theory Tech. 61, 2574–2584 (2013)
15. Rais-Zadeh, M., Fox, J., Wentzloff, D., Gianchandani, Y.: Reconfigurable radios: a possible solution to reduce entry costs in wireless phones. Proc. IEEE 103, 438–451 (2015)
16. Korn, G., Korn, T.: Mathematical Handbook for Scientists and Engineers, p. 832. McGraw Hill Book Company, New York (1974)
Run-Time Dependency Graph Models for Independently Developed Robotic Software Components
Vineet Nagrath and Christian Schlegel
Service Robotics Research Center, Technische Hochschule Ulm, 89075 Ulm, Germany
{Vineet.Nagrath,Christian.Schlegel}@thu.de
https://www.servicerobotik-ulm.de/
Abstract. Non-functional properties of a system must be parameterized as elemental or compound objects before they can be observed. Dependency objects are entities that encapsulate a distinct system characteristic that is relevant to and influenced by the system's components and connections. A graph can be drawn where edges represent changes in one or more system-wide properties, with software components and communication networks as nodes. Dependency graphs are networks of dependencies that emerge from the flow of system-level characteristics, with evidence and business logic controlling them. Functional units are provided by component developers and network experts to describe how the value of a dependency object is transferred across computational or connection nodes. While building or validating a system, run-time dependency graph models for independently developed software components and network connections are integrated, built, and explored. These run-time models are used to graphically and programmatically observe networks of dependencies that emerge from the flow of system-level features. This paper describes the design and engineering anatomy of such a toolchain.
Keywords: Model-Driven Engineering (MDE) · Component-based software development (CBSD) · Executable run-time models · System-level requirements · Non-functional properties (NFP) · Dependency graphs · Cause-Effect Chains · System integration and composition
1 Introduction
Robots in general, and service robots in particular, are software-intensive cyber-physical systems. In real-world open-ended scenarios, the complexity of these systems grows exponentially as the capabilities and scale of service robotics grow. A comprehensive service robotic ecosystem will need to establish tools and techniques to secure all stakeholders’ intellectual property and business logic, in addition to tackling technological problems connected to robot interoperability. In this light, robotic software developed by a single supplier must have a layer of abstraction that allows software components to interact while keeping their internal mechanics hidden. The physical and software products with which a robot’s capabilities are compatible determine its usability. To accomplish any degree of automation in component matchmaking during system composition, or for on-the-fly switching of components at run-time, precise, meta-model-based definitions of a component’s functional and non-functional capabilities are required.
This paper was supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732410, project RobMoSys [12], by the German Federal Ministry for Economic Affairs and Energy (BMWi) in the PAiCE programme under grant agreement 01MA17003D, project SeRoNet [18], and by the EFRE Programme Baden-Württemberg 2014-2020 and the MWK Baden-Württemberg (ZAFH Intralogistik) [24].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 892–909, 2022. https://doi.org/10.1007/978-3-031-10464-0_62
For complex, mission-critical, and rapidly evolving software systems, component-based software development (CBSD) is crucial, especially when numerous vendors are involved. CBSD attempts to increase software reuse by building software from reusable entities, namely software components [5]. A typical software development lifecycle consists of seven main steps (see Fig. 1) [2,3].
Fig. 1. Side by side comparison of software development lifecycle and component-based software development (CBSD)
A model of an entity under investigation is a collection of statements about its structure. Components in CBSD are linked with other components during system composition, so the interfaces between them must be designed to enable this. Component-to-component connections are commonly specified using blocks, ports, and connectors: a component is described as a software body with one or more ports that are connected to ports on other component blocks via connectors (see Fig. 2). Models that define a system are not constrained by the technology used to implement them. This allows software component manufacturers to agree on the models of their interfaces before any software is built.
As a result, models make it easier to separate responsibilities among people who play diverse roles, such as interface designers, component developers, and system builders. The separation of roles is a method of segregating duties among distinct jobs, and it is an important aspect of Model-Driven Engineering
(MDE). A model-driven approach manages this separation of roles, which makes system composition easier in CBSD. Notable MDE efforts include AADL [6], OMG MARTE [8], SysML [11] and AMALTHEA [1], which provide behaviour semantics for component interactions and execution. CBSD components are selected or generated based on the system’s requirements (see Fig. 1). Software requirements are usually divided into functional and non-functional categories [14,23]. A non-functional requirement is a set of performance-related tests applied to the functional capabilities of a software component or of a system made up of multiple components. After a system is built by adopting the most appropriate components from the repository, these requirements become functional and non-functional aspects of the system [17]. Figure 3 shows how system-level properties are divided into functional and non-functional attributes. System-level attributes can be changed by the system’s constituent components. Any change to a system-level property by a component triggers a chain of changes in that property among its peers, and thus globally across the system (cause-effect chains [7]). Robotic software vendors currently provide only limited information about how a component transfers system-level attributes across its ports. These attributes should be monitored during (i) system composition for debugging, (ii) system composition for component selection and for verifying that system requirements are met, and (iii) system composition for diagnosis and enforcement during run-time (i.e. ensuring compliance with business contracts). Checking all non-functional claims made by a software component and confidently composing a system is a time-consuming and experimentally non-verifiable task. Non-functional qualities must be monitored at the system level to ensure that they meet system-level criteria, a check that currently rests primarily on the subjective judgment of the system builder.
Enforcing the encapsulation of system-level attributes within the component’s functional body should be a critical next step in system composition policy. A software component should be able to inform its peers about changes to system-level attributes across the services it delivers.
Fig. 2. Blocks, ports and connectors representation for an interface between software entities.
Tracking system-level properties across components is a desired mechanism for verifying compliance with system-level requirements in the EU H2020 project RobMoSys: Composable Models and Software for Robotics [12]. RobMoSys aims to create an ecosystem for model-driven, component-based software development for robotic systems (RobMoSys Wiki [13]). Component developers and communication equipment manufacturers provide transfer functions written in a high-level programming language, distributed along with their software product (see Fig. 4 and Fig. 5). Run-time models for separately developed software components are collectively executed while building or validating a system. These run-time models are used to graphically and programmatically observe networks of dependencies that emerge from the flow of system-level features. In this paper, we discuss the design and engineering related to the run-time aspects of SmartDG, an addition to the RobMoSys-compliant [12] SmartMDSD toolchain [15,22] for modelling system-level dependencies between robotic software components.
Fig. 3. Functional and non-functional properties of a software system.
Fig. 4. SmartDG in RobMoSys ecosystem
2 The SmartDG Methodology
Fig. 5. Key SmartDG models, views and transformations.
Non-functional properties of a system must first be parameterized as elemental or compound objects before they can be observed. Because the properties of the system are altered by more than one component, the same objects encapsulate dependencies between different components. Such objects are referred to as dependency objects in the SmartDG world (see Fig. 6.1). Dependency objects are defined as entities that encapsulate a distinct system characteristic that is relevant to, and altered by, the components and connectors that compose the system. System components change one or more of these dependency objects as they perform their functions. These objects are altered in sequence by a system’s components as control and/or data flow through the system. Some of these objects may be altered as they pass through a particular network connection, which could be a complete system in itself (see Fig. 6.2). To comprehend how a component or a system modifies a dependency object, consider dependency objects as properties that are independent of the data that passes through the component or the network connection. If a component modifies a data entity passing through its ports, this is the component’s “function” in the traditional sense of “input-process-output”. On the other hand, a transmission error occurs when a “network connection” modifies a data element passing through it. Data items passing through a component or connection are not dependency objects in the traditional sense. Their instantaneous values change as they enter or leave a component or a connection (see Fig. 6.2). Boolean entities like “Privacy” or “Data Security” of data elements as they pass through a component or connection are simple examples of dependency objects. Setting the boolean dependency object “Privacy” to false for a data item “camera image” allows private data (such as camera images) to leave its functional boundary. 
Similarly, for all data items passing through a secure connection that is not vulnerable to external probing, the boolean dependency object “Data Security” would be set to true. Dependency objects can be complex data structures that capture a wider range of system-level properties than simple boolean or integer variables.
Examples of such applications include access rights propagation for a large number of users with varying security clearances; loop frequencies and processing/transmission times of individual components/connections as they affect global maximum processing speeds; probabilistic analysis or performance; and error propagation. In the SmartDG world, we define dependency graphs as networks of dependencies that emerge from the flow of system-level characteristics, controlled by evidence and business logic (see Fig. 6.4). Each dependency object has its own dependency graph, which depicts how that dependency object is changed throughout the system. Component developers provide dependency transfer functions for the components they supply (TFs and FTs, which will be explained later). These dependency graphs can be probed by component developers to ensure that the transfer functions they write are correct (see Fig. 6.3, Fig. 7.Bottom). Once delivered, the component and its corresponding transfer functions (both black-boxed) are plugged into the system builder’s view, along with other components and their dependency transfer functions (see Fig. 6.6). The system builder examines dependency graphs to ensure that dependency objects are propagated correctly throughout the system being built. In the system context, the system builder can also probe the transfer functions of individual components to find the most appropriate component (from the component repository) for a given role. A graph solver algorithm controls the flow of control among transfer functions (FBL solvers: forward business logic solvers; RBL solvers: reverse business logic solvers). While probing the dependency graphs, the system builder may have access to multiple solvers, each implementing a different solver algorithm (see Fig. 6.3). Because a component or connection can affect multiple dependency objects, these objects may have hidden interconnections (see Fig. 7.Bottom, Fig. 8.Top). 
An adjustment to one nodal value of dependency object “D” in the dependency graph can affect several nodal values in the dependency graph of object “D” as well as in the dependency graphs of other dependency objects. Thus, a dependency graph visualizer must show several dependency graphs simultaneously (dependency graph system views) as well as all component transfer function activity (dependency graph component views). Figure 7.Bottom shows various dependency graph system and component views that are influenced by the transfer functions and inverse transfer functions of constituent components, and by ‘real’ connectors.
Fig. 6. SmartDG run-time dependency models
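The propagation of one dependency object through a chain of nodes, as described above, can be sketched in C++. This is a minimal illustration under stated assumptions and not the SmartDG library API: the names `DependencyNode`, `solveForward` and `examplePrivacyChain`, and the behaviour assigned to each example node, are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical dependency node: a component or connection together with the
// forward transfer function (TF) supplied by its developer. The dependency
// object is a single boolean here, e.g. "Privacy".
struct DependencyNode {
    std::string name;
    std::function<bool(bool)> forward;  // output value as a function of input value
};

// Minimal forward business logic (FBL) solver for a linear chain of nodes:
// the value leaving one node is fed into the next one, in data-flow order.
// Returns the nodal value observed after every node.
std::vector<bool> solveForward(const std::vector<DependencyNode>& chain, bool input) {
    std::vector<bool> nodalValues;
    nodalValues.reserve(chain.size());
    bool current = input;
    for (const DependencyNode& node : chain) {
        assert(node.forward && "every node needs a transfer function");
        current = node.forward(current);
        nodalValues.push_back(current);
    }
    return nodalValues;
}

// Illustrative chain (assumed behaviour): a camera component that keeps the
// value, an unsecured connection that clears it, and a mapper component that
// passes on whatever it receives.
std::vector<DependencyNode> examplePrivacyChain() {
    return {
        {"Camera", [](bool in) { return in; }},
        {"UnsecuredLink", [](bool) { return false; }},
        {"Mapper", [](bool in) { return in; }},
    };
}
```

Calling `solveForward(examplePrivacyChain(), true)` yields one nodal value per node, so a probe can see at which node the value of the dependency object changes. A real dependency graph is not restricted to a linear chain; the chain is used here only to keep the sketch short.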
There are different types of dependency graph views: those operating in isolated bubbles (detached views) and those attached to others (attached views) (see Fig. 8). Several mechanisms exist for sharing dependency object data between attached dependency views. Sync (time-driven, user-driven, broadcast, or sync with specific views) is a data-sharing mechanism implemented in the SmartDG toolchain. Other mechanisms include pull/push from SmartDG active memory or system storage (download/update of local/global state) and solver sync (data sync driven by the dependency graph’s solver algorithm) (see Fig. 8.Bottom). There are two types of dependency graph probing: observing the flow of dependency object values in the direction of data and control flow (forward business), and observing the flow of dependency object values in the opposite direction (reverse business). Both forward and reverse business transfer functions are provided by the component developers (TF: transfer functions for forward business; FT: inverse transfer functions for reverse business). In forward business mode, the probe changes the inputs to ensure that the business logic is correct, whereas in reverse business mode, the probe changes the outputs to find the best business logic for achieving the desired results (see Fig. 6.5–6, Fig. 7.Bottom, Fig. 8.Top).
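The difference between forward and reverse probing can be illustrated with a numeric dependency object. The sketch below assumes a hypothetical "accumulated latency" object and a node type `LatencyNode` with a fixed processing time; neither is taken from the SmartDG toolchain, and the TF/FT pair shown is only one simple, exactly invertible choice.

```cpp
#include <cassert>
#include <vector>

// Hypothetical dependency object: accumulated latency in milliseconds.
using LatencyMs = double;

// Hypothetical node with a fixed processing (or transmission) time.
struct LatencyNode {
    LatencyMs processingMs;

    // TF (forward business): latency at the output given latency at the input.
    LatencyMs transfer(LatencyMs in) const { return in + processingMs; }

    // FT (reverse business): input latency that yields the desired output latency.
    LatencyMs inverseTransfer(LatencyMs out) const { return out - processingMs; }
};

// Forward probing: push an input value through the chain in data-flow order.
LatencyMs probeForward(const std::vector<LatencyNode>& chain, LatencyMs input) {
    LatencyMs value = input;
    for (const LatencyNode& node : chain) {
        assert(node.processingMs >= 0.0);
        value = node.transfer(value);
    }
    return value;
}

// Reverse probing: start from a desired output value and walk the chain
// against the data flow to obtain the input value that would produce it.
LatencyMs probeReverse(const std::vector<LatencyNode>& chain, LatencyMs desiredOutput) {
    LatencyMs value = desiredOutput;
    for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
        value = it->inverseTransfer(value);
    }
    return value;
}
```

For a chain with processing times of 5, 20 and 10 ms, forward probing from 0 ms gives 35 ms at the output, while reverse probing from a desired 50 ms at the output gives 15 ms as the latency that may already have accumulated at the input.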
Fig. 7. Modularity, distribution and co-evolution characteristics of SmartDG run-time model
Fig. 8. Transactions and connection between SmartDG run-time models.
Section 3 is a walk-through of the SmartDG toolchain and its use cases in light of the methodology discussed above. Because they are tightly connected to running dependency transfer functions (provided by various component vendors), the dependency graph system and component views in forward and reverse modes are run-time models. Section 4 discusses the design features of SmartDG run-time models with respect to various run-time aspects (connection, modularity, distribution and co-evolution).
3 SmartDG Toolchain in Action
SmartMDSD [15,22] and SmartDG are part of the RobMoSys ecosystem [12]. Figure 4 shows elements of the SmartDG toolchain in the context of the three-tier structure of the RobMoSys ecosystem. Tier-2 domain experts write dependency environment models, which are then shared among the ecosystem users (component developers and system builders). Ecosystem users can create their own custom tier-3 dependency environment models that refer to and combine elements from several tier-2 dependency environment models (see transformations leading to “DE” and “UE” in Fig. 5). Figure 9 shows a typical dependency environment model that declares dependency objects (“DGPrivacy” and “DGSecurity”) and connectors (e.g. “AF42”) for a domain (“Dependency Base Environment”). Component developers import a dependency domain/user environment into a component model, which is transformed into components (see transformations leading to “UC” in Fig. 5). Figure 10 shows the inclusion of dependency objects “DGPrivacy” and “DGSecurity” in a component model “TwoMapMaker” using either a graphical or textual model editor. System builders import dependency environments and components to build systems (see transformations leading to “US” in Fig. 5). Figure 11 shows a model for the system “SurveillanceSystem” that is composed of component instances (“Cam1”, “Cam2”, “Mapper”, “Huey”, “Dewey”, “Louie”) and has dependency connections (e.g. “RGBImage” port of “Cam1” component to “Image1” port of “Mapper” component) between components for dependency objects (“DGPrivacy” and “DGSecurity”). For probing, components and systems are transformed into their corresponding SmartDG executable models (see “DG” Executable models in Fig. 5).
In the current implementation of the SmartDG toolchain, SmartDG executable models are C++ projects that are imported along with SmartDG run-time enablers (the SmartDG C++ library and C++ implementations of solver algorithms) and dependency objects (from C++ classes) to form the run-time SmartDG dependency graph models (see SmartDG component and system views in Fig. 5). Component developers, and optionally network managers, provide transfer (and inverse transfer) functions (currently written in C++, see Fig. 12) along with the components they provide on the repository (see Fig. 6.3). Note that the transfer functions (see Fig. 12) are independent of the view in which they will be deployed. The only reference to them is found in the SmartMDSD component models, from which the SmartDG executable model (C++ code) and the SmartDG run-time model (a running C++ project) are derived (see Fig. 6.7). These functions are accessed and linked when a run-time model is built. The component manufacturer may choose to hide its implementation by directly providing a library that includes the transfer function (hiding its source code). The anatomy of a SmartDG run-time model while probing is shown in Fig. 6.6, where transfer functions, solvers and the SmartDG library are linked when a run-time model is built. Figure 13 shows SmartDG run-time models (component and system dependency graph views) for the system project “SurveillanceSystem”, whose model is shown in Fig. 11. The SmartMDSD toolchain is used to generate component and system models (see Fig. 13.a,b) that import SmartDG dependency objects (see Fig. 10, Fig. 11). Figure 13.c and f are component dependency graph views used by component developers and system builders. These views are editable by the user probing the view. Changes made to nodal values in one view are reflected across all attached views (see Fig. 8.Bottom, Fig. 14 and Fig. 15). 
Figure 13.e is another editable system dependency graph view attached to other dependency component views. An additional read-only attached view is available on a web interface for viewing over the network through a web browser (see Fig. 13.d). SmartDG increases the ecosystem’s overall composition quality by allowing system builders to track and verify system-level dependencies across components and networks created by various suppliers. It also provides documentation and scrutiny of assertions made by component developers as well as the system as a whole, enhancing the robotic software’s overall reliability. Tutorials, example projects and demo videos are available at [10]. SmartDG methodology, knowledge, and products are open-source and open-access and may be obtained on the Service Robotics Ulm website [16] and the SmartDG wiki [10]. A ready-to-use virtual machine is available [21] to try out the presented work. A later version of the SmartDG toolchain is planned to make the web-based view (see Fig. 13.d) editable and the principal SmartDG visualizer.
Fig. 9. SmartDG dependency base environment.
Fig. 10. SmartDG component “TwoMapMaker” with DependencyObjects.
Fig. 11. SmartDG system “SurveillanceSystem”.
Fig. 12. Transfer function example.
Fig. 13. SmartDG run-time models and views
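The idea that transfer functions are written independently of any view and are only linked when a run-time model is built can be sketched as a small registry. This is an assumption-laden illustration, not the generated SmartDG code: `TransferFunctionRegistry`, `buildExampleRegistry` and `privacyAfter` are hypothetical names, and the behaviour given to “Cam1” and “Mapper” is invented for the example. Only the instance names and the “DGPrivacy” object come from the system model described above.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical signature of a forward transfer function for a boolean
// dependency object such as "DGPrivacy".
using BoolTF = std::function<bool(bool)>;

// Hypothetical registry filled when a run-time model is built: each entry
// links a (component, dependency object) pair to the vendor's function,
// whose body may live in a pre-compiled library.
class TransferFunctionRegistry {
public:
    void add(const std::string& component, const std::string& object, BoolTF tf) {
        assert(tf && "transfer function must be callable");
        table_[std::make_pair(component, object)] = std::move(tf);
    }

    const BoolTF& lookup(const std::string& component, const std::string& object) const {
        auto it = table_.find(std::make_pair(component, object));
        if (it == table_.end()) {
            throw std::out_of_range("no transfer function for " + component + "/" + object);
        }
        return it->second;
    }

private:
    std::map<std::pair<std::string, std::string>, BoolTF> table_;
};

// Illustrative behaviour only: the camera emits raw images, so it clears
// "DGPrivacy"; the mapper passes the value through unchanged.
TransferFunctionRegistry buildExampleRegistry() {
    TransferFunctionRegistry reg;
    reg.add("Cam1", "DGPrivacy", [](bool) { return false; });
    reg.add("Mapper", "DGPrivacy", [](bool in) { return in; });
    return reg;
}

// Value of "DGPrivacy" after the connection Cam1.RGBImage -> Mapper.Image1:
// apply the source component's function, then the target component's.
bool privacyAfter(const TransferFunctionRegistry& reg, const std::string& from,
                  const std::string& to, bool input) {
    return reg.lookup(to, "DGPrivacy")(reg.lookup(from, "DGPrivacy")(input));
}
```

Because the registry stores callables rather than source code, a vendor could register a function whose implementation is only available in binary form, which matches the black-box delivery described in the text. A missing entry raises `std::out_of_range`, so an incomplete model is detected when it is probed.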
Fig. 14. Editing nodal values of dependency objects in dependency graph system view (run-time model).
Fig. 15. Editing nodal values of dependency objects in dependency graph component view (Run-time model).
4 SmartDG @ Run.time
Models@run.time [4,9] is an architecture pattern that supports dynamic adaptive systems by implementing models as operative artefacts. It provides an abstraction of the underlying running system, which makes reasoning, simulation, and enactment of adaptation actions easier. The model of the current system is automatically updated when the running system changes. Conversely, the running system is modified on demand according to this model. If there is a causal link between the model and the run-time model, the system can undergo continuous evolution with no strict boundary between design and run-time activity. In this section, we discuss causal connection, modularity, distribution, and co-evolution in the context of the SmartDG methodology and toolchain.
Fig. 16. SmartDG @ run.time.
4.1 Causal Connection
SmartDG methodology elements are shown in the context of Models@run.time layers in Fig. 16. An actor interacts (updates, validates, and tests) with a model of the running entity.¹ By causal connection, we mean that run-time models reflect the running system’s current state and behaviour as it runs. Models of SmartDG entities running during various activities are shared by many actors in the SmartDG world (see Fig. 16). Toolchain developers provide SmartDG enablers (SmartDG libraries and solvers), and domain experts provide dependency object definitions and run-time objects via the dependency environment model “DE”. The “DE” model is shared by developers with network designers and system builders, while run-time “dependency graphs” can be used to manipulate run-time entities “dependency objects”. Network designers and component developers use component dependency graphs to test components and configure networks, respectively. Building and testing transfer functions for components and real connections is part of this activity. After dependency nodes and their equivalent run-time transfer functions are built, system builders use component and system dependency graphs for component selection/validation as well as for system testing. In addition, system auditors use system dependency graphs to verify interdependencies. The models of running entities are all based on different SmartDG meta-models (see Fig. 16).
¹ Models@run.time uses the terms “running system” and “model of the running system” as conventions. We use the term “entity” instead of “system” because “systems” and “components” have different meanings in the SmartDG world.
Timing. Business logic solvers (see Fig. 6.3) handle the timing of dependency graph nodal updates. Several mechanisms control the data-sync timing between attached SmartDG views (see Fig. 8.Bottom).
Roll-Back Ability. Every nodal change is reversible if the node’s transfer function addresses this. In a well-written transfer function, user actions can be transferred in either the forward or the reverse direction of the business logic (see “User Action” directions in Fig. 8.Top).
Data Consistency. SmartDG views all have their own data (see “View Data” in Fig. 8.Top), but attached views selectively modify and/or are modified by other views (see Fig. 8.Bottom).
4.2 Modularity
Modularity is a natural by-product of CBSD in SmartDG. Dependency objects are a type of modular entity that encapsulates specific system characteristics (see Fig. 7.Top). The forward and reverse business logic for various dependency objects is encapsulated by transfer functions, which are distinct functional blocks associated with a particular dependency node. A functional piece of software is encapsulated by a software component. SmartMDSD’s service orientation via ports (communication objects [19] and patterns [20]) adds a degree of modularity. SmartDG enablers (SmartDG library and solvers) are modular functional blocks as well. Model and entity reuse can be seen in the form of systems made up of software components and dependency graphs made up of corresponding dependency nodes.
4.3 Distribution
Because component and system dependency graphs contain multiple interacting dependency nodes, SmartDG run-time models contain multiple interacting systems, each with its own run-time model (see Fig. 7.Bottom). Because component developers sell the same products to multiple system builders, the transfer functions are distributed with the components, making them self-contained functional units (see Fig. 16). SmartDG enablers are shared as separate functional entities instead of components and/or systems that use the SmartMDSD toolchain. SmartMDSD-based nodes can be used in combination with custom nodes to create a hybrid system.
4.4 Co-evolution
Co-evolution requires a systematic approach to synchronizing multiple, interacting Models@run.time systems. Different system/component dependency graph views can operate independently (detached views) or in conjunction with one or more other views (attached views) (see Fig. 8). Several mechanisms exist for sharing dependency object data among attached dependency views. Sync (time-driven, user-driven, broadcast, or sync with specific views), pull/push from SmartDG active memory or system storage (download/update of local/global state), and solver sync (data sync driven by the dependency graph’s solver algorithm) are some of the most common data sharing mechanisms implemented in the SmartDG toolchain (see Fig. 8.Bottom).
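A user-driven broadcast sync between attached views can be sketched as follows. The class `View` and its methods are hypothetical and cover only one of the mechanisms listed above; time-driven sync, pull/push from shared storage and solver sync are not modelled. The read-only flag mirrors the read-only web view mentioned in Sect. 3.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical dependency graph view: owns its own copy of nodal values
// ("View Data") and may broadcast user edits to the views attached to it.
class View {
public:
    explicit View(bool readOnly = false) : readOnly_(readOnly) {}

    // Attach another view so that user edits made here are broadcast to it.
    void attach(View& other) {
        assert(&other != this);
        attached_.push_back(&other);
    }

    // User edit of a nodal value followed by a broadcast sync.
    // Returns false (and changes nothing) for a read-only view.
    bool edit(const std::string& node, bool value) {
        if (readOnly_) return false;
        data_[node] = value;
        for (View* v : attached_) v->data_[node] = value;
        return true;
    }

    // Current nodal value; nodes that were never set read as false.
    bool get(const std::string& node) const {
        auto it = data_.find(node);
        return it != data_.end() && it->second;
    }

private:
    bool readOnly_;
    std::map<std::string, bool> data_;
    std::vector<View*> attached_;
};
```

In this sketch a detached view is simply a view that nobody attaches, so its data never changes when other views are edited. A read-only view still receives broadcasts from the views it is attached to but rejects local edits.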
5 Conclusions and Future Work
For sophisticated, mission-critical, and rapidly developing software systems, component-based software development is critical, especially when multiple vendors are involved. Software components and network connections are the nodes in dependency graphs, which are made up of various cause-and-effect chains of system-level attributes. Changes in system-level properties must be replicated across all of the services provided by a software component. Modelling how system-level features change as communication items travel across a network is critical. SmartDG provides a structural template for modelling system-level requirements, allowing for the creation of a robotic software system composition that can be empirically verified. Software products from component developers and communication equipment manufacturers include transfer functions defined in a high-level programming language. Run-time models for separately produced software components are collectively executed while building or validating a system. These run-time models are used to observe, graphically and programmatically, networks of dependencies that emerge from the flow of system-level features. SmartDG improves the ecosystem’s overall composition quality by allowing system builders to track and verify system-level dependencies across components and networks created by various vendors. It also provides documentation and scrutiny of claims made by component developers as well as the system as a whole, enhancing the robotic software’s overall trustworthiness. In this paper, we discussed the design and engineering of the run-time aspects of a toolchain that enables the above-mentioned workflow. We are currently working on making SmartDG support more complex dependency objects. Automated testing of dependencies against a test document is also planned for the RobMoSys robotic software ecosystem as part of a more comprehensive Test Suite Modeller.
The methodology, knowledge, and products developed by SmartDG are open-source and open-access and can be found on the Service Robotics Ulm website [16] and the SmartDG wiki [10].
References
1. Amalthea. Amalthea: an open platform project for embedded multicore systems (2013). http://www.amalthea-project.org. Accessed 15 Mar 2021
2. Benington, H.D.: Production of large computer programs. Ann. Hist. Comput. 5(4), 350–361 (1983)
3. United States, Navy Mathematical Computing Advisory Panel: Symposium on advanced programming methods for digital computers. Herbert D. Benington (1956)
4. Blair, G., Bencomo, N., France, R.B.: Models@ run.time. Computer 42(10), 22–27 (2009)
5. Crnkovic, I.: Component-based software engineering - new challenges in software development. In: 2003 Proceedings of the 25th International Conference on Information Technology Interfaces, ITI 2003, pp. 9–18 (2003)
6. Feiler, P., Gluch, D., Hudak, J.: The architecture analysis and design language (AADL): an introduction, p. 145 (February 2006)
7. Lotz, A., Hamann, A., Lütkebohle, I., Stampfer, D., Lutz, M., Schlegel, C.: Modeling non-functional application domain constraints for component-based robotics software systems. CoRR, abs/1601.02379 (2016)
8. MARTE: A UML profile for MARTE: modeling and analysis of real-time embedded systems (2011). http://www.omg.org/spec/MARTE/1.1/. Accessed 15 Mar 2021
9. Morin, B., Barais, O., Jezequel, J.-M., Fleurey, F., Solberg, A.: Models@ run.time to support dynamic adaptation. Computer 42(10), 44–51 (2009)
10. Nagrath, V.: SmartDG Wiki: dependency-graph extensions for SmartMDSD toolchain (2021). https://wiki.servicerobotik-ulm.de/tutorials:smartdg:start. Accessed 15 Mar 2021
11. OMGSysML. SysML: OMG systems modeling language (2012). https://www.omg.org/. Accessed 15 Mar 2021
12. RobMoSys. RobMoSys EU H2020 Project (2017–2020): Composable models and software for robotics systems - towards an EU digital industrial platform for robotics (2017–2020). http://robmosys.eu. Accessed 15 Mar 2021
13. RobMoSysWiki. RobMoSys EU H2020 Project Wiki (2017–2020). http://robmosys.eu/wiki. Accessed 15 Mar 2021
14. Roman: A taxonomy of current issues in requirements engineering. Computer 18(4), 14–23 (1985)
15. Schlegel, C.: Navigation and Execution for Mobile Robots in Dynamic Environments: An Integrated Approach. Ph.D. thesis, University of Ulm (2004)
16. Schlegel, C.: SMARTSOFT: components and toolchain for robotics (2011). http://www.servicerobotik-ulm.de/. Accessed 15 Mar 2021
17. Sentilles, S.: Managing extra-functional properties in component-based development of embedded systems, Ph.D. thesis (June 2012)
18. SeRoNet. SeRoNet: Eine Plattform zur arbeitsteiligen Entwicklung von Serviceroboter-Lösungen (2017–2021). https://www.seronet-projekt.de/. Accessed 15 Mar 2021
19. SRRC. SmartMDSD Communication Objects (2021). Accessed 30 Apr 2021
20. SRRC. SmartMDSD Communication Patterns (2021). Accessed 30 Apr 2021
21. SRRC: SmartMDSD tutorials and ready-to-go virtual appliances (2021). https://wiki.servicerobotik-ulm.de/tutorials:start. Accessed 18 Jun 2021
22. Stampfer, D., Lotz, A., Lutz, M., Schlegel, C.: The SmartMDSD toolchain: an integrated MDSD workflow and integrated development environment (IDE) for robotics software. J. Softw. Eng. Robot. (JOSER) 7, 1–19 (2016)
23. Yeh, R.T., Zave, P.: Specifying software requirements. Proc. IEEE 68(9), 1077–1085 (1980)
24. Zafh, I.: ZAFH Intralogistik (2014–2020): collaborative systems to make intralogistics more flexible (2014–2020). http://zafh-intralogistik.de/. Accessed 10 Jan 2022
A Raspberry Pi Computer Vision System for Self-driving Cars
Zach Isherwood and Emanuele Lindo Secco
School of Mathematics, Computer Science and Engineering, Liverpool Hope University, Liverpool L16 9JD, UK
{18004704,seccoe}@hope.ac.uk
Abstract. This paper presents a prototype of a self-driving vehicle that detects the lane it is currently in and aims to maintain a central position within that lane, without special sensors or devices, using only a low-cost camera and processing unit. The proposed system uses a hand-built detection system to observe the lane markings using computer vision and then uses the detected lines to calculate the trajectory to the center of the lane. After locating the center of the lane, the system provides the steering heading that the vehicle needs to maintain in order to continuously correct itself; this process is performed in real time with a sampling frequency of 20 Hz. Due to the increased number of calculations, the heading is smoothed to remove any anomalies in the observations made by the system. Since this system is a prototype, the processing power used in an actual vehicle for this application would be much higher, since the component budget would be larger; a higher processing speed would lead to an overall increased frame rate of the system. In addition, a higher frame rate would be required at higher vehicle speeds to allow for an accurate and smooth calculation of the heading. The prototype is fully operational within an urban environment where road markings are fully and clearly defined and road surfaces are well lit and smooth.
Keywords: Self-driving system · Raspberry Pi · Low-cost system
1 Introduction
1.1 Outline of the Field of Work
Driving a vehicle is a complex process because of the number of parameters that must be evaluated and reacted to each second. Examples include, but are not limited to, (i) the vehicle's speed compared to that of the vehicle ahead, (ii) the offset to the center of the road, (iii) lighting conditions, (iv) stopping distances, and (v) road conditions or road occupancy.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 910–924, 2022. https://doi.org/10.1007/978-3-031-10464-0_63
The number of licensed vehicles entering the road each year keeps rising, as do population and town sizes. The result is busier roads and more room for human error, arising partly from a lack of observational awareness, device distraction, lack of concentration, or other road users' poor driving standards; drivers therefore need to remain more aware of other road users, their actions or indeed inactions. Hence, car manufacturers are seeking assistive features for some of these processes, which can account for lapses that drivers may incur due to this increased congestion. Tesla is arguably the leading car manufacturer working towards autonomous vehicles; in the UK alone, 90,900 Tesla vehicles were recorded on the road as of November 2020, about 336% of the 27,000 vehicles recorded in 2019 [1]. The current Tesla Autopilot system can control the car with limited assistance from the driver. The car can maintain central road positioning, avoid collisions with obstacles in its path, and adhere to traffic laws such as road markings and signs. The system uses 8 surrounding cameras to provide a full 360° of vision at a distance of up to 250 m; there are also 12 ultrasonic sensors around the vehicle and a forward-facing radar for low-visibility conditions such as fog or heavy rain [2]. Although Tesla models are capable of completing an entire journey with little to no intervention from the driver, Autopilot is still marketed as an assistive feature that aids with the most burdensome parts of driving, and the car requires active supervision while the feature is enabled.
Fig. 1. Outline of the proposed system
1.2 Ethics of the System
False or poor marketing, or more particularly a poor user understanding of the implemented safety systems, can lead to an inflated belief in fully autonomous travel. These beliefs can be, and have been, the root cause of avoidable accidents in which the driver was not fully attentive or engaged or, in a recent extreme circumstance, not even in the driver's seat [3].
Sensationalist media coverage of these incidents can often misrepresent the assistive features that these vehicles offer. Most commonly it overestimates their functions and their ability to navigate without human aid, and seeks to lay blame for these incidents on the vehicle or devices themselves rather than on the lack of required interaction or social responsibility that the owner or occupier of the vehicle should have shown at the time of the incident, e.g. not being in the driving seat at the time of a crash. The fact that, as humans, we are already willing to adapt or surrender our control of vehicles based on the existence of such devices and capabilities is an ongoing conundrum. We like our control, we like to be able to drive where, when and how we like, but conversely, we are happy to relinquish this control when it suits us. Based on the content of sci-fi films we can see such elements becoming a reality and, ultimately, we are willing to accept these changes. The prototype system detailed in this paper is not designed to drive the whole journey but instead to reduce fatigue on long and relatively unchanging parts of the journey, such as a motorway or a long consistent road. When combined with other safety devices on the vehicle, it can allow light-touch driving rather than full hands-on steering and control. As with other manufacturers' systems, the driver should always be conscious and aware of their surroundings even when the system has control. The advance of such systems in vehicles leads to the ethical dilemma of moral decision-making being handed to autonomous systems or Artificial Intelligence rather than remaining in the hands of the fully functioning and morally capable individual. These dilemmas are at the heart of today's developments and will continue to be so, as both current and future consumers will not seek to take the lesser of two safe options where choices are available.
In the end, such systems should be used correctly and fairly: users should always follow the guidelines set by the manufacturer without fail. This point should be reiterated to vehicle users at the point of sale and in all literature thereafter, so that the user does not become too reliant on the system and start to make poor decisions through lack of attention. This paper does not seek to address these dilemmas, but rather to note that one step forward leads inexorably to the next, and that the path will contain a large number of obstacles and resistance to change. It does seek to demonstrate the effective use of computer vision for a practical, repetitive everyday task, leading into work towards self-driving vehicles capable of adhering to local speed limits, maintaining road position, even in changing road conditions, and avoiding collisions with fixed or moveable obstacles such as road furniture, pedestrians, cyclists or similar (Fig. 1).
Fig. 2. The Raspberry Pi 4 Board, the Vitada Webcam and the PiCar-S
2 Materials and Equipment
2.1 Hardware
Embedded system - The main controller of the system is a Raspberry Pi model 4, which is a small, lightweight and inexpensive computer. This embedded system handles all of the computation, from controlling the car itself to detecting and tracking the lane. The installed operating system is Raspbian, a free Linux version built solely for the Raspberry Pi line (Fig. 2, top left panel - [4]).
Camera - The visual input of the system is provided by a Vitade 928A Pro USB Computer WebCam; this device was selected over a standard Raspberry Pi camera because of its more rigid structure and its automatic light correction (Fig. 2, top right panel). The camera has an 80° wide-angle lens and allows video acquisition at a resolution of 1080p and a frame rate of 30 fps.
Sunfounder PiCar-S - The car used for the prototype model of the system is a PiCar-S from Sunfounder (Fig. 2, bottom panel). This PiCar is designed to run from a Raspberry Pi using a PiCar library which is provided online. The car comes with the correct drivers and wires to run the motors safely and to control accurately the front servo for the steering heading. Three different front modules are also included: an ultrasonic sensor and two types of infrared sensors for line following; however, these additional modules are not used in our design.
Power supply - The PiCar system requires two 18650 Lithium-ion rechargeable batteries to operate with sufficient power. These batteries are often used for high-drain applications such as this project; they plug into the HAT attached to the Pi to power the whole system, including the Pi and the USB camera. Each battery supplies 3.7 V and 4800 mAh.
2.2 Software
Python - The programming language used to control the system is Python, a high-level and highly versatile language that supports many libraries and several programming paradigms. The language's design encourages logical and easily readable code, which helps not only during development but also in the debugging and maintenance phases of the system. The version used in the system is Python 3.8, the most up-to-date version compatible with the OpenCV library.
OpenCV - OpenCV is an open-source computer vision library that provides many programming functions for real-time computer vision and claims to have more than 2500 optimized algorithms. Typical uses of the library include object and face recognition, object tracking and simple to complex image manipulation [5]. In this work, OpenCV is used to capture a frame from the camera and pass that frame through the image manipulation pipeline to prepare it for line detection [6]. The pipeline is as follows: (i) load the frame, (ii) convert the frame from BGR to RGB, (iii) grayscale the frame, (iv) blur the frame and (v) apply edge detection. In addition, library functions such as (vi) Hough lines and (vii) Canny edge detection are used to find the road's edges.
VNC Connect - Virtual Network Computing (VNC) Connect is remote desktop software that enables the user to control another device or computer over an internet connection, as long as that device also has VNC Connect installed and allows remote access. The device being connected to has VNC Server installed, and the controlling device has VNC Viewer installed. VNC sends inputs such as mouse movements and keystrokes over the internet to the other device; the display is sent back, so the user can see in near real time what is happening on the device. Here it is used to control the Pi without a keyboard, mouse and monitor constantly plugged in, which allows the project to be tested and improved continuously without interference.
PyCharm IDE - Before the system was deployed to the Raspberry Pi, it was programmed using the PyCharm IDE. PyCharm was developed specifically for the Python programming language and aims to improve the user's productivity and usability. Functions, classes, loops and conditional statements can all be collapsed to allow a user to traverse a larger program with less hassle.
3 Methods and Results
3.1 Basic Requirements
Within a lane assist system, two fundamental processes need to occur. The first is perceiving and tracking the lane lines; the second uses these lines to calculate the wheel heading that steers the vehicle in the correct direction to remain within the lane. Ultimately, the system needs to work on a live video feed from the USB webcam. Throughout development, however, the system can be tested on a single image frame, since live video is just an iterative loop of loading image frames.
Fig. 3. Implementation of the ‘loading an image frame’ and of the ‘Gaussian Blur and Canny Edge Detection’ (left and right panels, respectively)
3.2 Implementation – Image Processing
Loading an Image Frame - The first step of the image processing is to load the image frame into a variable; in this case, the sample image is test.jpg, and the frame is loaded and stored in the variable frame using the cv2.imread() function (Fig. 3). The frame can then be converted into a single color channel using the cv2.cvtColor() function; the loaded frame is passed into the function along with the code for the color conversion. The color ordering in OpenCV is Blue, Green, Red (BGR) instead of the standard RGB; therefore, the conversion required is from BGR to grayscale, using cv2.COLOR_BGR2GRAY. The two frames are then displayed on the screen using cv2.imshow().
Gaussian Blur and Canny Edge Detection - Once the image frame has been loaded and converted into grayscale, the next step is to detect any edges in the image. Before this can be done, the frame needs to be blurred slightly to remove noisy parts of the image that may interfere with finding the lines (Fig. 3). The frame is blurred using the following function:
cv2.GaussianBlur(src, ksize, σx, σy, borderType)
where
src - Image source/frame to be used.
ksize - Kernel size; the standard is a 5 × 5 matrix.
σx - Standard deviation of the kernel along the horizontal axis.
σy - Standard deviation of the kernel along the vertical axis.
borderType - Specifies how the boundaries of the image are treated while the kernel is applied.
Gaussian Blur - The Gaussian blur function checks each pixel in the image and compares it to the pixels in the surrounding predefined box (kernel). A weight is then
applied to each pixel within this kernel; more weight is given to the pixel in the center of the kernel than to the pixels further away from the center [7]. The weighted pixels within the kernel are then added together and averaged, and the central pixel is replaced with this average value. This process is repeated for every pixel in the image. In the simplest, unweighted case, the 5 × 5 averaging kernel is

\[
K = \frac{1}{25}
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & 1
\end{bmatrix}
\]

The kernel size can be chosen freely; however, its width and height should be odd, such as (5,5) or (7,7), so that the kernel has a central pixel. The σ parameter controls the spread of the weights around the center of the kernel: larger values of σ spread the weight more widely, whereas smaller values concentrate it near the center. If only σx is specified, σy is taken as equal to σx. If both are given as zero, they are calculated from the kernel size [5]. Once the frame has been blurred, it is stored in a new variable called blurred_frame that can be used to detect the edges of the lines within the frame.
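To make the difference between uniform averaging and Gaussian weighting concrete, the following is a minimal standard-library sketch, not the paper's code. The function names and the choice σ = 1.0 are illustrative assumptions; OpenCV derives its own kernel internally.

```python
import math


def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalized size x size Gaussian kernel as a list of rows.

    sigma=1.0 is an illustrative assumption, not a value from the paper.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a central pixel")
    half = size // 2
    weights = [
        [math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
         for dx in range(-half, half + 1)]
        for dy in range(-half, half + 1)
    ]
    total = sum(sum(row) for row in weights)
    # Normalize so that all weights sum to 1 (overall brightness is preserved).
    return [[w / total for w in row] for row in weights]


def box_kernel(size=5):
    """Uniform averaging kernel: every entry equals 1 / size**2."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


gauss = gaussian_kernel(5, sigma=1.0)
box = box_kernel(5)
```

In the Gaussian kernel the central weight is the largest and the corners the smallest, whereas every entry of the uniform kernel is 1/25.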
Fig. 4. Completed edge detected frame
Canny Edge Detection - Once the frame has been blurred, the next step is to detect the edges of any lines using cv2.Canny(). The stages of the command can be summarized as follows [8]:
1. Find the intensity gradient of the frame - Identify the parts of the frame with the most substantial intensity gradients (using a Sobel kernel).
2. Non-maximum suppression - Thin out edges by removing pixels that might not be part of the edge.
3. Hysteresis thresholding
• Accept pixels as edges if the intensity gradient value exceeds an upper threshold (maxVal).
• Reject pixels as edges if the intensity gradient value is below a lower threshold (minVal).
• If a pixel lies between the two thresholds, accept it only if it is adjacent to a pixel that is above the upper threshold.
The Canny function takes the following arguments:
cv2.Canny(src, threshold1, threshold2)
where
src - The frame to input into the Canny function.
threshold1 - The minimum value (minVal).
threshold2 - The maximum value (maxVal).
These two thresholds are used in the hysteresis process. Figure 4 shows the result of this image processing.
Fig. 5. Region of Interest (RoI)
Region of Interest (ROI) - The Canny frame is a step closer to detecting lines in the frame; however, the frame is full of lines that the system would find redundant and noisy. The best way to remove this excess noise is to mask out a region of interest; only lines within this region will be detected [9].
Defining an effective region requires the camera to be calibrated correctly, since the region of interest is not dynamic; the region is defined before the program runs. Calibrating the camera incorrectly or defining the region of interest too small can leave the system unable to find the lane lines. On the other hand, making the region too large can cause a noisy estimation of the heading; therefore, this region needs to be fine-tuned for each setup it is used in. We use a function which declares a polygon named mask_area using a NumPy array; anything outside this polygon is disregarded as noise, since it is not within the vehicle's path and could cause complications such as detecting other lanes' lines, which would in turn return an incorrect steering heading. Therefore, it holds:
• mask = np.zeros_like(frame) creates a NumPy array full of zeros, with the same dimensions as the original frame.
• cv2.fillPoly(mask, mask_area, 255) fills the region of the mask defined by mask_area with the value 255; the rest of the array is left as zeros.
Figure 5 shows the result of this further step of the image processing. The next step is to keep only the masked area in the masked image frame, which is done using masked_image = cv2.bitwise_and(frame, mask). The bitwise AND operator keeps only those pixels where an edge is present in the frame and the mask is non-zero.
Fig. 6. Adding lines to the original frame
3.3 Implementation – Line Definitions
Hough Lines Transform - The Hough line transform is an extraction method used to detect straight lines in an image frame, which works perfectly with the previously
created isolated region from the Canny frame. OpenCV offers two versions of the Hough line transform: the standard cv2.HoughLines() and cv2.HoughLinesP(), the Probabilistic Hough Line Transform. The proposed system uses cv2.HoughLinesP() because it finds the end points of each line, which gives a much more accurate reading for calculating the correction to the vehicle's current heading. The values minLineLength and maxLineGap need to be adjusted depending on the camera resolution, the distance to the road and the size of the region defined. If the values are too large, no lines will be found; if they are too small, too many lines will be detected, including many false ones. This function does not display the lines on the image; instead, it finds the coordinates of every line found and stores them in a NumPy array called lines.
Draw Lines - The lines found by the previous function are passed on, along with an empty frame, so that the lane lines can be calculated and drawn onto the empty frame. The returned frame can later be overlaid onto the original image [10].
Drawing Lane Lines - The first step is to allocate each line found in the array to either the left or the right lane. This can be done using a nested for loop that separates the (x1, y1) and (x2, y2) points of each line in the array; these points are then passed through the np.polyfit() function to get the gradient and y-intercept of the line. If the gradient of the line is negative, the line belongs to the left lane; otherwise, it belongs to the right lane. After all the lines in the frame have been allocated to a lane, the next step is to check that both lane lines were found, as both lanes are present in the current test frame. To do this, the system checks that there are coordinates stored in the left_gradient and right_gradient arrays.
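The allocation of segments to the left or right lane can be sketched as below. This is a standard-library illustration rather than the paper's code: the gradient and intercept are computed directly from the two end points, which matches a degree-1 np.polyfit() fit through two points. The names split_lanes and average_line, and averaging each lane's fits, are assumptions.

```python
def split_lanes(lines):
    """Split (x1, y1, x2, y2) segments into left and right lanes by gradient sign.

    Each entry mimics one row of the cv2.HoughLinesP() output, flattened to four
    numbers. Returns two lists of (gradient, intercept) pairs.
    """
    left, right = [], []
    for x1, y1, x2, y2 in lines:
        if x1 == x2:
            continue  # vertical segment: the gradient is undefined
        m = (y2 - y1) / (x2 - x1)
        c = y1 - m * x1
        # Image y grows downwards, so the left lane line has a negative gradient.
        (left if m < 0 else right).append((m, c))
    return left, right


def average_line(fits):
    """Average several (gradient, intercept) fits; None if the lane was not found."""
    if not fits:
        return None
    n = len(fits)
    return (sum(m for m, _ in fits) / n, sum(c for _, c in fits) / n)


# Hypothetical segments in a 160 x 120 frame.
segments = [(20, 110, 70, 60), (25, 115, 75, 62), (140, 110, 95, 60)]
left, right = split_lanes(segments)
left_lane = average_line(left)
right_lane = average_line(right)
```

A result of None from average_line signals that one lane is missing, which is the case handled separately when only one lane is detected.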
The maximum and minimum values on the y-axis are already known, as they are defined by the region of interest. All that is required to find the end points of each line is to find the x1 and x2 values, which can be done by rearranging the equation y = mx + c into x = (y - c)/m. Once these points have been calculated, they can be drawn onto the image using the cv2.line() function, which takes the frame to draw on followed by the arguments (x1, y1), (x2, y2), color, thickness.
Add lines to the original frame - Once the lines have been drawn onto the empty frame, they can be overlaid onto the original frame using cv2.addWeighted(). The weighted frame overlays the two frames with the opacity of the original frame reduced slightly, so that the yellow lane line drawn on the image shows through the white of the actual lane line (Fig. 6).
Heading Line - Adding a trajectory line is a fairly simple process once the lane lines have been detected. The lower point of the line always starts from the center of the frame, assuming the camera is calibrated to the center of the vehicle. The upper point of the line is the average of the right and left lane lines, as this is the center of the lane (regardless of where the vehicle is positioned within the lane). This line has no purpose other than to give the user a visual representation of the desired vehicle path [11].
Calculating Heading - The heading function reads in four parameters. The first is line_frame, the frame on which the heading text will be displayed. upper_left_width and upper_right_width are the x2 values of the left and right lanes. Finally, min_height is the y-coordinate of the highest point recorded for the lines. The front wheels of the PiCar are controlled by a servo with a range of motion from 0° to 180°,
therefore when the line is straight, the servo control should receive 90°; a left turn should be between 0° and 89°, whereas a right turn should be between 91° and 180°. The steering angle is calculated using trigonometry from the offsets along both the x-axis and the y-axis.
Detecting Only One Lane - Until this point, the system has only been able to calculate headings and line trajectories if both lanes were detected. If only one lane was found, the system would stop giving a steering heading and break the lane tracking, which is not optimal and should be amended [12, 13]. The approach taken here rests on the theory that if only one lane is detected, the vehicle is on a curve or turn, as shown in Fig. 7, and the system should send a sharp steering heading to the wheels to straighten the vehicle along the current path. Once straightened, the system should successfully detect both lanes again.
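The paper does not give the exact trigonometric formula, so the following standard-library sketch is one plausible reading of the description above. It is built on assumptions: the lane center is the mean of the two upper x values, the angle is measured from the bottom center of the frame, and the single-lane extremities are 45° and 135°. The function names and the frame size are hypothetical.

```python
import math


def x_at(y, m, c):
    """Solve y = m*x + c for x, i.e. x = (y - c) / m."""
    return (y - c) / m


def compute_heading(upper_left_width, upper_right_width, min_height,
                    frame_width, frame_height):
    """Servo heading in degrees: 90 is straight, below 90 left, above 90 right."""
    lane_center_x = (upper_left_width + upper_right_width) / 2.0
    x_offset = lane_center_x - frame_width / 2.0   # positive: lane center is to the right
    y_offset = frame_height - min_height           # vertical length of the heading line
    return 90.0 + math.degrees(math.atan2(x_offset, y_offset))


def single_lane_heading(left_found, right_found, left_limit=45.0, right_limit=135.0):
    """Assumed extremity headings: only the right lane visible -> steer left, and vice versa."""
    if right_found and not left_found:
        return left_limit
    if left_found and not right_found:
        return right_limit
    return None  # both or neither lane found: handled elsewhere


# Hypothetical 160 x 120 frame with the lane lines ending at y = 60.
straight = compute_heading(60, 100, 60, 160, 120)
right_turn = compute_heading(100, 140, 60, 160, 120)
left_turn = compute_heading(20, 60, 60, 160, 120)
```

With the lane center directly above the frame center the sketch yields exactly 90°, and the heading moves towards 0° or 180° as the lane center shifts left or right.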
Fig. 7. Left turn (i.e. right lane found) and right turn (i.e. left lane found) on the left and right panels, respectively.
3.4 Implementation – Steering and Self-driving
Using Webcam as Video Feed - In OpenCV, using a live video feed is the same as repeatedly reading a new image frame from the camera. To read each frame as it becomes available, a while loop is used, since there is no time limit on the loop's operation. With this method, however, the loop cannot be stopped without closing the entire script, which is why the cv2.waitKey command is included: on every iteration it checks whether the user has pressed the q key and, if so, the loop breaks. The frame can be processed in each cycle by calling the process_frame function on the current image; this starts the image processing pipeline outlined in the previous sections [14].
Controlling the PiCar - The PiCar system runs off the Raspberry Pi using two driver modules and a Pi HAT, which provides the system with enough power from the two 18650 rechargeable 3.7 V Lithium-ion batteries. The files were transferred to the Pi from the computer they were created on using VNC Viewer. The Raspberry Pi comes with VNC preinstalled, which allows remote connection without the prior use of a screen and keyboard. After installing all the correct packages
required for the system to work, running the code on the Pi gave the results reported in Fig. 8. Now that the code runs successfully on the Raspberry Pi, all that is left to do is to output the calculated heading to the servo motor and move the PiCar forward. During the PiCar setup, the picar library is installed on the Raspberry Pi to allow easy control of the vehicle. The front wheels are declared as fw using the front_wheels class from the library. As a test, the wheels are turned from 45° to 135° one step at a time and then back to 45°; this test showed that the servo would not update its position unless the previous command was followed by a delay of at least 0.05 s. Since the front wheels can turn to a selected heading, the system turns the servo to the calculated heading as soon as the heading calculation is completed in the code; the time delay is added to the while loop to ensure that the servo position changes each time the system updates.
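The structure of the control loop, including the 0.05 s servo delay, can be sketched without hardware by passing the camera, the processing pipeline and the servo in as callables. This is an assumed decomposition, not the paper's code. On the real system read_frame would wrap a cv2.VideoCapture read, should_stop would check cv2.waitKey for the q key, and turn_servo would call the front-wheel object from the picar library.

```python
import time


def drive_loop(read_frame, process_frame, turn_servo, should_stop, delay_s=0.05):
    """Read a frame, compute the heading, command the servo, wait, and repeat.

    All four callables are hypothetical hooks standing in for the camera, the
    image processing pipeline, the servo and the key check.
    Returns the number of completed cycles.
    """
    cycles = 0
    while not should_stop():
        frame = read_frame()
        if frame is None:        # camera returned nothing: stop the loop
            break
        heading = process_frame(frame)
        turn_servo(heading)
        time.sleep(delay_s)      # the servo needs this pause to register the update
        cycles += 1
    return cycles


# The delay alone caps the update rate at 1 / 0.05 s = 20 cycles per second.
max_rate_hz = 1.0 / 0.05

# Dry run with fake components (no camera, no servo, no delay).
fake_frames = iter(["frame1", "frame2", "frame3", None])
sent_headings = []
cycles = drive_loop(
    read_frame=lambda: next(fake_frames),
    process_frame=lambda frame: 90.0,
    turn_servo=sent_headings.append,
    should_stop=lambda: False,
    delay_s=0.0,
)
```

Because processing time adds to the delay in each cycle, 20 Hz is an upper bound on the update rate rather than a guaranteed value in this sketch.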
Fig. 8. Determination of the steering angles according to 4 different scenarios as captured by the camera and processed with the Raspberry Pi board
Steering Smoothing - The final addition needed for the system to work correctly is to smooth the heading value using previously calculated values. Without heading smoothing, the steering can be very violent, as any anomaly can create a drastic change in the calculated value. For example, the vehicle may be driving in a straight line and suddenly detect only one line for a few frames; this would cause the heading value to change from 90° to 135° (or 45°) without warning, which in a real-world vehicle would be a safety hazard. To combat this issue, the system only allows the angle to deviate from its current heading by a maximum of 5° if both lanes are detected and by only 1° if a single lane is detected. When the system detects both lanes, it is reasonably confident that the newly calculated angle is correct; therefore, the deviation from the current value can be higher, whereas if only one lane is detected, the steering angle will change to the
extremity point value based on which lane is found. To prevent this from being a drastic change to the current heading, the deviation is limited to 1° per cycle, which prevents the vehicle from overshooting the lane once the opposite line has been detected. If the new angle does not deviate from the current heading by more than the maximum assigned deviation, that heading is output to the servo without being smoothed. Otherwise, the smoothing method adds a limited angle change to the current heading. The required angle change is calculated by subtracting the current heading value from the new heading value: change = desired position - current position. The direction of the required change should be 1 or -1, which is obtained from: direction of change = actual_angle_change / abs(actual_angle_change), where the abs() function returns the magnitude of the number regardless of its sign. This direction value is then multiplied by the maximum deviation and added to the current heading; the result is the new heading that the vehicle will turn to in order to stay on track.
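The smoothing rule described above can be written as a short function. The sketch below follows the stated limits (5° with both lanes, 1° with one lane); the function name and the lanes_detected parameter are illustrative assumptions.

```python
def smooth_heading(current_heading, new_heading, lanes_detected):
    """Limit the per-cycle heading change: 5 degrees with two lanes, 1 degree with one."""
    max_deviation = 5 if lanes_detected == 2 else 1
    actual_angle_change = new_heading - current_heading
    if abs(actual_angle_change) <= max_deviation:
        return new_heading  # small change: output without smoothing
    # Direction is +1 or -1; the zero case was already returned above.
    direction = actual_angle_change / abs(actual_angle_change)
    return current_heading + direction * max_deviation


# Example: one lane is lost while driving straight and the target jumps to 135 degrees.
heading = 90
cycles_to_target = 0
while heading != 135:
    heading = smooth_heading(heading, 135, lanes_detected=1)
    cycles_to_target += 1
```

At 1° per cycle the servo needs 45 cycles to move from 90° to 135°, which at 0.05 s per cycle is about 2.25 s, so a few anomalous frames shift the heading by only a few degrees.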
4 Discussion
This work is a preliminary attempt to show that designing low-cost embedded systems for self-driving cars is feasible, although an analysis of how to scale the proposed model to a large-scale system should be considered in the future. A major drawback of the current system is that it would not work in an environment that lacks well-defined road markings or lane edges. Future iterations of the system could use a deep learning model to handle the steering inputs for lane control instead of the current hand-coded model; this, in turn, would allow for the implementation of object detection for road signs and obstacle avoidance. Detecting road signs could allow the vehicle to comply with speed limits and road traffic law automatically.
5 Conclusion
This work is only a preliminary analysis of a computer vision application in an industrial environment and real-world scenarios [15–17]; many improvements could be made to the system to make it more reliable, since the current system is capable of tracking straight lines, but curved lines are not accounted for. At present, the system either (i) detects both lane lines and calculates an offset heading to the right or left, or (ii) detects a single lane line and turns to the right or left extremity point. With the smoothing of the heading calculations, this is not a significant issue; however, it is still not practical, given that the tracking of lane lines on a curve can be inconsistent. The vehicle can still successfully remain within its lane despite this flaw, but currently this is limited to reasonably slow speeds, since the frame rate is capped at 20 frames per second by the 0.05 s delay per cycle. However, this is not a significant concern, as this limitation is expected to be overcome with a higher processing speed and improved equipment. Another benefit of a machine learning model would be the handling of curves, as these can be taught during the training stage of the system [18, 19]. The use of a machine model
would also allow the system to be used on urban roads and in environments where lane lines are less visible. However, a large dataset would be required to train such a model to identify all these situations successfully; in such a rapidly expanding industry, these issues could be solved quickly.
Acknowledgment. This work was presented in dissertation form in fulfilment of the requirements for the BEng in Robotics for the student Zach Isherwood at the School of Mathematics, Computer Science & Engineering, Liverpool Hope University.
References
1. Statista: Number of cars in the UK 2000–2016 - Statista (2021). https://www.statista.com/statistics/299972/average-age-of-cars-on-the-road-in-the-united-kingdom/. Accessed 23 Apr 2021
2. Tesla.com: Autopilot (2021). https://www.tesla.com/en_GB/autopilot. Accessed 27 Apr 2021
3. The Verge: Two people killed in fiery Tesla crash with no one driving (2021). https://www.theverge.com/2021/4/18/22390612/two-people-killed-fiery-tesla-crash-no-driver. Accessed 1 May 2021
4. The MagPi magazine: Raspberry Pi 4 specs and benchmarks - The MagPi magazine (2021). https://magpi.raspberrypi.org/articles/raspberry-pi-4-specs-benchmarks. Accessed 24 Apr 2021
5. Opencv24-python-tutorials.readthedocs.io: Smoothing Images - OpenCV-Python Tutorials beta documentation (2021). https://opencv24-python-tutorials.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_filtering/py_filtering.html. Accessed 25 Mar 2021
6. OpenCV: Canny Edge Detection in OpenCV (2021). https://docs.opencv.org/master/da/d22/tutorial_py_canny.html. Accessed 25 Mar 2021
7. Datacarpentry.org: Blurring images – Image Processing with Python (2021). https://datacarpentry.org/image-processing/06-blurring/. Accessed 26 Ma 2021
8. Sahir, S.: Canny Edge Detection Step by Step in Python - Computer Vision (2021). https://towardsdatascience.com/canny-edge-detection-step-by-step-in-python-computer-vision-b49c3a2d8123. Accessed 20 Apr 2021
9. Wang, Z.: Self Driving RC Car (2021). https://zhengludwig.wordpress.com/projects/self-driving-rc-car/. Accessed 5 Jan 2021
10. Medium: Tutorial: Build a lane detector (2021). https://towardsdatascience.com/tutorial-build-a-lane-detector-679fd8953132. Accessed 11 Feb 2021
11. Arduino Project Hub: Lane Following Robot using OpenCV (2021). https://create.arduino.cc/projecthub/Aasai/lane-following-robot-using-opencv-da3d45. Accessed 15 Feb 2021
12. Hassan, M.: self-driving-car-using-raspberry-pi (2021). https://www.murtazahassan.com/courses/self-driving-car-using-raspberry-pi/. Accessed 15 Feb 2021
13. Desegur, L.: A Lane Detection Approach for Self-Driving Vehicles (2021). https://medium.com/@ldesegur/a-lane-detection-approach-for-self-driving-vehicles-c5ae1679f7ee. Accessed 22 Mar 2021
14. Tian, D.: DeepPiCar - Part 4: Autonomous Lane Navigation via OpenCV (2021). https://towardsdatascience.com/deeppicar-part-4-lane-following-via-opencv-737dd9e47c96. Accessed 24 Mar 2021
15. Assets.publishing.service.gov.uk: Reported road casualties in Great Britain: 2019 annual report (2021). https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/922717/reported-road-casualties-annual-report-2019.pdf. Accessed 18 May 2021
16. Buckley, N., Sherrett, L., Secco, E.L.: A CNN sign language recognition system with single & double-handed gestures. In: IEEE Signature Conference on Computers, Software, and Applications (2021)
17. Sharma, H., Saraswat, M., Kumar, S., Bansal, J.C. (eds.): CIS 2020. LNDECT, vol. 61. Springer, Singapore (2021). https://doi.org/10.1007/978-981-33-4582-9
18. McHugh, D., Buckley, N., Secco, E.L.: A low-cost visual sensor for gesture recognition via AI CNNS. In: Intelligent Systems Conference (IntelliSys) 2020, Amsterdam, The Netherlands (2020)
19. Maereg, A.T., Lou, Y., Secco, E.L., King, R.: Hand gesture recognition based on near-infrared sensing wristband. In: Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020), pp. 110–117 (2020)
Correction to: Intelligent Computing
Kohei Arai
Correction to: K. Arai (Ed.): Intelligent Computing, LNNS 507, https://doi.org/10.1007/978-3-031-10464-0
For chapter 6
In the original version of the chapter, the following belated correction has been incorporated: The affiliation of the author “Tianpei Liao” has been changed from “Department of Electrical Engineering and Computer Science, Cambridge, USA” to “Department of Electrical Engineering and Computer Science, York University, Toronto, Canada” in Chapter 6. The correction/erratum chapter and the book have been updated with the change.

For chapter 17
In the original version of the chapter, the following belated corrections have been incorporated: The author name “Marcos Bautista L. Aznar” has been changed to “Marcos Bautista López Aznar” in the Frontmatter, Backmatter and in Chapter 17. The correction/erratum chapter and the book have been updated with the change.

The updated original versions of these chapters can be found at https://doi.org/10.1007/978-3-031-10464-0_6 and https://doi.org/10.1007/978-3-031-10464-0_17

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, p. C1, 2022. https://doi.org/10.1007/978-3-031-10464-0_64
Author Index
A
Abomhara, Mohamed, 627
AbouGrad, Hisham, 404
Afanasyev, Ivan, 882
Alamleh, Dalia, 359
Alférez, Germán H., 531
Allauzen, Alexandre, 760
Almaliki, Malik, 278
Alzahrani, Amani, 214
Amira, A., 436, 504
Angarano, Simone, 814
Antonov, Anatoliy, 747
Ardila, Ana María Martínez, 531
Arias, Xavier, 845
Atlam, Elsayed, 278
Ayad, Mustafa, 868

B
Baabdullah, Tahani, 214
Babu-Saheer, Lakshmi, 452
Balarezo, Sandro, 845
Beknazaryan, Aleksandr, 445
Bharwani, Nashwin, 548
Bioco, João, 378
Boating, Kwabena, 868
Bosquez, Saulo, 531
Brenner, Paul, 685
Burrows, Holly, 452
Byadarhaly, Kiran, 727
Byerly, Adam, 88

C
C. S. Feitosa, Ingrid Saiala, 322
Cacho, Jorge Ramón Fonseca, 672
Canovas, Fernando, 378
Castaldi, Paolo, 392
Ceccaroni, Marta, 484
Chatterjee, Niladri, 579
Chen, Yu, 520
Cheon, Manri, 340
Chiaberge, Marcello, 814
Címbora Acosta, Guillermo, 257
Clausen, Benjamin L., 531

D
Dandona, Sangeet, 548
Dessureault, Jean-Sébastien, 230
Diveev, Askhat, 294
Durdevic, Petar, 855

E
Eibensteiner, Florian, 306
Esmaeilzadeh, Armin, 672
Espín, Kevin, 845
Essafi, Hassane, 760

F
Faghihi, Usef, 190
Fantin, Giovanni, 814
Farsoni, Saverio, 392
Fazendeiro, Paulo, 378

G
Gadea, Walter Federico, 257
Gandini, Dario, 814
Garbagna, Lorenzo, 452
Garcia-Reyero, Natàlia, 420
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): SAI 2022, LNNS 507, pp. 925–927, 2022. https://doi.org/10.1007/978-3-031-10464-0
H
Hajiali, Mahdi, 672
Han, Sangjun, 59
Happonen, Ari, 29, 158
Haridevan, Amal, 79
Hur, Taeil, 59
Hur, Youngmi, 59
Hurley, Ted, 129

I
Isherwood, Zach, 910
Iyer, Vasanth, 203

J
Jagadale, Samiksha, 613
Jaleel, M., 436
Johnson, Neil F., 564
Joshi, Raviraj, 613

K
Kalantarpour, Cyrus, 190
Kalganova, Tatiana, 88
Kambar, Mina Esmail Zadeh Nojoo, 672
Kartikeya, Arnav, 353
Kashyap, Harish, 727
Kim, Yejin, 340
Kölbl, Max, 774
Kucukler, O. F., 504
Kushner, Warren, 548
Kyogoku, Yuki, 774

L
López Aznar, Marcos Bautista, 257
La Salandra, Marco, 117
Ladkat, Arnav, 613
Lafiosca, Pasquale, 484
Langer, Josef, 306
Lazo, Macrina P., 656
Le, Phuong Dong Tan, 1
Lee, Junwoo, 340
Li, Hui, 520
Li, Jianan, 520
Li, Jinfen, 787
Liao, Tianpei, 79
Lieberherr, Karl, 45
Liu, Chunmei, 214
Liu, Houjun, 605
Liu, Yibo, 79
Luong, Tho Chi, 594
Lupu, Yonatan, 564

M
Maktab-Dar-Oghaz, Mahdi, 106
Malekmohamadi, H., 436, 504
Martínez Calomardo, Ángeles, 9
Massicotte, Daniel, 230
Mazzia, Vittorio, 814
Mehmood, Asif, 203
Menghini, Massimiliano, 392
Mensah, Humphrey, 800
Mincheva, Zheni, 747
Miniello, Giorgia, 117
Miyajiwala, Aamir, 613
Mu, Feng, 520

N
Nagrath, Vineet, 892
Nguyen, Dong Quan Ngoc, 1
Nguyen, Nha Nam Ngoc, 1
Nieto-Chaupis, Huber, 247
Nikolov, Ventsislav, 747
Nor, Norma Mohamad, 718

O
Oghaz, Mahdi Maktab Dar, 469
Ortiz-Arroyo, Daniel, 855
Ott, Richard, 88

P
Passi, Kalpdrum, 701
Peng, Peiran, 520
Petz, Phillip, 306
Philipp, J. Nathanael, 774
Piat, Guilhem, 760
Prata, Paula, 378

R
Rahman, Noorihan Abdul, 718
Rakhymzhan, Tomiris, 106
Ramos, Christine Diane, 656
Rawat, Danda B., 214
Restrepo, Nicholas Johnson, 564
Ribeiro Carpinetti, Luiz Cesar, 322
Richter, Michael, 774
Ruvinsky, Alicia, 420

S
Saheer, Lakshmi Babu, 106, 469
Saki, Amir, 190
Salter, R. Cody, 420
Salvetti, Francesco, 814
Sayyah, Zachary, 605
Schlegel, Christian, 892
Schreiber, Ben, 548
Seale, Maria, 420
Sear, Richard, 564
Secco, Emanuele Lindo, 910
Semmar, Nasredine, 760
Shaikh, Sarang, 627
Shan, Jinjun, 79
Shen, Zhe, 825
Sikdar, Abhinava, 579
Simani, Silvio, 392
Simard, Jonathan, 230
Song, Liqiang, 520
Soundarajan, Sucheta, 787
Sticha, Abigail, 685
Strelsov, Oleg, 882
Stupen, Pavel, 882
Sytnikov, Valery, 882

T
Taghva, Kazem, 672
Tourille, Julien, 760
Tran, Oanh Thi, 594
Tsuchiya, Takeshi, 825

U
Usmani, Usman Ahmad, 29, 158

V
Vaghasia, Shreya, 701
Vasilev, Nikola, 747
Vino, Gioacchino, 117
W
Warwick, Jon, 404
Watada, Junzo, 29, 158
Wdowiak, Eryk, 739
Wohlrab, Stefan, 306
Wu, Qiyi, 787

X
Xiao, Lu, 787, 800
Xu, Ruiyang, 45
Xu, Tingfa, 520

Y
Yaakub, Mohd Ridzwan, 718
Yan, Peilin, 520
Yayilgan, Sule Yildirim, 627
Yousef, Tariq, 774
Yousefi, Mahsa, 9

Z
Zarrin, Javad, 106, 452, 469
Zhang, Waley, 868
Zoto, Erjon, 627
Zukarnain, Zuriani Ahmad, 718