English Pages 899 [901] Year 2021
Lecture Notes in Networks and Systems 296
Kohei Arai Editor
Intelligent Systems and Applications Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 3
Lecture Notes in Networks and Systems Volume 296
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation (DCA), School of Electrical and Computer Engineering (FEEC), University of Campinas (UNICAMP), São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/15179
Editor Kohei Arai Faculty of Science and Engineering Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-030-82198-2 ISBN 978-3-030-82199-9 (eBook) https://doi.org/10.1007/978-3-030-82199-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
We are very pleased to introduce the Proceedings of Intelligent Systems Conference (IntelliSys) 2021 which was held on September 2 and 3, 2021. The entire world was affected by COVID-19 and our conference was not an exception. To provide a safe conference environment, IntelliSys 2021, which was planned to be held in Amsterdam, Netherlands, was changed to be held fully online. The Intelligent Systems Conference is a prestigious annual conference on areas of intelligent systems and artificial intelligence and their applications to the real world. This conference not only presented the state-of-the-art methods and valuable experience, but also provided the audience with a vision of further development in the fields. One of the meaningful and valuable dimensions of this conference is the way it brings together researchers, scientists, academics, and engineers in the field from different countries. The aim was to further increase the body of knowledge in this specific area by providing a forum to exchange ideas and discuss results, and to build international links. The Program Committee of IntelliSys 2021 represented 25 countries, and authors from 50+ countries submitted a total of 496 papers. This certainly attests to the widespread, international importance of the theme of the conference. Each paper was reviewed on the basis of originality, novelty, and rigorousness. After the reviews, 195 were accepted for presentation, out of which 180 (including 7 posters) papers are finally being published in the proceedings. These papers provide good examples of current research on relevant topics, covering deep learning, data mining, data processing, human–computer interactions, natural language processing, expert systems, robotics, ambient intelligence to name a few. 
The conference would truly not function without the contributions and support received from authors, participants, keynote speakers, program committee members, session chairs, organizing committee members, steering committee members, and others in their various roles. Their valuable support, suggestions, dedicated commitment, and hard work have made IntelliSys 2021 successful. We warmly thank and greatly appreciate the contributions, and we kindly invite all to continue to contribute to future IntelliSys.
We believe this event will certainly help further disseminate new ideas and inspire more international collaborations. Kind Regards, Kohei Arai
Contents
LexDivPara: A Measure of Paraphrase Quality with Integrated Sentential Lexical Complexity
  Thanh Thieu, Ha Do, Thanh Duong, Shi Pu, Sathyanarayanan Aakur, and Saad Khan
The Potential of Machine Learning Algorithms for Sentiment Classification of Students’ Feedback on MOOC
  Maryam Edalati, Ali Shariq Imran, Zenun Kastrati, and Sher Muhammad Daudpota
Towards an Automated Language Acquisition System for Grounded Agency
  James R. Kubricht, Sharon Small, Ting Liu, and Peter H. Tu
Text-Based Speaker Identification for Video Game Dialogues
  Dušan Radisavljević, Bojan Batalo, Rafal Rzepka, and Kenji Araki
Automatic Monitoring and Analysis of Brands Using Data Extracted from Twitter in Romanian
  Lucian Istrati and Alexandra Ciobotaru
Natural Language Processing in the Support of Business Organization Management
  Leszek Ziora
Discovering Influence of Yelp Reviews Using Hawkes Point Processes
  Yichen Jiang and Michael Porter
AIRM: A New AI Recruiting Model for the Saudi Arabia Labor Market
  Monirah Ali Aleisa, Natalia Beloff, and Martin White
Chat-XAI: A New Chatbot to Explain Artificial Intelligence
  Mingkun Gao, Xiaotong Liu, Anbang Xu, and Rama Akkiraju
Global Postal Automation
  Aimee Vachon, Leslie Ordonez, and Jorge Ramón Fonseca Cacho
Automated Corpus Annotation for Cybersecurity Named Entity Recognition with Small Keyword Dictionary
  Kazuaki Kashihara, Harshdeep Singh Sandhu, and Jana Shakarian
Text Classification Using Neural Network Language Model (NNLM) and BERT: An Empirical Comparison
  Armin Esmaeilzadeh and Kazem Taghva
Past, Present, and Future of Swarm Robotics
  Ahmad Reza Cheraghi, Sahdia Shahzad, and Kalman Graffi
Flow Empirical Mode Decomposition
  Dário Pedro, R. T. Rato, J. P. Matos-Carvalho, José Manuel Fonseca, and André Mora
Cost-Effective 4DoF Manipulator for General Applications
  Sandro A. Magalhães, António Paulo Moreira, Filipe Neves dos Santos, Jorge Dias, and Luis Santos
Design of a Granular Jamming Universal Gripper
  Anglet Chiramal Jacob and Emanuele Lindo Secco
Benchmarking Virtual Reinforcement Learning Algorithms to Balance a Real Inverted Pendulum
  Dylan Bates and Hien Tran
Configuring Waypoints and Patterns for Autonomous Arduino Robot with GPS and Bluetooth Using an Android App
  Gary H. Liao
Small Scale Mobile Robot Auto-parking Using Deep Learning, Image Processing, and Kinematics-Based Target Prediction
  Mingxin Li and Liya Grace Ni
Local-Minimum-Free Artificial Potential Field Method for Obstacle Avoidance
  Abbes Tahri and Lakhdar Guenfaf
Digital Transformation of Public Service Delivery Processes in a Smart City
  Pavel Sitnikov, Evgeniya Dodonova, Evgeniy Dokov, Anton Ivaschenko, and Ivan Efanov
Prediction of Homicides in Urban Centers: A Machine Learning Approach
  José Ribeiro, Lair Meneses, Denis Costa, Wando Miranda, and Ronnie Alves
Experimental Design of Artificial Neural-Network Solutions for Traffic Sign Recognition
  Dylan Cox, Arkadiusz Biel, and Faisal Hoque
Potholes Detection Using Deep Learning and Area Estimation Using Image Processing
  Subash Kharel and Khaled R. Ahmed
Exploiting Deep Learning Algorithm to Understand Buildings’ Façade Characteristics
  Luca Rampini, Ania Khodabakhshian, and Fulvio Re Cecconi
An On-Device Deep Learning Framework to Encourage the Recycling of Waste
  Oluwatobi Ekundayo, Lisa Murphy, Pramod Pathak, and Paul Stynes
Spatial Modelling and Microstructural Modulation of Porous Pavement Materials for Seepage Control in Smart Cities
  Zhexu Xi
Deep Learning-Based Vehicle Direction Detection
  Nashwan J. Sebi, Kazuyuki Kobayashi, and Ka C. Cheok
EMG Controlled Electric Wheelchair
  Jacob Vigliotta, Joshua Cipleu, Alexander Mikell, and R. Alba-Flores
A Low-Cost Human-Robot Interface for the Motion Planning of Robotic Hands
  Alice Miriam Howard and Emanuele Lindo Secco
Ensemble UNet++ for Locating the Exponential Growth Virus Samples
  Yahe Yu, Ruian Ke, and Hien Tran
ViewClassifier: Visual Analytics on Performance Analysis for Imbalanced Fatal Accident Data
  Gulsum Alicioglu and Bo Sun
Comparative Analysis of Machine Learning Algorithms Using COVID-19 Chest X-ray Images and Dataset
  Abraham Kumah and Osama Abuomar
A Critical Evaluation of Machine Learning and Deep Learning Techniques for COVID-19 Prediction
  Kainat Khero and Muhammad Usman
Automatic Estimation of Fluid Volume Intake
  Eman A. Hassan and Ahmed A. Morsy
A Deep Learning-Based Tool for Automatic Brain Extraction from Functional Magnetic Resonance Images of Rodents
  Sidney Pontes-Filho, Annelene Gulden Dahl, Stefano Nichele, and Gustavo Borges Moreno e Mello
Classification of Computed Tomography Images with Pleural Effusion Disease Using Convolutional Neural Networks
  David Benavente, Gustavo Gatica, and Ivan Derpich
Creating a Robot Brain with Object Recognition Using Vocal Control, Text-to-Speech Support and a Simple Webcam
  Andrei Burta, Roland Szabo, and Aurel Gontean
Detection of Health-Preserving Behavior Among VK.com Users Based on the Analysis of Graphic, Text and Numerical Data
  Dmitry Stepanov, Alexander Smirnov, Egor Ivanov, Ivan Smirnov, Maksim Stankevich, and Maria Danina
WearMask in COVID-19: Identification of Wearing Facemask Based on Using CNN Model and Pre-trained CNN Models
  Abrar Hussain, Golriz Hosseinimanesh, Samaneh Naeimabadi, Nayem Al Kayed, and Romana Alam
Predicting Falls in Older Adults Aged 65 and up Based on Fall Risk Dataset
  Lisa Her, Jinzhu Gao, Lewis E. Jacobson, Jonathan M. Saxe, Kathy L. Leslie, and Courtney Jensen
Detection of the Inflammatory Bowel Diseases via Machine Learning Methods
  Elliot Kim, Valentina L. Kouznetsova, and Igor F. Tsigelny
An Attention-Based Deep Learning Model with Interpretable Patch-Weight Sharing for Diagnosing Cervical Dysplasia
  Jinyeong Chae, Ying Zhang, Roger Zimmermann, Dongho Kim, and Jihie Kim
Towards a Computational Framework for Automated Discovery and Modeling of Biological Rhythms from Wearable Data Streams
  Runze Yan and Afsaneh Doryab
Towards a Novel Architecture of Smart Campuses Based on Spatial Data Infrastructure and Distributed Ontology
  Viet Nguyen Hoang, Phieu Le Thanh, Linh Ong Thi My, Loc Cu Vinh, and Viet Truong Xuan
Course Recommendation System for a Flexible Curriculum Based on Attribute Selection and Regression
  Christian Sánchez-Sánchez and Carlos R. Jaimez-González
Advancing Adaptive Learning via Artificial Intelligence
  Kamal Kakish, Cindy Robertson, and Lissa Pollacia
Technology of Creation and Functioning of a Multimedia Educational Portal for Distance Learning of School Children in the Republic of Kazakhstan Under Pandemic Conditions
  Askar Boranbayev, Seilkhan Boranbayev, Malik Baimukhamedov, Baurzhan Sagidolda, and Askar Nurbekov
Detecting CAN Bus Intrusion by Applying Machine Learning Method to Graph Based Features
  Rafi Ud Daula Refat, Abdulrahman Abu Elkhail, Azeem Hafeez, and Hafiz Malik
Information Security Awareness Evaluation Framework and Exploratory Study
  Evgeniya Nikolova and Veselina Jecheva
Multilayer Security for Facial Authentication to Secure Text Files
  Jeffrey Tagoc, Ariel M. Sison, and Ruji P. Medina
sBiLSAN: Stacked Bidirectional Self-attention LSTM Network for Anomaly Detection and Diagnosis from System Logs
  Chenyu You, Qiwen Wang, and Chao Sun
A Multi-agent System with Smart Agents for Resource Allocation Problem Solving
  Alexey Goryashchenko
Search Methods in Motion Planning for Mobile Robots
  Laura Paulino, Correy Hannum, Aparna S. Varde, and Christopher J. Conti
Application of Adversarial Domain Adaptation to Voice Activity Detection
  TaeSoo Kim and Jong Hwan Ko
More Than Just an Auxiliary Loss: Anti-spoofing Backbone Training via Adversarial Pseudo-depth Generation
  Chang Keun Paik, Naeun Ko, and Yongjoon Yoo
Emergence and Solidification-Fluidisation
  Bernhard Heiden and Bianca Tonino-Heiden
Prediction of Isoflavone Content in Soybeans with Sentinel-2 Optical Sensor Data by Means of Regressive Analysis
  Kohei Arai, Osamu Shigetomi, Hideki Ohtsubo, and Eri Ohya
A Neuro-Fuzzy Model for Fault Detection, Prediction and Analysis for a Petroleum Refinery
  Peter Omoarebun, David Sanders, Favour Ikwan, Malik Haddad, Giles Tewkesbury, and Mohamed Hassan
Impact of Interventional Policies Including Vaccine on COVID-19 Propagation and Socio-economic Factors: Predictive Model Enabling Simulations Using Machine Learning and Big Data
  Haonan Wu, Rajarshi Banerjee, Indhumathi Venkatachalam, and Praveen Chougale
Correction to: Automatic Monitoring and Analysis of Brands Using Data Extracted from Twitter in Romanian
  Lucian Istrati and Alexandra Ciobotaru
Author Index
LexDivPara: A Measure of Paraphrase Quality with Integrated Sentential Lexical Complexity

Thanh Thieu1(B), Ha Do2, Thanh Duong1, Shi Pu4, Sathyanarayanan Aakur1, and Saad Khan3

1 Oklahoma State University, Stillwater, OK 74078, USA [email protected]
2 University of Louisville, Louisville, KY 40292, USA
3 FineTune Learning, Boston, MA, USA
4 Educational Testing Service, Toronto, Ontario, Canada
http://languageandintelligence.cs.okstate.edu
Abstract. We present a novel method that automatically measures quality of sentential paraphrasing. Our method balances two conflicting criteria: semantic similarity and lexical diversity. Using a diverse annotated corpus, we built learning to rank models on edit distance, BLEU, ROUGE, and cosine similarity features. Extrinsic evaluation on STS Benchmark and ParaBank Evaluation datasets resulted in a model ensemble with moderate to high quality. We applied our method on both small benchmarking and large-scale datasets as resources for the community.
Keywords: Monolingual rewriting · Lexical diversity · Paraphrasing quality

1 Introduction
In linguistics, lexical complexity is a multidimensional measure encompassing lexical diversity, lexical density, and lexical sophistication [13,16,22]. Modern natural language processing (NLP) adopted a bag-of-features approach on lexical complexity for paraphrase simplification. The general strategy is to perform complex word identification (CWI) [1,14,17,25] and then substitute the identified words with simpler ones. Four categories of features used in CWI were: (1) word-level features such as word length, syllable counts, (2) morphological features such as part-of-speech, suffix length, noun gender, (3) semantic features derived from WordNet or cosine similarity between word embedding vectors, and (4) corpus-based features such as word frequencies, n-gram frequencies, or topic distribution in some reference corpora. These strategies measured single word and short phrase complexity, thus rendering them unsuitable for measuring complexity of complete sentences.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 1–10, 2022. https://doi.org/10.1007/978-3-030-82199-9_1
In sentential monolingual rewriting, most modern NLP methods focused on semantic similarity between a reference sentence and its paraphrases [5,9,24]. Recent work sought to improve the lexical diversity of paraphrases by adding heuristic lexical constraints to the decoder [8,10]. However, the most highly ranked paraphrases produced by these works were almost lexically identical to the references. Thus, paraphrase generation became a trivial task unusable for practical purposes such as content generation in education, data augmentation in language modeling, question answering, and textual entailment. Table 1 shows examples of top ranking paraphrases from two human annotated datasets: STS Benchmark¹ and ParaBank Evaluation².

Table 1. Example top ranking reference (R)/paraphrase (P) pairs in semantic similarity by humans in STS Benchmark and ParaBank Evaluation datasets.

Dataset | Top examples
STS     | R: A man with a hard hat is dancing; P: A man wearing a hard hat is dancing
STS     | R: A man is feeding a mouse to a snake; P: The man is feeding a mouse to the snake
ParaB   | R: You weigh a million pounds; P: You weigh one million pounds
ParaB   | R: Ladies and gentlemen, young people; P: Ladies and gents, young people
In this study, we present a learnt quality measure of paraphrases that addresses the low lexical diversity issue in sentential paraphrasing. Our method not only aligns with semantic similarity, but also significantly enhances the difference in lexical use between a paraphrase and its reference. Such desideratum was referred to as quality or fluency of paraphrases [8]. We also adopted a bag-of-features approach but did not use the four feature categories of CWI since these features were developed for single-word complexity while our task aims to measure sentential complexity. We modeled paraphrase quality as a learning-to-rank problem on a controlled corpus generated by educational specialists and annotated by Amazon Mechanical Turk workers. We then used the trained model to re-rank paraphrases in STS Benchmark and ParaBank Evaluation datasets and showed that our model picked paraphrases with superior quality. Hereinafter, we detail data collection, feature engineering, and measure modeling in the Method section. Extrinsic evaluation is presented in the Results section, and contribution together with model characteristics are explained in the Discussion section. Lastly, we highlight the novelty and impact of this study in the Conclusion section.

¹ https://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
² https://github.com/decompositional-semantics-initiative/ParaBank-Eval-Data
2 Method
We define quality as the holistic fusion between semantic similarity and lexical variation of a paraphrase compared to its reference sentence.

2.1 Data Collection
Our dataset comprises 5 documents, totaling 92 English sentences, from ACT Test Preparation textbooks. Topics include sport, history, biological science, tourism, and geography. To generate paraphrases, we used Google Cloud Translation to translate each reference English sentence into 10 foreign languages, then back-translated these 10 foreign sentences into English. The 10 foreign languages were Japanese, Korean, Chinese, German, Spanish, Portuguese, Greek, Arabic, Slovenian, and Turkish. This process generated 10 paraphrases per reference, resulting in a dataset of 920 paraphrase/reference pairs.

We hired English speakers on Amazon Mechanical Turk to annotate quality scores for the paraphrases. We restricted annotators to those who had at least graduated from a U.S. high school and had excellent reviews from previous requesters. Given that the ACT documents were for a college entrance exam, the selected annotator population was qualified to perform our task. We adopted the EASL framework [23] to increase annotation efficiency by presenting all ten paraphrases (of the same reference sentence) simultaneously on one page so that annotators could compare between items while giving scores. We re-used the HTML template from EASL to generate task pages on Amazon MTurk. The annotators were asked to give a score in the range [0, 100] to each paraphrase/reference pair. Additionally, each pair was annotated by 10 different annotators. A score of 100 corresponds to same meaning and different wording, while the contrasting score of 0 is hypothetical and corresponds to different meaning and same wording. Thus, the score measures quality of the paraphrase. In total we obtained 9,200 paraphrase/reference rankings.

2.2 Feature Engineering

Table 2. Summary of semantic and lexical features

Type     | Category           | Metric
Semantic | Sentence embedding | Cosine similarity
Lexical  | Edit distance      | Constituent tree, word sequence, character sequence
Lexical  | BLEU               | 1-gram, 2-gram, 3-gram, 4-gram
Lexical  | ROUGE              | 1-gram, 2-gram, 3-gram, 4-gram, longest common subsequence (LCS), weighted LCS
To model the fusion between semantic similarity and lexical difference, we combined cosine similarity with edit distance and machine translation scores. In total, we used one semantic feature and thirteen lexical features. Table 2 gives a summary of the features. In the following description, we refer to two sentences as a reference/paraphrase pair.

Cosine Similarity: We invoked the universal sentence encoder [6] on deep averaging network (DAN) [12] to generate embeddings of the two sentences, then calculated cosine distance between the two embedding vectors to represent semantic similarity.

Tree Edit Distance: We invoked the Stanford CoreNLP toolkit [18] to parse constituent trees of the two sentences, then used the Zhang-Shasha algorithm [27] to compute the edit distance between the two trees. In addition, we normalized the distance by the total number of nodes in both trees. Tree edit distance represents the difference in grammatical structure between the two sentences.

Word and Character Edit Distances: We used the NLTK [3] implementation of Levenshtein edit-distance [19] with substitution cost set to 2 to compute the transformation cost between the two sequences of words (or characters) of the two sentences. We also normalized the cost by the total number of words (or characters) of both sentences. This normalization accounts for the substitution cost of 2, which represents the removal of one word (or character) from a sentence while adding another word (or character) into the other sentence. Sequence edit distances represent the ordinal difference in vocabulary use between the two sentences.

BLEU Scores: We used the NLTK [3] implementation of bilingual evaluation understudy [20] to compute modified precision of overlapping n-grams for individual orders of n-gram. BLEU scores represent the precision of paraphrase n-grams that match the reference sentence.

ROUGE Scores: We used PyPI’s py-rouge package [2] to compute recall-oriented understudy [15] of overlapping n-grams for individual orders of n-grams, longest common subsequence (LCS), and weighted LCS. ROUGE scores represent the recall of reference sentence’s subsequences that match the paraphrase.
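As a rough illustration of the word and character edit-distance features, the sketch below computes a Levenshtein distance with a substitution cost of 2 and divides it by the combined length of the two sequences. It is written in plain Python rather than calling NLTK; the function names `edit_distance` and `normalized_edit_distance` and the whitespace tokenization are assumptions made for this example, not the authors' implementation.

```python
def edit_distance(a, b, substitution_cost=2):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    # prev[j] holds the cost of turning the first i-1 items of a into the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]  # deleting the first i items of a
        for j, y in enumerate(b, 1):
            substitute = prev[j - 1] + (0 if x == y else substitution_cost)
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a, b, substitution_cost=2):
    """Edit distance divided by the total number of items in both sequences."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return edit_distance(a, b, substitution_cost) / total


# Hypothetical usage with a reference/paraphrase pair from Table 1.
reference = "A man with a hard hat is dancing"
paraphrase = "A man wearing a hard hat is dancing"

word_feature = normalized_edit_distance(reference.split(), paraphrase.split())
char_feature = normalized_edit_distance(list(reference), list(paraphrase))
print(word_feature, char_feature)
```

With a substitution cost of 2, one substitution costs the same as one deletion plus one insertion, so the normalized value stays in [0, 1]: it is 0 for identical sequences and 1 for sequences that share nothing.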
2.3 Learning to Rank Paraphrase Quality
We formulated paraphrase quality as a learning-to-rank problem in information retrieval. The reference sentence serves as a query; paraphrases serve as retrieved documents; and the paraphrase quality score serves as the relevance score. The learning-to-rank formulation optimizes relative orders of paraphrases; thus, it is robust to inconsistency in both score ranges and distances.

We utilized XGBoost gradient boosted trees [7] to train our ranking model. XGBoost was successfully used by multiple winners in machine learning challenges and Kaggle competitions. We parametrized XGBoost to use LambdaMART [4] to perform list-wise ranking with the mean average precision (MAP) objective function. Learning rate was set to 0.1; minimum loss reduction was set to 1.0; minimum sum of instance weights in a child was set to 0.1; maximum depth of a tree was set to 6; number of trees was set to 10. To evaluate model performance, we implemented five-fold cross-validation with 80% data for training and 20% data for validation, then used Scikit-learn’s implementation [21] of normalized discounted cumulative gain (NDCG) for evaluation. We experimented with some variations of the above parameter settings and found insignificant NDCG gain/loss.
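The settings above can be written down compactly, together with a small NDCG function for one query group. This is a sketch under stated assumptions: the dictionary maps the reported values onto XGBoost's parameter names as we read them (`rank:map`, `eta`, `gamma`, `min_child_weight`, `max_depth`), and the plain-Python `ndcg` uses a linear gain with a log2 discount in place of the Scikit-learn call. The names `XGB_PARAMS`, `NUM_TREES`, `dcg`, and `ndcg` are hypothetical.

```python
import math

# Values are taken from the text; the mapping to XGBoost parameter names is an assumption.
XGB_PARAMS = {
    "objective": "rank:map",   # LambdaMART ranking with a MAP objective
    "eta": 0.1,                # learning rate
    "gamma": 1.0,              # minimum loss reduction
    "min_child_weight": 0.1,   # minimum sum of instance weights in a child
    "max_depth": 6,            # maximum depth of a tree
}
NUM_TREES = 10                 # number of boosting rounds


def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(true_relevance, predicted_scores):
    """NDCG for one reference sentence (one query group of paraphrases)."""
    # Rank paraphrases by predicted score, highest first.
    order = sorted(range(len(predicted_scores)), key=lambda i: -predicted_scores[i])
    ranked = [true_relevance[i] for i in order]
    ideal_dcg = dcg(sorted(true_relevance, reverse=True))
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical example: three paraphrases with annotated relevance 3, 2, 1.
print(ndcg([3, 2, 1], [0.9, 0.8, 0.1]))  # model order agrees with the annotation
print(ndcg([3, 2, 1], [0.1, 0.8, 0.9]))  # model order is reversed
```

Because NDCG depends only on the order induced by the predicted scores, it fits the learning-to-rank formulation, where absolute score ranges are not expected to be comparable across annotators.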
2.4 Label Smoothing Regularization
We observed that annotators used different score ranges and scales. For example, ten paraphrases of a sentence “The Fulton fish market” were given scores [60, 61, 60, 77, 50, 50, 50, 60, 70, 60] by one annotator and scores [42, 45, 51, 57, 49, 55, 55, 53, 45, 51] by another annotator. Thus, the first annotator preferred scores in range [50, 80] while the second annotator preferred scores in range [40, 60]. Calculating Spearman’s rank correlation coefficient between annotation scores of paraphrases from the same reference sentence resulted in mean = 0.24 and standard deviation = 0.40. Agreement between annotators was low and widely spread. We hypothesize that quantification of paraphrase quality was affected by annotators’ individual bias. Since we pioneered the study of sentential paraphrase’s holistic quality in this work, we were not aware of any prior formal composition of paraphrase quality, nor a proven scale to reduce annotator bias.

To smooth annotator bias, we first experimented with z-score normalization in various scopes (e.g. per annotator, per reference sentence, and per paraphrase), but they all resulted in the same rank correlation coefficient. By trial and error, we discovered that the scores could be smoothed using their sorted indices. Specifically, we sorted the original scores and then substituted them by their indices in the sorted list. Thus, scores [80, 89, 60, 78, 76, 74, 63, 32, 72, 70] become [1, 0, 8, 2, 3, 4, 7, 9, 5, 6]. When ties occurred, the earlier item in the list was arbitrarily assigned a smaller index score. Hereinafter, we denote models using the smoothed labels as Index models, and models using the original annotator scores as Raw models. Spearman’s rank correlation coefficient on Index scores agreement reached a higher mean = 0.28 and a smaller standard deviation = 0.38 compared to Raw scores agreement.
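The index substitution can be sketched in a few lines of Python. The descending sort order is inferred from the worked example in the text (the highest score, 89, maps to index 0); the function name `index_smooth` is hypothetical.

```python
def index_smooth(scores):
    """Replace each score by its position in the list sorted from highest to lowest.

    Python's sort is stable, so when two scores tie, the item that appears
    earlier in the input receives the smaller index.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    smoothed = [0] * len(scores)
    for index, original_position in enumerate(order):
        smoothed[original_position] = index
    return smoothed


# The example from the text.
print(index_smooth([80, 89, 60, 78, 76, 74, 63, 32, 72, 70]))
# → [1, 0, 8, 2, 3, 4, 7, 9, 5, 6]
```

After this substitution every annotator's ten scores for a reference sentence become a permutation of 0 to 9, which removes differences in the range and scale that individual annotators used.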
2.5 Augmenting Semantic Similarity
Measuring semantic similarity based on sentence embedding is a challenging task [5,26], and we expected our semantic feature to be a weak one. To offset this weakness, we experimented with augmenting our paraphrase quality score (Q) with the human annotated semantic similarity score (S) in benchmarking datasets (STS Benchmark and ParaBank Evaluation). We experimented with linear combinations of Q and S and found that the best linear coefficient performed equally well as a harmonic mean F1 combination. We picked the balanced F1 as the combined score to simplify hyper-parameter tuning.

$$F_1(Q, S) = \frac{2 \times Q \times S}{Q + S}$$
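The combined score is a plain harmonic mean, which can be sketched as follows. The function name `f1_combined` and the zero-denominator guard are assumptions added for this example; the sketch also assumes Q and S are non-negative and expressed on the same scale.

```python
def f1_combined(q, s):
    """Harmonic mean of a quality score q and a semantic similarity score s."""
    total = q + s
    if total <= 0:
        return 0.0  # guard added for the example; not specified in the text
    return 2.0 * q * s / total


# Hypothetical values on a [0, 5] scale.
print(f1_combined(4.0, 4.0))  # equal inputs return the same value
print(f1_combined(5.0, 1.0))  # an imbalanced pair falls below the arithmetic mean of 3.0
print(f1_combined(5.0, 0.0))  # a zero in either score gives zero
```

The harmonic mean rewards pairs that score well on both criteria at once: a paraphrase with high lexical diversity but no semantic overlap, or the reverse, receives a low combined score.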
T. Thieu et al.
Table 3. Top-ranking examples from the STS Benchmark and ParaBank Evaluation datasets. Raw and Index (Idx) models were ranked by Q. Augmented models (prefix A-) were ranked by F1. S is the ground truth for semantic similarity. Q reflects lexical diversity. The STS score range is [0, 5]; the ParaBank score range is [0, 100].

Model  Reference (R) / Paraphrase (P) sentence pair                                               S     Q

STS Benchmark
Raw    R: A man is cutting a potato                                                               4.4   3.16
       P: A man is slicing some potato
ARaw   R: A man is playing the drums                                                              5.0   1.47
       P: A man plays the drum
Idx    R: A man plays an acoustic guitar                                                          0.0   4.51
       P: A woman and dog are walking together
AIdx   R: I realized there is already an accepted answer But I figure I would add my 2 cents      5.0   4.01
       P: I know this is an old question But I feel I should add my 2 cents
       R: You may have to experiment and find what you like                                       5.0   4.49
       P: You have to find out what works for you

ParaBank Evaluation
Raw    R: I’ve known Miguel since childhood                                                       87    36
       P: I knew Miguel from childhood
ARaw   R: You’re confusing humility, with humiliation                                             100   24
       P: I think you mistake humility with humiliation
Idx    R: I am at your service                                                                    20    90
       P: Dyce’s here to see you
AIdx   R: One doesn’t detect the tiniest trace of jealousy, does one?                             99    90
       P: I don’t hear a tiny undertone of jealousy in your voice?
       R: Let me check once again                                                                 100   87
       P: I’ll look again

3 Results
Our models were evaluated on two extrinsic benchmarking datasets: STS Benchmark and ParaBank Evaluation. In each dataset, we applied both Raw and Index models together with their augmented versions (Sect. 2.5). We then sorted the datasets in descending order of computed scores and compared top ranking paraphrases. In Table 3, each pair of reference/paraphrase sentences was accompanied by a semantic score S, and a lexical diversity score Q. We chose score Q of the Index model to represent lexical diversity because it best reflected the lexico-grammatical difference between the two sentences. Results showed that the Raw model preserved the reference meaning well but only performed moderately on lexical diversity. Paraphrases of the Raw
model repeated key phrases from the reference sentences. The Augmented Raw model gave higher semantic similarity, but at the cost of reduced lexical diversity. The Index model failed at preserving the reference meaning, but excelled at promoting lexical diversity. Paraphrases found by the vanilla Index model showed large lexico-grammatical differences, but also carried a meaning almost entirely different from that of the reference sentences. The Augmented Index model was the most interesting one, as it not only did well on semantic similarity but also showed high lexical diversity. Paraphrases found by the Augmented Index model expressed significant lexico-grammatical differences while preserving the original meaning of the reference sentences. In addition to Table 3, we made the full rankings of the STS Benchmark and ParaBank Evaluation datasets publicly available for community investigation.
To gain insight into model behavior, we analyzed the models’ feature importance based on the node split gain of the decision trees (Fig. 1). The Raw model prioritized cosine similarity for sentence meaning and ROUGE-1 for single-word lexical difference. This explains its tendency to keep an almost identical meaning and to pick paraphrases with few single-word differences. In contrast, the Index model prioritized tree edit distance for differences in grammatical structure and ROUGE-L for variation in long sub-phrases. The Index model was thus the opposite of the Raw model: its features favored lexico-grammatical difference at the expense of the reference meaning. When augmented with a strong semantic similarity signal S, the Augmented Raw model inclined even more toward preserving the reference meaning, while the Augmented Index model achieved a rare equilibrium that produced both significant lexico-grammatical difference and strong similarity to the reference meaning. In our study, Augmented Index was the highest-quality ranking model for monolingual paraphrasing.
Fig. 1. Feature importance of XGBoost ranking models. Suffixes: -S: similarity, -ED: edit distance.
We call our method LexDivPara, for lexical diversity in paraphrasing. Our experiments and evaluation suggest that the Augmented Index model should be used when a strong feature for semantic similarity is available. Otherwise, the Raw model should be used; it delivers a moderate-quality device for paraphrase ranking. In addition to scoring the paraphrase quality of the STS Benchmark and ParaBank Evaluation datasets, we have scored and sorted the ParaBank 2.0 dataset [11], which comprises 19 million reference sentences, and made it publicly available as a large-scale resource for researchers interested in training good-quality paraphrase generation models.
4 Discussion
Measuring the quality of monolingual paraphrasing is a challenging task, as it must balance two conflicting desiderata: semantic preservation and maximal lexical variation. In this work, we proposed a holistic quality score for paraphrases and factorized it into two opposing components: semantic similarity and lexical diversity. While semantic similarity has been studied well in the computational linguistics literature [5,9,24], sentential lexical diversity was mostly unexplored. A recent work close to ours is the development of ParaBank [8,10], which used heuristic lexical constraints to encourage diverse use of words in the decoding sequence. However, that work used only a single measure, BLEU without length penalty, to evaluate how different the paraphrases are from the reference sentences. Furthermore, its use of an evaluation dataset that contained many nearly identical paraphrase/reference pairs made the treatment of lexical diversity subjective and incomplete. To comprehensively assess lexical diversity at the sentential level, our work expanded the set of features to encompass multiple machine translation measures that reflect precision, recall, and edit distance statistics. Our contribution is threefold: (1) quantifying a lexical diversity measure between two sentences, (2) annotating a sentential paraphrase quality corpus, and (3) evaluating the learnt models on extrinsic datasets. Nevertheless, our work is restricted to a limited training dataset, and our top-performing model relies on the availability of high-quality semantic similarity scores. Interested researchers could overcome this limitation by either leveraging STS Benchmark and ParaBank Evaluation in semi-supervised learning, or utilizing multiple embedding methods to improve the semantic similarity representation. To expedite research on sentential paraphrasing, we release the scored and sorted, large-scale ParaBank 2.0 dataset, which contains millions of sentences.
5 Conclusion
We presented a novel, accurate method to measure quality of paraphrases. Our work incorporated lexical diversity at the sentential level, in contrast to existing work in computational linguistics that was constrained to single-word and phrasal levels. We established the first learnt measure for paraphrase quality
using supervised learning. Our machine learning models were free from heuristic rule construction and lexical choice guesswork. We expect this study to provide resources and methodology for the under-studied lexical diversity aspect of language generation. Potential future work includes investigating alternative high-quality semantic similarity scores, filtering high-quality bitext corpora for machine translation, and embedding the quality measure into end-to-end language generation models. Our source code, feature set, annotated data, and ranked datasets are freely available at: http://languageandintelligence.cs.okstate.edu/tools.
Acknowledgment. The authors would like to thank ACT for assisting with the collection of the original text and annotation on Amazon Mechanical Turk. This work is partly supported by the first author’s start-up fund, the first author’s OSU ASR FY22 summer program, NSF CISE/IIS 1838808 grant, and NSF OIA 1849213 grant.
References
1. Alfter, D., Volodina, E.: Towards single word lexical complexity prediction. In: Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, New Orleans, Louisiana, pp. 79–88. Association for Computational Linguistics (2018)
2. Antognini, D.: Py-rouge (2018)
3. Bird, S., Klein, E., Loper, E.: Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media, Inc., Newton (2009)
4. Burges, C.J.C., Svore, K.M., Wu, Q., Gao, J.: Ranking, boosting, and model adaptation (2008)
5. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: SemEval-2017 Task 1: semantic textual similarity multilingual and crosslingual focused evaluation, Vancouver, Canada, pp. 1–14. Association for Computational Linguistics (2017)
6. Cer, D., et al.: Universal sentence encoder for English. In: 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 169–174. Association for Computational Linguistics (2018)
7. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system (2016)
8. Hu, J.E., Rudinger, R., Post, M., Van Durme, B.: ParaBank: monolingual bitext generation and sentential paraphrasing via lexically-constrained neural machine translation. In: AAAI 2019, Honolulu, Hawaii (2019)
9. Ganitkevitch, J., Van Durme, B., Callison-Burch, C.: PPDB: the paraphrase database, pp. 758–764. Association for Computational Linguistics (2013)
10. Hu, J.E., et al.: Improved lexically constrained decoding for translation and monolingual rewriting. In: NAACL 2019, Minneapolis, Minnesota (2019)
11. Hu, J.E., Singh, A., Holzenberger, N., Post, M., Van Durme, B.: Large-scale, diverse, paraphrastic bitexts via sampling and clustering. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, pp. 44–54. Association for Computational Linguistics (2019)
12. Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol. 1 (Long Papers), Beijing, China, pp. 1681–1691. Association for Computational Linguistics (2015)
13. Johansson, V.: Lexical diversity and lexical density in speech and writing: a developmental perspective. Lund Work. Papers Linguist. 53, 61–79 (2009)
14. Kriz, R., Miltsakaki, E., Apidianaki, M., Callison-Burch, C.: Simplification using paraphrases and context-based lexical substitution. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (Long Papers), New Orleans, Louisiana, pp. 207–217. Association for Computational Linguistics (2018)
15. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. Association for Computational Linguistics (2004)
16. Lu, X.: The relationship of lexical richness to the quality of ESL learners’ oral narratives. Mod. Lang. J. 96(2), 190–208 (2012)
17. Maddela, M., Xu, W.: A word-complexity lexicon and a neural readability ranking model for lexical simplification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3749–3760. Association for Computational Linguistics (2018)
18. Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit, pp. 55–60. Association for Computational Linguistics (2014)
19. Miller, F.P., Vandome, A.F., McBrewster, J.: Levenshtein Distance: Information theory, Computer science, String (computer science), String metric, Damerau-Levenshtein distance, Spell checker, Hamming distance. Alpha Press, Orlando (2009)
20. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. Association for Computational Linguistics (2002)
21. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
22. Read, J.: Assessing Vocabulary. Cambridge University Press, Cambridge (2000)
23. Sakaguchi, K., Van Durme, B.: Efficient online scalar annotation with bounded support, Melbourne, Australia, pp. 208–218. Association for Computational Linguistics (2018)
24. Wieting, J., Gimpel, K.: ParaNMT-50M: pushing the limits of paraphrastic sentence embeddings with millions of machine translations, pp. 451–462. Association for Computational Linguistics (2018)
25. Wilkens, R., Vecchia, A.D., Boito, M.Z., Padró, M., Villavicencio, A.: Size does not matter. Frequency does. A study of features for measuring lexical complexity. In: Bazzan, A.L.C., Pichara, K. (eds.) IBERAMIA 2014. LNCS (LNAI), vol. 8864, pp. 129–140. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12027-0_11
26. Yang, Y., et al.: Multilingual universal sentence encoder for semantic retrieval, pp. 87–94 (2019)
27. Zhang, K., Shasha, D.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18(6), 1245–1262 (1989)
The Potential of Machine Learning Algorithms for Sentiment Classification of Students’ Feedback on MOOC
Maryam Edalati¹ (✉), Ali Shariq Imran¹, Zenun Kastrati², and Sher Muhammad Daudpota³
¹ Department of Computer Science (IDI), Norwegian University of Science and Technology (NTNU), 2815 Gjøvik, Norway
² Department of Informatics, Linnaeus University, 351 95 Växjö, Sweden
³ Department of Computer Science, Sukkur IBA University, Sukkur 65200, Pakistan
Abstract. The assessment of students’ feedback has become a hot topic in recent years, with growing e-learning platforms coupled with an ongoing pandemic outbreak. Many higher education institutes were compelled to shift on-campus physical classes to online mode, utilizing various online teaching tools and massive open online courses (MOOCs). For many institutes, including both teachers and students, conducting lectures and taking classes online was a unique and challenging experience. Therefore, analyzing students’ feedback at this crucial time is essential for effective teaching and for monitoring learning outcomes. Thus, in this paper, we propose and conduct a study to evaluate various machine learning models for aspect-based opinion mining to address this challenge effectively. The proposed approach is trained and validated on a large-scale dataset consisting of manually labeled students’ comments collected from the Coursera online platform. Various conventional machine learning algorithms, namely Random Forest (RF), Support Vector Machine (SVM), and Decision Tree (DT), along with deep-learning methods, are employed to identify teaching-related aspects and predict the opinions/attitudes of students towards those aspects. The obtained results are very promising, with F1 scores of 98.01% and 99.43% achieved by RF on the aspect identification and the aspect sentiment classification tasks, respectively.
Keywords: Aspect extraction · Aspect sentiment classification · Sentiment analysis · E-learning · Students’ feedback · Deep learning · Machine learning · 1D-CNN · BERT · MOOC
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 11–22, 2022. https://doi.org/10.1007/978-3-030-82199-9_2
Digital learning platforms became popular a few years ago with the launch of massive open online courses (MOOCs), interactive multimedia platforms [1], and hyper-interactive systems [3]. However, the importance of digital teaching and online learning has increased manifold due to the ongoing COVID-19 pandemic. As a result, many educational institutes worldwide shifted on-campus physical classes to online classes utilizing various e-learning platforms. MOOCs are one of the first e-learning platforms providing open-access online courses that allow for unlimited participation [14] and small private online courses (SPOCs) [15]. However, dropout rates between 85 and 95% [4,5] are a main drawback of such platforms and have increased the demand for analyzing students’ feedback. Many institutions felt the need to collect students’ feedback to uphold quality and ensure the successful delivery of content to students online via various platforms. MOOCs offer a great platform for collecting students’ feedback on a massive scale and for training and building models. Many higher education institutes and experts have had a strong interest in extracting aspects and their related sentiment from this feedback [6,7], and in using NLP techniques to create effective learning management systems and e-learning platforms [8].
Manual extraction of the aspects and their related sentiment is a time-consuming task due to the large volume of data. Therefore, developing a reliable automated method to extract aspects and their related sentiment is necessary [9]. Opinion mining (OM), or sentiment analysis (SA), is a suitable substitute for traditional feedback analysis: it extracts students’ opinions from the feedback and classifies them into the appropriate sentiment polarity. This study aims to utilize sentiment analysis techniques to evaluate students’ feedback collected from a MOOC platform in order to build, train, and test various conventional machine learning and deep learning models, which could be useful in predicting students’ sentiments towards a course.
For this purpose, we present a comparison between the three most commonly used conventional machine learning algorithms, which show the best results in the state of the art on sentiment analysis of students’ feedback, along with two automated machine learning models based on 1D-CNN and BERT embedding. The rest of the article is organized as follows. Section 2 presents the most recent related work. Section 3 presents the dataset along with the techniques and approaches used to conduct the aspect category identification and the aspect sentiment classification. Results and their analysis are provided in Sect. 4, followed by Sect. 5, which concludes the paper.
2 Related Work
In recent years, SA has been applied not only to students’ feedback but also to various other tasks, including examining the spreading pattern of information and tracking and understanding public reaction during a given crisis on social media [10,11]. SA is categorized into three levels: document level, sentence level, and entity or aspect level [12,13]. Document-level and sentence-level SA are based on the assumption that only one topic is expressed, while in many situations (such as students’ feedback), this is not the case
and a more precise analysis is required [6]. Aspect-level SA is divided into two steps: first, different aspects are extracted and classified into similar classes; then, the sentiment related to each aspect is determined [2,16,17]. Kastrati et al. [2] used a real-life dataset containing more than 21 thousand reviews from Coursera to evaluate their proposed models for aspect-based opinion mining. The authors used two representation techniques, term frequency (tf) and term frequency-inverse document frequency (tf*idf), and three pre-trained word embedding models (FastText, Word2Vec, GloVe). They first classified the comments into five aspects: Instructor, Content, Structure, Design, and General. Each of the samples within these aspects was then classified into one of the polarity categories (Positive, Negative, and Neutral). Four conventional machine learning classification algorithms, namely Decision Tree, Naïve Bayes, SVM, and Boosting, and a 1D-CNN model were used. Their results show that conventional machine learning techniques achieved better performance than 1D-CNN. In [9], the authors proposed a supervised aspect-based opinion mining system based on a two-layered LSTM model, in which the first layer predicts six categories of aspects (Teaching Pedagogy, Behavior, Knowledge, Assessment, Experience, and General) and the second layer predicts the polarity (positive, negative, and neutral) of the aspect. The authors in [7] took advantage of a weak supervision strategy to train a deep network to automatically identify the aspects present within MOOC reviews using either very few or even no manual annotations. In addition, the proposed framework examines the sentiment towards the aspects commented on in a given review. The study in [16] proposed a method for aspect-based sentiment analysis for the Serbian language at the sentence segment level. They used a dataset that contains both official faculty and online surveys.
The dataset was divided into seven aspect classes (professor, course, lectures, helpfulness, materials, organization, and other) and two polarity classes (positive, negative). The authors used tf*idf as a representation technique. For classification, they used three standard machine learning multi-class classification models (support vector machine, k-nearest neighbors (k-NN), and multinomial NB (MNB)), and a cascade classifier consisting of a set of SVM classifiers organized in a cascade structure. A two-step strategy based on machine learning and natural language processing (NLP) techniques to extract the aspect and polarity of the feedback is proposed in [17]. The study used 10,000 labeled students’ feedback items collected at Sukkur IBA University, Pakistan. The method is divided into three main steps. In the first step, the student feedback is classified into the teacher or course entity using the Naïve Bayes Multinomial classifier. Once the entity has been extracted, a rule-based system analyzes and extracts the aspects and opinion words from the text using predefined rules. In the final step, the authors used SentiWordNet to extract the sentiment regarding the extracted aspects. In [18], the authors presented a comparison between eight conventional machine learning methods (Bernoulli and Multinomial Naïve Bayes, k-nearest neighbors (KNN), Support Vector Machine, Linear Vector Machine, Decision Trees, Random Forest, B4MSA) and five different deep learning architectures (two CNN models with different layers, one LSTM model, one hybrid between a CNN
and an LSTM model, and a BERT model) with an evolutionary approach called EvoMSA for the classification of students’ feedback. EvoMSA is a multilingual sentiment classifier based on Genetic Programming. Their results show that the EvoMSA algorithm generated the best results among the classifiers. The authors in [19] experimented on 16,175 Vietnamese students’ feedback items to classify their sentiments (positive, negative, and neutral). They converted the dataset to English for polarity classification. In their proposed method, input sequences of sentences are processed in parallel across the multi-head attention layer with fine-grained embeddings (GloVe and CoVe). The model was tested with different dropout rates to achieve the best possible accuracy. The information from both deep multi-layers is fused and fed as input to the LSTM layer. They compared their proposed method with other baseline models (LSTM, LSTM + ATT, multi-head attention), and their proposed method achieved better results. In [20], the author presented a recurrent neural network (RNN) based model for polarity classification of students’ feedback. The proposed model was evaluated on a dataset containing 154,000 reviews collected from the ratemyprofessors.com website. The RNN is compared with conventional machine learning algorithms (Naïve Bayes, SVM, logistic regression, k-nearest neighbor, and random forest), ensemble learning methods, and deep learning architectures. Three conventional text representation schemes (term-presence, term-frequency (tf), and tf*idf) and four word-embedding schemes (word2vec, GloVe, fastText, and LDA2Vec) were taken into consideration. The results indicated that the RNN with GloVe word embeddings gave the best results, with an accuracy of 98.29%.
3 Experimental Settings
In this section, we describe the dataset along with the classification models used to conduct the experiments, including conventional machine learning algorithms and deep neural networks.
3.1 Dataset
To validate the proposed classification models for aspect-level sentiment analysis, we used a real-life dataset introduced by Kastrati et al. [2]. The dataset contains students’ reviews gathered from 15 different computer science courses on the Coursera online learning platform. All reviews are in English. Each student feedback item is labeled with one of the five aspect categories (Instructor, Content, Structure, Design and General) and with one of the three polarity classes (Positive, Negative and Neutral). Some statistics of the target dataset are given in Table 1. The distribution of reviews across both the aspect categories and the sentiment polarity classes is highly imbalanced. More specifically, 84.22% of reviews are labeled as positive, 10.56% as negative, and 5.21% as neutral. Among the aspect categories, 57.42% of reviews belong to the Content category, whereas the rest are distributed across the four other categories, Instructor, Design, General and Structure, with 19.36%, 9.96%, 9.58% and 3.65% of the reviews, respectively.

Table 1. Dataset statistics
Data                     Value
No. of reviews           21,940
No. of aspects           5
No. of polarity classes  3
Max. length              554 words
Min. length              1 word
Avg. length              25 words

3.2 Preprocessing
We applied a few preprocessing steps to the dataset before feeding it to the classifiers. In particular, we removed all irrelevant symbols such as HTML tags, punctuation, and stop words, and converted the text to lowercase. Machine learning and deep learning algorithms cannot be fed raw text, so the text needs to be converted into a format they support, namely a numerical vector. The study conducted by Kastrati et al. [2] demonstrated that using term frequency (tf) as a term weighting scheme led to a lower classification accuracy than input features generated by the tf*idf weighting scheme. Therefore, we used term frequency-inverse document frequency (tf*idf) as the representation technique. tf*idf measures the relevance of words using two components, tf and idf, where tf reflects the importance of a word within a document and idf reflects the distribution of that word across the collection of documents. Since the dataset is highly imbalanced, we used the synthetic minority over-sampling technique (SMOTE) as a class balancing method for the conventional machine learning algorithms. For all classifiers used in this research, we randomly split the dataset into 70% for training and 30% for testing.
3.3 Model Architectures and Parameter Settings
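The conventional classifiers configured in this section operate on the tf*idf vectors introduced in Sect. 3.2. As a rough illustration of that weighting, the following minimal sketch assumes whitespace tokenization and the plain form idf = ln(N / df). The exact variant used in the study is not stated, and libraries such as scikit-learn apply a smoothed and normalized form by default, so the numbers here are illustrative only. The function name and the toy reviews are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with plain tf * ln(N / df) weights."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        vectors.append(
            [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocabulary]
        )
    return vocabulary, vectors


# Toy reviews (made up for illustration, not taken from the dataset).
vocab, vecs = tfidf_vectors(["great course", "great instructor", "boring course"])
```

Terms that occur in every document receive weight 0 under this plain form, which is one reason practical implementations add smoothing.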
To obtain the best deep neural network architectures, we used AutoKeras¹. AutoKeras is an automated machine learning system based on Keras that automatically searches for the best architectures and hyperparameters for deep learning models. We conducted the classification experiments on the original (imbalanced) dataset with the default parameters. By default, AutoKeras tries 100 different models; however, due to limited memory, the maximum number of different models (max_trials) was set to 10.
¹ https://autokeras.com/
The validation dataset consisted of 15% of the training data, and the number of epochs was set to 9. A 1D-CNN deep learning model was selected by AutoKeras for the aspect category classification. The model architecture is shown in Fig. 1. It is composed of eight layers, including one embedding layer, two dropout layers, one convolutional layer, one maxpooling layer and two dense layers. Specifically, the embedding layer takes a 512-D feature vector built from students’ reviews and converts each word to a 64-D embedding vector. The output of the embedding layer is fed to a dropout layer, whose output forms the input of the 256-unit convolutional layer containing 1D convolution filters. A 1D global maxpooling operation is applied in the maxpooling layer to calculate the maximum value of each feature map. Those outputs are then fed into a 256-unit fully-connected layer with a ReLU activation function. The output of this dense layer serves as input to the second dropout layer. Finally, the output of that dropout layer is fed into a dense layer with a softmax activation function to compute a discrete probability distribution over the five aspect categories.
Fig. 1. 1D-CNN model architecture for the aspect category classification.
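Two of the operations named in the architecture description, global max pooling and the softmax output, can be made concrete with a small pure-Python sketch. This is not the AutoKeras model itself: the sizes are toy values and the function names are hypothetical, chosen only to show what each step computes.

```python
import math


def global_max_pool(feature_maps):
    """Reduce a sequence of per-position activations to one value per filter.

    feature_maps is a list of positions, each a list with one activation
    per convolution filter; the result keeps the maximum for each filter.
    """
    return [max(channel) for channel in zip(*feature_maps)]


def softmax(logits):
    """Turn raw scores into a discrete probability distribution."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 3 sequence positions, 2 filters.
pooled = global_max_pool([[1.0, 5.0], [3.0, 2.0], [0.0, 4.0]])
# Toy logits for the five aspect categories.
probabilities = softmax([2.0, 0.5, 0.1, -1.0, 0.0])
```

Global max pooling discards the position of the strongest activation, so the dense layers see a fixed-size vector regardless of review length.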
In the same fashion, we used AutoKeras to obtain the best network architecture for the aspect sentiment classification. The selected network is a BERT model, as illustrated in Fig. 2. It is composed of three layers: a bert-tokenizer, a bert-encoder and a dense layer with a softmax activation function to
compute a discrete probability distribution over the three sentiment categories.
Fig. 2. BERT model architecture for the aspect sentiment classification.
As with the deep neural networks, we used the Auto-sklearn² toolkit to obtain the best conventional machine learning algorithms. Auto-sklearn is an automated machine learning toolkit written in Python that automatically searches for the best machine learning algorithm for a given dataset. The Auto-sklearn classifier was built with default parameters and with preprocessing excluded. The classifier successfully ran five algorithms, AdaBoost, SVM, RF, DT and stochastic gradient descent (SGD), and the best model was selected based on the maximum mean test score.
4 Results and Discussion
Supervised conventional machine learning algorithms and deep neural networks are used to predict the aspect categories and to classify the students’ opinions towards these aspects. In particular, we used three different conventional machine learning algorithms: Random Forest (RF), Support Vector Machine (SVM) and Decision Tree (DT). All these algorithms are implemented using the scikit-learn³ library written in Python. A grid search is performed to fine-tune the parameters and obtain the best classification results. Two automated machine learning methods are used to search for the best deep neural network architectures and the best conventional machine learning algorithms.
² https://automl.github.io/auto-sklearn/master/
³ https://scikit-learn.org/stable/
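The grid search mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea, not the scikit-learn routine used in the study: the scoring function below is an artificial stand-in for a cross-validated F1 score, and the grid values are arbitrary.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at n_estimators=100, max_features=4.

    In a real experiment this would train a classifier with the given
    parameters and return its mean cross-validation score.
    """
    return -abs(params["n_estimators"] - 100) - abs(params["max_features"] - 4)


best_params, best_score = grid_search(
    {"n_estimators": [50, 100, 200], "max_features": [2, 4, 8]},
    toy_score,
)
```

The cost grows with the product of the grid sizes, which is why only a few parameters per classifier are typically tuned this way.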
4.1 Aspect Category Classification
Conventional machine learning classifiers are trained on the output of SMOTE. To account for the stochastic nature of the algorithms, each classifier is run three times and the average of the outcomes is presented as the final result. Information-retrieval-based metrics, namely precision, recall and F1 score, are used to measure the performance of all classifiers. The performance of five different classification algorithms on the aspect category classification task with respect to precision, recall and F1 score is shown in Table 2.

Table 2. Performance of ML algorithms on the aspect classification
ML algorithm          P (%)   R (%)   F1 (%)
RF                    98.09   97.99   98.01
SVM                   88.67   88.10   88.20
DT                    79.05   78.99   78.88
SVM (Auto-sklearn)    77.71   78.29   77.39
Conv1D (AutoKeras)    64.03   66.89   65.82
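The precision, recall and F1 scores reported in Table 2 follow the standard definitions, which the sketch below spells out for a multi-class setting. The averaging scheme across classes is not stated in the text; the unweighted (macro) mean used here is an assumption, and the labels are toy data rather than predictions from the study.

```python
def per_class_prf(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed one class at a time."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1)
    return report


def macro_average(report):
    """Unweighted mean of precision, recall and F1 over all classes."""
    n = len(report)
    return tuple(sum(scores[i] for scores in report.values()) / n for i in range(3))


# Toy labels for illustration only.
y_true = ["Content", "Content", "Instructor", "General"]
y_pred = ["Content", "Instructor", "Instructor", "General"]
report = per_class_prf(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(report)
```

With a macro average, a small class such as Structure weighs as much as the dominant Content class, which matters for a dataset as imbalanced as this one.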
For the DT classifier, after fine-tuning, the parameters random_state and max_depth are set to 0 and 100, respectively. All the other parameters are set to default values.

For the RF classifier, the parameters max_features and n_estimators are fine-tuned using a grid search with cross-validation. The number of features is one of the important parameters that need to be set. The max_features argument sets the number of features that are randomly sampled at each split point; by default it is set to the square root of the number of input features. The parameter n_estimators indicates the number of trees. The default value for this parameter is 100, but it may not lead to the optimized model. The number of trees should be increased until no further change in the model result is observed. After fine-tuning, max_features is set to 10 and n_estimators is set to 150. All the other parameters are set to default values.

The SVM is a binary classification algorithm. In order to use SVM for multiclass classification, the dataset with multiple classes needs to be divided into binary datasets. There are two main strategies for doing this:

– One versus rest (ovr): the multiclass classification is divided into one binary classification per class.
– One versus one (ovo): the multiclass classification is divided into one binary classification per pair of classes.

The scikit-learn library that we used to implement the SVM algorithm supports the ovo approach through its SVC class. Other parameters include: the kernel is Radial Basis Function (RBF), C is set to 10, and gamma is set to 1. The best classification algorithm selected by Auto-sklearn was liblinear_svc with parameters C and decision_function_shape set to 5.29 and ovr, respectively.

Table 3 shows class-wise performance of RF on the aspect category classification with respect to precision, recall and F1-score.

Table 3. Classification report of RF on the aspect category classification
Classes       P (%)   R (%)   F1 (%)
Content       99.11   94.24   96.61
Instructor    99.71   99.53   99.62
General       92.32   99.04   95.56
Structure     99.76   99.79   99.77
Design        97.39   98.42   99.49
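The two decomposition strategies described for the SVM can be made concrete with a short sketch. It is illustrative only: the helper names are hypothetical, and the class names are taken from Table 3. For k classes, one-versus-rest yields k binary problems and one-versus-one yields k(k-1)/2.

```python
from itertools import combinations

# Aspect categories as listed in Table 3.
ASPECT_CLASSES = ["Content", "Instructor", "General", "Structure", "Design"]

def one_vs_rest(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def one_vs_one(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))

ovr_problems = one_vs_rest(ASPECT_CLASSES)
ovo_problems = one_vs_one(ASPECT_CLASSES)
print(len(ovr_problems))  # 5 binary problems for 5 classes
print(len(ovo_problems))  # 10 binary problems, i.e. 5 * 4 / 2
```

With five aspect categories, the ovo strategy therefore trains twice as many binary classifiers as ovr, each on a smaller two-class subset of the data.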
4.2 Aspect Sentiment Classification
The next task is to examine the performance of conventional machine learning algorithms on the aspect sentiment classification task. Table 4 depicts performance of the ML algorithms with respect to precision, recall, and F1-score.

Table 4. Performance of ML algorithms on the aspect sentiment classification

ML methods           P (%)   R (%)   F1 (%)
RF                   99.43   99.43   99.43
SVM                  96.38   96.33   96.34
DT                   89.17   89.10   89.08
RF (Auto-sklearn)    91.69   91.64   91.62
BERT (AutoKeras)     91.13   92.25   92.00
To obtain the results, we fine-tuned the parameters in the same fashion as in Sect. 4.1. For the DT classifier, max_depth is set to 1500. For the RF classifier, max_features and n_estimators are set to 50 and 200, respectively. For the SVM, the kernel is Radial Basis Function (RBF) and C is set to 10. All the other parameters were set to default values. The best classification algorithm selected by Auto-sklearn was the RF classifier with the following parameters: bootstrap, max_features, min_samples_leaf and min_samples_split are set to True, 0.499, 2 and 13, respectively. Table 5 shows the class-wise performance of RF in terms of precision, recall and F1-score on the aspect sentiment classification task.
Table 5. Classification report of RF on the aspect sentiment classification

Classes     P (%)   R (%)   F1 (%)
Negative    99.63   98.86   99.24
Neutral     99.80   99.70   99.75
Positive    98.86   99.71   99.29
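For reference, the manually tuned hyperparameter values reported in Sects. 4.1 and 4.2 can be collected in one place. The dictionary layout and the name `REPORTED_SETTINGS` are hypothetical; the keys follow scikit-learn's parameter names, and every parameter not listed was left at its default according to the text. The gamma value and the ovo strategy are stated only for the aspect category task, so they are not repeated for the sentiment task.

```python
# Hyperparameters as reported in the text; unlisted parameters keep defaults.
REPORTED_SETTINGS = {
    "aspect_category": {
        "DT": {"random_state": 0, "max_depth": 100},
        "RF": {"max_features": 10, "n_estimators": 150},
        "SVM": {"kernel": "rbf", "C": 10, "gamma": 1,
                "decision_function_shape": "ovo"},
    },
    "aspect_sentiment": {
        "DT": {"max_depth": 1500},
        "RF": {"max_features": 50, "n_estimators": 200},
        "SVM": {"kernel": "rbf", "C": 10},
    },
}

for task, models in REPORTED_SETTINGS.items():
    for model, params in models.items():
        print(task, model, params)
```

Each inner dictionary could be passed as keyword arguments to the corresponding scikit-learn estimator, for example `RandomForestClassifier(**REPORTED_SETTINGS["aspect_category"]["RF"])`.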
As can be seen from Table 2 and Table 4, RF outperforms the other techniques, achieving an F1 score of 98.01% on the aspect category classification task and 99.43% on the aspect sentiment classification task. One explanation for this could be the randomness of the RF classifier: RF searches for the best feature among a random subset of features when splitting a node. This generates many classifiers, whose results are aggregated to increase accuracy. Since RF combines many classifiers, it achieves better accuracy on the test set than a single DT.

Table 2 and Table 4 also demonstrate that conventional machine learning techniques achieved better performance than the deep learning models. One explanation for this is that we used a class balancing strategy, SMOTE, to overcome obstacles due to imbalance in the dataset. Although SMOTE is very useful for conventional machine learning techniques, it has been shown to be less useful with complex networks such as the 1D-CNN and BERT deep learning models. Since the overlap between polarity categories is smaller than the overlap between aspect categories, both conventional machine learning and deep learning techniques perform better on the sentiment classification task than on the aspect classification task.
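Since SMOTE plays a central role in this explanation, a minimal sketch of its core step may help. SMOTE creates a synthetic minority-class sample by interpolating between a minority sample and one of its nearest minority-class neighbours. The function below is an illustration under simplifying assumptions (brute-force neighbour search, toy two-dimensional points, hypothetical helper name `smote_sample`); it is not the implementation used in the study.

```python
import math
import random

def smote_sample(minority, k=2, rng=random):
    """Create one synthetic point from a list of minority-class vectors."""
    base = rng.choice(minority)
    # Brute-force search for the k nearest minority neighbours of `base`.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Interpolate: base + gap * (neighbour - base), with gap drawn from [0, 1).
    gap = rng.random()
    synthetic = [b + gap * (n - b) for b, n in zip(base, neighbour)]
    return base, neighbour, synthetic

# Toy minority class with four two-dimensional feature vectors.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
base, neighbour, synthetic = smote_sample(
    minority_points, k=2, rng=random.Random(0)
)
print(base, neighbour, synthetic)
```

Because every synthetic point lies on a line segment between two existing minority samples, the method densifies the minority region of feature space rather than duplicating samples.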
5 Conclusion
This study attempted to analyze students’ feedback employing natural language processing and opinion mining approaches. The contribution of this article is at two distinct levels: the aspect category classification and the aspect sentiment classification. We trained and evaluated three state-of-the-art machine learning and two deep learning models on student reviews collected from MOOC courses, consisting of 21,940 feedback comments in the English language. We selected the best model architectures for deep learning utilizing the AutoKeras utility. Our results indicated that, despite network optimization for the 1D-CNN and the state-of-the-art BERT model, the performance achieved by these deep learning models was lower than that of the conventional models on the given dataset. Random Forest, which outperformed the other algorithms, achieved a 98.01% F1-score for the aspect category classification and a 99.43% F1-score for the aspect sentiment classification. This is approximately 22% more for the aspect category classification than using 1D-CNN and 8% more for aspect sentiment classification than using the BERT model. As future work, we are planning to use text generation techniques in order to balance the dataset and to test other contextual word embeddings and deep neural networks.
Towards an Automated Language Acquisition System for Grounded Agency

James R. Kubricht¹, Sharon Small²(✉), Ting Liu², and Peter H. Tu¹

¹ General Electric Research, Niskayuna, NY 12309, USA
² Siena College, Loudonville, NY 12221, USA
[email protected]
Abstract. The Automated Language Acquisition System focuses on the fundamental question of grounding; that is, how an agent acquires and represents the meaning of concepts. We take the view that prior to acquiring natural language, an agent has experienced the world in a largely private manner. The agent has visual experiences from which it has learned that objects and object categories exist, that they persist over time and that they possess attributes. The agent also understands that events take place in a physical world and that agents exist and have purpose. We show that Emergent Languages, which can be constructed in an unsupervised manner, can be thought of as a private language. A natural language such as English can then be viewed as a repository of concepts. By applying unsupervised clustering to images described via an Emergent Language, we show that an agent can then analyze such clusters and map them to their associated natural language concepts with the aid of a Natural Language expert. Another form of experience is exposure to spoken speech, the idea being that syntax and statistical frequencies can be observed. This allows for a form of inductive learning where an agent can expand its conceptual knowledge. After novel concepts have been discovered and mapped to associated Emergent Language descriptions through machine translation, relationships between concepts are constructed. The hypothesis is that the meaning of a concept emerges from its connections with other concepts. Relationships that are investigated include: physical and causal, metaphorical and pragmatic.

Keywords: Grounding · Emergent languages · Natural languages

1 Introduction
The Automated Language Acquisition System (ALAS) focuses on the grounding problem in the context of Natural Language (NL) acquisition based primarily on visual stimuli. We view a grounded agent as having acquired the meaning of concepts that manifest in the physical world and are maintained in natural language. We argue that the grounding of a symbol is tantamount to the capacity for accurately detecting instances of symbols when presented with physical stimuli (i.e., a form of indexing). We argue that in addition to indexing capabilities, the meaning of a symbol emerges from connections with other symbols which may be statistical, physical, metaphorical, pragmatic or driven by utility. The current work attempts to instantiate this perspective in the context of related research.

Core to our approach is the construction of language-like representations of visual stimuli, which is achieved using referential game training methods employed in recent work on emergent languages [1–7]. We also take as motivation prior work which has explored the symbol grounding problem to support robot-human dialogue [8]. In order to understand novel commands to the robot, word meanings were learned from corpora using a hierarchical model of natural language phrases and probabilistic inference. However, different from supervised [9] and unsupervised machine learning mechanisms [10], children are able to learn language through self-training from a small set of learned knowledge. Therefore, we expand a previous data-driven approach [11] to simulate a learning process that can expand knowledge from a small amount of training data by building an induction algorithm on top of an unsupervised grammar induction approach [12].

With this motivation in mind, we explore the means by which grounded knowledge can be acquired in an automatic fashion. Specifically, we consider i) the possibility of unsupervised experience of physical stimuli as a form of priming, ii) subsequent interaction with a natural language as a form of concept discovery and iii) tasks associated with problem solving as the means by which meaningful connections between concepts are established. In terms of priming, a child will have many years of unsupervised exposure to visual stimuli. Knowledge derived from such exposure may be encapsulated in a private language. Then, once exposed to a natural language, the child must reconcile its private language with formal words. Interactions with experts, as well as general exposure to the language’s syntax, allow for discovery and induction processes followed by knowledge of affordances and causal relationships.

The technical architecture of the ALAS system is composed of a variety of computational modules: i) Emergent Languages (EL) provide an unsupervised architecture for the construction of a symbolic vocabulary for describing images, videos and other forms of physical stimuli; ii) Machine Translation (MT) allows for direct mapping of EL codes to NL symbols given a modest amount of NL annotation; iii) an Agent Knowledge Base (AKB) clusters imagery based on unsupervised EL expressions, allowing agents to engage in dialogue with experts to discover novel concepts and directly annotate unseen imagery; iv) a Phrase Dependency Structure (PDS) emulates the child’s capacity for acquiring syntactic and statistical properties of NL, which is achieved by applying unsupervised methods for NL tuple parsing and induction; and v) an Artificial Intentionality Engine (AIE) uses learned NL concepts (i.e., entities, attributes, relations and actions) to describe the state of the world in an evolving relational graph, thereby allowing for planning and policy generation as a means for representing complex behavior.

From a processing point of view, the ALAS system can be characterized as follows: i) given a corpus of imagery, the EL module produces a private descriptive language; ii) given a modest amount of NL annotation, MT is used to provide a mapping between the EL codes and a small set of NL concepts; iii) the AKB then performs unsupervised clustering on the EL, and by considering EL cluster trees and initial annotations, the AKB supports expert dialogues to discover novel NL concepts; iv) the PDS then uses induction to expand the set of discovered concepts; v) internet scraping is then used to collect samples of new concepts which are processed by the EL and MT modules; and vi) the NL concepts derived by the AKB and PDS modules are used to represent the state of the world in terms of relational graphs which are used by the AIE to generate plans and policies which can be performed, recognized and described.

In summary, the ALAS system supports a form of curriculum-based learning, where each lesson builds on previously learned representations.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 23–43, 2022. https://doi.org/10.1007/978-3-030-82199-9_3
2 Lessons

2.1 Lesson I: Entities
Background. To effectively represent the world, an agent must possess some sort of class hierarchy or ontology. While class instances will in general exhibit variation, class boundaries as well as ontological structure itself may vary between cultures and individuals. Faced with such ambiguity, we consider the process by which a child may form its own object class ontology through a private language. We therefore explore the utility of Emergent Languages, which are based on Ludwig Wittgenstein’s theory of referential games. This unsupervised learning approach allows for the construction of a coding system that can be used to describe visual stimuli in a referential game. It is argued that the structure of this private language encapsulates an implicit object ontology and other linguistic elements. Having constructed an initial EL coding system, the first step towards grounding NL object or entity symbols is through translation. Like a Rosetta stone, a modest amount of annotation of stimuli samples in terms of NL object classes allows for the application of modern Machine Translation methods resulting in a mapping of EL codes to NL symbols. Unsupervised clustering methods are applied directly to EL codes, resulting in AKB cluster trees. By considering how image samples and their associated NL annotations are distributed amongst the EL cluster trees, the ALAS agent can begin to compare and contrast EL/NL ontologies. In addition to the use of MT methods to convert EL codes to NL symbols, a kind of nearest neighbor classification system using the EL cluster trees can be used to predict the NL description of novel imagery that has been described using the EL coding system. Emergent Languages. Unlike traditional supervised language representation methods, the Emergent Language (EL) approach [3,5] does not require explicit language descriptions when training. 
Instead, it has been shown that structured compositional languages naturally emerge when artificial agents learn to communicate in image-based referential game environments from an initial blank slate. The significance of the agent’s learned communication protocol is that semantic interpretability and explainability are associated with patterns in raw data via grammatical representation of the symbolic language. This is achieved through an EL architecture built on paired sender-receiver LSTMs (see Fig. 1, left panel). The architecture takes as input embeddings from the top layer of a convolutional neural network and passes them through a sender LSTM, which uses categorical sampling to generate a token at each step. These tokens are then passed to a receiver LSTM, which transforms the tokens back into an embedding. This embedding is then compared with the true embedding (along with a large set of distractors), and training continues until an agreed-upon language is developed.

Fig. 1. (Left) Sender and receiver LSTM architecture (see Havrylov & Titov, 2017). CNN embeddings are passed to a sender LSTM followed by categorical sampling and a receiver LSTM. Reconstructed images are compared to originals; error propagates through the network. (Right) Comparison of emergent language symbols used to describe categories of images in the MSCOCO dataset. Higher dissimilarity corresponds with brighter cells. Categories which are intuitively similar (or dissimilar) are darker (or brighter).

Initially, we trained an EL to encode images from the MSCOCO dataset as sentences consisting of ten tokens/integers (1,000 available tokens). We then compared the sentences used to describe each of the super-categories in the dataset. We found that sentences used to describe scenes that were semantically similar (e.g., vehicle and outdoor) were more similar than sentences used to describe disparate scenes (e.g., animal and electronic; see Fig. 1, right panel). Following this encouraging result, we extended the EL methodology to an initial dataset consisting of approximately 50 images for each of 14 different entity types (see Fig. 2, left panel). Given EL expressions (i.e., sentences with ten tokens/integers) for each of the shown entities, a benchmark test was conducted to determine how well the EL expressions captured semantic information. This was achieved using a random forest classifier (RFC) model (see Fig. 2, right panel). The RFC achieved an overall test accuracy of 83%. Given evidence that EL symbols effectively carry information regarding entity class, we further developed an Agent Knowledge Base (AKB) to connect the private language and natural language.
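The referential game described above, in which the receiver's reconstructed embedding must single out the true embedding among distractors, can be sketched without the LSTMs. This is a simplified illustration, not the authors' implementation: the function names and toy vectors are hypothetical, and scoring candidates by dot product is an assumption about how the comparison could be made.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def receiver_choice(decoded, candidates):
    """Index of the candidate embedding most similar to the decoded message."""
    scores = [dot(decoded, candidate) for candidate in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

def play_round(decoded, target, distractors):
    """Return True when the receiver picks the target among the distractors."""
    candidates = [target] + distractors
    return receiver_choice(decoded, candidates) == 0

target = [0.9, 0.1, 0.0]
distractors = [[0.0, 1.0, 0.0], [0.1, 0.2, 0.9]]
decoded_good = [0.8, 0.2, 0.1]  # receiver output close to the target
decoded_bad = [0.0, 0.3, 0.9]   # receiver output close to a distractor
print(play_round(decoded_good, target, distractors))  # True
print(play_round(decoded_bad, target, distractors))   # False
```

During training, the fraction of rounds won in this way would serve as the signal that the sender and receiver have converged on a shared code.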
Fig. 2. Initial fourteen entities described using EL system in GAILA phase I. Approximately 50 images per entity were used in a self-supervised training paradigm.
Agent Knowledge Base. The Agent Knowledge Base (AKB) is constructed by first clustering EL codes and then generating a probabilistic decision tree. The clustering module processes the EL codes from the EL module using a standard K-Means clustering approach, which is known for its ability to perform well on smaller data sets. The first two EL codes in each generated sequence were treated as a 2-gram, while the codes in positions 3–5 were treated as a Bag of Words (BOW). This clustering technique and the corresponding restrictions on the 2-gram and BOW were selected after baseline experiments utilizing various clustering techniques while also varying the N-gram and BOW windows. We found that this configuration gave the best clustering performance on related entities. Using this method, we first experimented with the 14 Entity 50 Images collection (14 entities, 50 images per entity). The resulting clusters (see Fig. 3) demonstrate that EL codes are indeed sensitive to relationships among entities (i.e., Cluster 1: Transportation – Horse, Motorcycle, Car; Cluster 2: Eating Utensils – Cup, Cylinder, Plate, Spoon, etc.).

Experiments were then conducted to evaluate how well this process scales when an EL is learned on one set of images and then applied to generate EL code sequences for a new set of unseen images. We chose seven of the existing entities and collected an additional 50 new images for each category. Figure 4 shows the set of clusters generated from an Emergent Language learned on a 7 Entity 50 Images collection from the initial dataset and applied to a newly collected set of the same entities.

The AKB was then extended to further capture knowledge learned from the EL and clustering steps. We generated an AKB tree for each cluster using the contained EL codes and the corresponding images. Representative patterns were saved as a Probabilistic Decision Tree (see Fig. 5), with wildcards being used when the corresponding codes do not distinguish well. The probabilities were then estimated by counting how many training images of each entity match that code sequence. The NL words and phrases on the leaf nodes are simply the labels of the images.
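The feature construction and the count-based leaf probabilities described above can be sketched as follows. The example is a simplification made for illustration: the helper names and EL sentences are made up, and probabilities are keyed on the leading 2-gram only, whereas the AKB tree in the text uses longer code patterns with wildcards.

```python
from collections import Counter, defaultdict

def el_features(codes):
    """Split an EL sentence into a leading 2-gram and a bag of words.

    Positions 1-2 form the 2-gram; positions 3-5 form the bag of words.
    """
    bigram = tuple(codes[:2])
    bag = Counter(codes[2:5])
    return bigram, bag

def leaf_probabilities(samples):
    """Estimate P(entity | leading 2-gram) by counting training labels."""
    counts = defaultdict(Counter)
    for codes, label in samples:
        bigram, _ = el_features(codes)
        counts[bigram][label] += 1
    return {
        bigram: {label: n / sum(labels.values()) for label, n in labels.items()}
        for bigram, labels in counts.items()
    }

# Made-up EL sentences (ten integer tokens each) with their image labels.
training = [
    ([7, 42, 3, 3, 9, 1, 0, 5, 8, 2], "horse"),
    ([7, 42, 3, 6, 9, 4, 0, 5, 8, 2], "horse"),
    ([7, 42, 1, 6, 2, 4, 1, 5, 8, 2], "car"),
    ([5, 11, 3, 6, 9, 4, 0, 5, 8, 2], "cup"),
]
probabilities = leaf_probabilities(training)
print(probabilities[(7, 42)])
```

In this toy data, two of the three sentences starting with the 2-gram (7, 42) are labelled horse, so the leaf for that pattern assigns horse a probability of two thirds and car one third.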
Fig. 3. Initial fourteen entities described using EL system. Approximately 50 images per entity were used in a self-supervised training paradigm.
Clusters and trees were evaluated by dividing the 14 Entity 50 Images collection into an 80% training and 20% testing split. We generated clusters and built AKB trees on the training set and evaluated on the testing set to determine whether entity predictions from EL codes were accurate based on the AKB. For each run, we first compute the centroid coordinates of each cluster and place them in a multi-dimensional space. Next, the feature set of each test image is converted into a vector that is placed in the same multi-dimensional space. The feature set includes all of the 1-grams in positions 3–5 and the 2-gram in positions 1 and 2. The distances between the new image coordinates and the centroids are then calculated using the Euclidean distance (L2 norm). The AKB tree for the best cluster match(es) is then searched to find the best NL description. When matching against the probabilistic AKB, we used a threshold of 70% as our confidence metric. We matched entities correctly 62% of the time, with only 14% incorrect and 24% ambiguous, for the 14 Entity 50 Images collection.
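The matching procedure just described (nearest centroid by Euclidean distance, followed by a 70% confidence threshold) can be sketched as below. This is a minimal illustration under stated assumptions: the names, centroids and probabilities are made up, the tree search is replaced by a per-cluster probability table, and applying the threshold to the most probable label is one plausible reading of the text.

```python
import math

def nearest_cluster(vector, centroids):
    """Return the id of the centroid closest to `vector` (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(vector, centroids[cid]))

def predict_entity(vector, centroids, cluster_probabilities, threshold=0.70):
    """Predict an entity label, or 'ambiguous' below the confidence threshold."""
    cluster_id = nearest_cluster(vector, centroids)
    label, confidence = max(
        cluster_probabilities[cluster_id].items(), key=lambda item: item[1]
    )
    return label if confidence >= threshold else "ambiguous"

# Hypothetical centroids and per-cluster label probabilities.
centroids = {"transport": [1.0, 0.0, 0.2], "utensils": [0.0, 1.0, 0.8]}
cluster_probabilities = {
    "transport": {"horse": 0.75, "car": 0.25},
    "utensils": {"cup": 0.5, "plate": 0.5},
}
print(predict_entity([0.9, 0.1, 0.1], centroids, cluster_probabilities))  # horse
print(predict_entity([0.1, 0.9, 0.9], centroids, cluster_probabilities))  # ambiguous
```

The second query falls in a cluster whose best label has only 50% support, which is how an "ambiguous" outcome of the kind reported above can arise.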
2.2 Lesson II: Entities and Attributes
Background. Next, we consider attributes that can be associated with object/entity classes by considering four key processes: i) description, ii) definition, iii) discrimination and iv) detection. In certain cases, a sufficiently salient list of attributes could be used to define an object class, discriminate between classes and perform detection. One-shot learning methods, e.g., [13,14], have explored this paradigm in earlier research. That work proposes that a set of attributes or affect classifiers can be learned over a large set of object classes, e.g., an image of a duck could exhibit “featheriness” or “floatyness”. These traits can then be used to classify a novel image, although this approach is better suited to some entities than to others, thereby limiting its viability as a generalized method. Instead, we argue for an Aristotelian perspective whereby entities are defined through two key attributes, i.e., those required for within-class membership and those used to distinguish between subclasses. In addition to Aristotle’s key attributes, we explore the idea of shared attributes between classes which allow for various forms of metaphor and analogy.

Having constructed the EL cluster trees in Lesson I, an agent may now analyze their structure. For example, if the majority of wooden chairs are in one branch of a cluster tree and most of the soft chairs are found in another, the agent can enter into a dialogue with an expert and learn that there is an attribute known as “cushioning”. The agent can then infer that cushioning may be a common or shared attribute between certain chairs and certain sofas. Once the concept of a wooden chair has been found, various forms of PDS induction allow for the hypothesis of novel concepts, e.g., rocking chairs and wooden tables. These forms of discovery and induction are formally explored later in this lesson.

Fig. 4. Related entities discovered when clustering EL codes of our 7 entity 50 images per entity collection.

Fig. 5. Segment of a cluster’s probabilistic agent knowledge base (AKB) tree built from the 14 entity 50 images collection.

Dialogue with Natural Language Experts. Our Expert module simulates a parent/teacher interaction by allowing the agent to ask questions to learn more about the concepts uncovered in the AKB. Operating from our discovery that entities in the same AKB trees possess similar attributes, we have implemented dialogue to learn about these attributes. We began by exploring common parts and their uses for entities in the same AKB trees, in addition to their common attributes. In Lesson II, we experimented with the 14 Entity 50 Images collection. From the interaction example in Fig. 6, we learn that people, chairs, tables and horses all have a common attribute: legs. We also learn the concepts of support and movement, and that the part required for both people and horses to be able to select the option of movement is legs. We also learn that while the class of objects that use legs for support includes tables and chairs, these do not belong to the movement class. Therefore, no option is learned for tables and chairs to move using their legs, as represented in our learned Knowledge Base (see Fig. 6, right). As knowledge is acquired during this process, information is saved in a Propositional Logic format in a MySQL database.

Fig. 6. (Left) ALAS dialogue interface with (middle) actual expert interaction and (right) knowledge learned.

Phrase Dependency Structure. The 2008 LENA (Language ENvironment Analysis) Study showed that, “the number of words spoken by ‘talkative parents’ to their children averaged 20,000 words per day”, which include rich explanations from different aspects about the concept(s) that parents teach to their children.
The children may not be able to understand everything their parents say to them, but there are at least two benefits: i) children can build their own parser intuitively from their parents’ repetitive explanation of new entities/concepts and ii) they can boost their knowledge through induction. Inspired by how children absorb knowledge from their environment, we aim to teach an agent to learn automatically from a small set of knowledge from the AKB by using unannotated data. To do so, we propose the Phrase Dependency Structure (PDS) for learning patterns regarding how attributes and actions are utilized in language around specific entities. New attributes, actions, and entities are then discovered by applying learned patterns to the original training data. Furthermore, the agent will explore unseen data that contains the new entities, attributes, and actions for further expansion. In this way, the natural language is seen as a repository of concepts that can be mined by the agent. Through this semi-supervised learning process, the agent can acquire its conceptual knowledge while demanding only minimal assistance from human experts.

Fig. 7. A tree example where the parser did not perform well.

To represent the most common usage of language, we use the Corpus of Contemporary American English (COCA), the “only large, genre-balanced corpus of American English” available online, as a primary data resource and focus on the 4-grams (13.5 GB) extracted from the corpus, which can be treated as either phrases or short sentences, to match the agent’s level. For those containing the entities that the agent has learned, we run an unsupervised parser [12] to generate syntactic trees (see Fig. 7 for an example). In a tree, we call the word syntactically closest to the entity its sibling, followed by its cousin and then its second cousin. For example, given the tree in Fig. 7, the sibling of “ball” is “the”, its cousin is “reaches”, and its second cousin is “if”. We then define a distribution structure, Left/Right (LR) Sibling/Cousin/2nd-Cousin (SCS), to check whether an entity’s attributes have different distributions compared with non-attribute words (see Table 1). Table 2 shows an example of the (Euclidean) distances among all attributes learned for the entity “ball” and some words that occur frequently around “ball” in the 4-grams. We can see that the distribution is able to distinguish the attributes from non-attributes, and that the unknown attributes (crystal and white) have a distribution similar to the learned ones.
Therefore, LR-SCS can be a good indicator for the agent to find more attributes. We utilize this indicator in two ways for new attribute extraction, generalization and expansion, which are described after Tables 1 and 2.

Table 1. The LR-SCS distribution of learned attributes and non-attributes for the entity “ball”.

Attributes   L-2nd-Cousin   L-Cousin   L-Sibling   R-Sibling   R-Cousin   R-2nd-Cousin
Red          4.56%          15.01%     69.71%      0.54%       6.17%      4.02%
Round        6.52%          11.41%     66.85%      1.63%       5.98%      7.61%
Soccer       1.20%          0.24%      97.48%      0.00%       0.36%      0.72%
and          0.40%          10.34%     14.08%      0.26%       75.03%     0.02%
of           0.08%          23.05%     11.95%      0.24%       64.67%     0.00%

Table 2. The distances between words using their LR-SCS distribution for the entity “ball”.

            Red    Round   Soccer   Orange   Yellow   Leather   Medicine   Avg.
Red         0.00   0.06    0.32     0.08     0.07     0.07      0.34       0.16
Round       0.06   0.00    0.34     0.09     0.12     0.13      0.36       0.18
Soccer      0.32   0.34    0.00     0.31     0.26     0.32      0.02       0.26
Orange      0.08   0.09    0.31     0.00     0.09     0.14      0.33       0.17
Yellow      0.07   0.12    0.26     0.09     0.00     0.08      0.27       0.15
Leather     0.07   0.13    0.32     0.14     0.08     0.00      0.33       0.18
Medicine    0.34   0.36    0.02     0.33     0.27     0.33      0.00       0.27
and         1.02   1.00    1.20     1.03     1.06     1.05      1.21       1.08
White       0.16   0.18    0.19     0.14     0.10     0.17      0.20       0.16
of          0.87   0.85    1.10     0.89     0.92     0.89      1.11       0.95
Crystal     0.35   0.37    0.03     0.34     0.29     0.34      0.01       0.25
Player      1.13   1.11    1.32     1.14     1.17     1.16      1.33       1.19
Punch       0.78   0.79    1.08     0.83     0.84     0.78      1.09       0.88
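The relationship between Table 1 and Table 2 can be checked with a few lines of code. The sketch below computes the Euclidean distance between LR-SCS distributions for the three attribute rows of Table 1, with percentages converted to fractions; the dictionary and function names are hypothetical. For these three rows the computed values agree, after rounding to two decimals, with the corresponding entries of Table 2 (0.06, 0.32 and 0.34).

```python
import math

# LR-SCS distributions for the entity "ball", copied from Table 1 and
# converted from percentages to fractions. Column order: L-2nd-Cousin,
# L-Cousin, L-Sibling, R-Sibling, R-Cousin, R-2nd-Cousin.
LR_SCS = {
    "Red":    [0.0456, 0.1501, 0.6971, 0.0054, 0.0617, 0.0402],
    "Round":  [0.0652, 0.1141, 0.6685, 0.0163, 0.0598, 0.0761],
    "Soccer": [0.0120, 0.0024, 0.9748, 0.0000, 0.0036, 0.0072],
}

def lr_scs_distance(word_a, word_b, table=LR_SCS):
    """Euclidean distance between the LR-SCS distributions of two words."""
    return math.dist(table[word_a], table[word_b])

for a, b in [("Red", "Round"), ("Red", "Soccer"), ("Round", "Soccer")]:
    print(a, b, round(lr_scs_distance(a, b), 2))
```

Words whose occurrences concentrate in the same positions around the entity, such as Red and Round, end up close together, while Soccer, which appears almost exclusively as the left sibling, lies further away.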
– Generalization: Apply learned attributes from one entity to other entities and seek support from the PDS for validation. As long as the words occur frequently around other entities, they can be attributes of those entities as well.
– Expansion: For unknown words that occur frequently, we compute the distances between them and the known attributes using LR-SCS. To decide how similar the distribution of an unknown word is to the distribution of attributes, we use the longest average distance among the attribute words as the threshold.

The generalization process finds 616 new attributes with 83% accuracy and the expansion process extracts 973 new attributes with 76% accuracy. One major error type in both processes comes from parsing errors; e.g., “fatal” is marked as a sibling of “car” in “the fatal car accident”, which makes the LR-SCS distribution of “fatal” wrong. We also experimented with entity expansion by looking for new entities in the 4-grams containing 14 new attributes using the LR-SCS distribution. A total of 932 new entities were detected with 77% accuracy. We believe that this self-learning accuracy will increase significantly with expert assistance and additional accumulated knowledge.

One type of dialogue teaches the agent new categories with examples, e.g., red, blue and green as colors. However, LR-SCS is not applicable to this kind of expansion, since the example words (instances) can occur anywhere around the category, e.g., “color red”, “the beautiful blue color”, “its color is green”. Therefore, we employ another bootstrapping approach (TL citation) to create patterns by normalizing the words in the path between an instance and the category. For instance, we can create the pattern “## is * color” from a list of 4-grams: “red is my color”, “red is her color”, “red is the color”, and “red is his color”. Then we calculate two indicators, precision (P) and recall (R), for each pattern and compute its score using the F-measure formula F = 2 × (P × R)/(P + R). To avoid introducing too much noise, we only keep the patterns with F scores above 80%. When new instances of the category are found through the patterns, we generate more patterns from the new instances. This bootstrapping process continues until either no new instances are extracted or the loop reaches the 5th iteration. Given the four categories with three or four examples per category introduced by the expert, the agent can find many more instances: “attire” category, 495 new findings, 64% accuracy; “color” category, 297 new findings, 86% accuracy; “clothing” category, 288 new findings, 66% accuracy; and “material” category, 76 new findings, 70% accuracy. One reason that the range of performance is quite broad is the complexity of the language associated with each category.
Lesson III: Composed Imagery, Attributes and Relations
Background. In previous lessons, we considered stimuli in terms of object-centric imagery. We now consider the topic of composed imagery with multiple objects, each with differing attributes. In addition to objects and attributes, we also encounter spatial relationships such as near, above and below. We explore the possibility of using Emergent Languages to capture such concepts. In addition to an NL description of the scene, such knowledge can also be represented by logical statements composed of predicates (near(object 1, object 2)) and propositions (near(my house, my car)). A set of propositions can be represented as a relational or scene graph. Relational graphs can then be viewed as a description of state, which will be used in subsequent lessons.
Composed Imagery and Machine Translation. A primary aim in Lesson III was constructing EL descriptions that can effectively encode semantic attributes; specifically, we investigated color and relative position. To determine whether an EL could capture the color of an object, we took the 7 Entity 100 Image dataset and colored the foreground object either green or blue. To determine whether an EL could capture relative position, we took the same dataset and generated composed imagery by randomly sampling entity pairs and placing them side by side or above one another, either nearby or far away. Size was also manipulated between small and large. The EL model was then trained such that one item from each combination was sampled at each training step. Although the EL model converged with high accuracy (i.e., the sender and receiver could play their referential game without error), we could not effectively classify color or relative position using the standard RFC technique. The RFC accuracy for color classification was approximately 60%, and the RFC accuracy for relative position classification was approximately 63%. This performance was much worse than expected, given that chance accuracy is 50%. It therefore appears that the EL system takes advantage of some sort of emergent grammar, which was not captured in the feature vector used in the RFC. To investigate this hypothesis, we trained a neural machine translator [15] to transform EL descriptions to natural language. This was achieved by first transforming the symbol sequences into an embedding, then feeding that embedding into the translator LSTM, which outputs tokens associated with each word, e.g., “ball”, “chair”, “red”, “above”. The translator was able to predict the correct natural language description from EL codes (object type and color/position) with test accuracies of 85% and 80% (training accuracies of 98% and 96%), respectively (see Fig. 8). This result indicates that the private language developed by the EL is effectively capturing attribute information, even though this is not evident in traditional classification and clustering methods that do not account for sentence structure.
Fig. 8. (Left) Method for NMT training from EL expressions; (right) translation accuracy with respect to semantic attributes of interest.
2.4 Lesson IV: Entity Modes, Affordances and Actions
Background. In Lesson IV, we consider the topic of modes, affordances and actions. Each object class may exist in various modes, e.g., a tree can be transformed into logs and then fire. In terms of affordances, an object can facilitate certain state transitions, e.g., a vehicle can move from location A to location B. If an agent is present, then certain actions can be performed. One representation of an action is in terms of a list of requirements and a list of results. A requirement is a logical statement (propositions or predicates) associated with the current state of the environment that must be true in order for the agent to execute the action. For example, an agent must be near a car in order for the agent to drive the car. Results are logical statements about the state of the environment that will be true once the action has been executed. The means for discovering such concepts is described in the following section; this process then sets the stage for the final lesson in the curriculum: behavior.
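The requirement/result view of an action described above can be made concrete with a short sketch. This is an illustration under stated assumptions, not the representation used in ALAS: propositions are modeled as plain tuples, the state is a set of propositions that currently hold, and the `removes` field, the `drive` action and its location names are hypothetical additions made so that the example can update a state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    requirements: frozenset            # propositions that must be true beforehand
    results: frozenset                 # propositions that are true afterwards
    removes: frozenset = frozenset()   # assumption: propositions the action invalidates


def can_execute(state, action):
    """An action is executable when every requirement holds in the current state."""
    return action.requirements <= state


def execute(state, action):
    """Return the successor state; the input state is left unchanged."""
    if not can_execute(state, action):
        missing = action.requirements - state
        raise ValueError(f"{action.name}: unmet requirements {sorted(missing)}")
    return (state - action.removes) | action.results


# Hypothetical "drive" action: the agent must be near the car, and the car
# moves from location A to location B.
drive = Action(
    name="drive",
    requirements=frozenset({("near", "agent", "car"), ("at", "car", "A")}),
    results=frozenset({("at", "car", "B"), ("at", "agent", "B")}),
    removes=frozenset({("at", "car", "A")}),
)

state = frozenset({("near", "agent", "car"), ("at", "car", "A")})
next_state = execute(state, drive)
```

Keeping states immutable makes it cheap to compare states before and after an action, which is convenient when actions are later chained into plans.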
Fig. 9. (Left) Unification categories are shown with examples and the first three symbols of their emergent language descriptions; natural language descriptions from neural machine translation are also provided (green is correct and red is incorrect). (Right) Confusion matrix using RFC model is provided. Labels are hidden since there were too many to plot.
Discovery. Two new corpora were utilized in Lesson IV. We created our Unification Corpus, which consisted of 30 entities in various modes, e.g., Chair Empty, Chair Full; Water Frozen, Water Vapor, Water Solid; Oven Hot, Oven Cold, etc. We had a total of 62 different Entity Modes, and we collected 10 images for each state for a total of 620 images. We also built a new Event collection consisting of seven entities from our original corpus (Standing Person, Standing Horse, Cup, Cylinder, Plate, Spoon, Table) and four events (Kick, Running Horse, Running Person and Throw). We collected 50 images of each type for the four new events. Following data collection, we constructed an emergent language to describe each image using ten-symbol sentences. Example images from each category, along with their emergent language expressions, are provided in the left panel of Fig. 9. Once EL descriptions were generated, we assessed whether they could be associated back to natural language using an RFC and an NMT model. Consistent with previously observed trends, RFC performance was substantially lower than NMT performance. In this case, the NMT method achieved a translation accuracy of approximately 79%.
We further developed an additional type of dialogue discovery: Relationships. We discuss here our results using our Unification Corpus during experiments with this new dialogue. Learning about relationships between our entities in various modes inherently allowed us to learn about entity affordances. As before, the first step was to cluster EL codes generated for this collection. In order to make subsequent results more interpretable, we divided this much more diverse collection into related subsets. The cluster results for the Nature subset (Agent Sitting, Agent Sleeping, Agent Standing, Agent Walking, Boat NoSail, Boat Sail, Plant, Rock, Tree Fire, Tree Log, Tree Natural, Water Frozen, Water Liquid, Water Vapor) are shown in Fig. 10.
Fig. 10. Clustering results for the unified corpus nature collection.
It is important to note that our clusters continue to support sensical dialogue with Experts. For example, Cluster 0 and Cluster 2 support dialogue to learn that the natural state of a Tree needs the liquid state of Water for growth. Cluster 7 supports dialogue to learn that a person can ride on a boat. Cluster 3 dialogue is shown in Fig. 11. Note that the Expert did not distinguish between boat-sail and boat-nosail but did provide the state of the water. Therefore, we can learn that the two states of the boat do not matter but the state of the water does, providing us with the extra affordances added to our MySQL Knowledge Base in propositional logic form (see Fig. 11, right panel). This knowledge could then be utilized as a Common Ground to understand an action that an Agent may take given that certain requirements are met; e.g., sitting Agents can ride on Boats. Furthermore, these boats can be used for movement provided that the requirement of the liquid mode of water is met. We are able to understand the mode that water must be in for the proposition move-on to be positive because of the explicit propositions: not-move-on(boats-sail, water-frozen), not-move-on(boats-sail, water-vapor), etc. After a single interaction with an Expert on 75% of the Nature Collection, we tagged the unseen 25% with applicable Propositions from our Knowledge Database. Matching was completed using just the first 2-gram of the EL codes of our unseen images against the first 2-gram saved for each proposition learned. We found that our propositions were correctly applied to the unseen images 86% of the time.
With newly learned actions obtained through the interactions between the Agent and Experts, we then utilized the PDS expansion modules to discover more actions.
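Returning to the tagging experiment on the unseen 25% of the Nature Collection, the prefix-matching step can be sketched as a simple lookup. The following is a minimal illustration, assuming EL codes are sequences of integer symbol ids and propositions are stored as strings; the symbol values, the function names and the proposition spellings are hypothetical and are not taken from the authors' Knowledge Base.

```python
from collections import defaultdict


def build_prefix_index(labelled_codes, n=2):
    """Map the first n symbols of each labelled EL code to its propositions."""
    index = defaultdict(set)
    for code, propositions in labelled_codes:
        index[tuple(code[:n])].update(propositions)
    return index


def tag_unseen(code, index, n=2):
    """Return the propositions whose saved prefix equals the code's first n symbols."""
    return set(index.get(tuple(code[:n]), ()))


# Hypothetical ten-symbol EL codes with propositions learned from an Expert.
labelled = [
    ((12, 7, 3, 3, 9, 1, 0, 4, 4, 2), {"move-on(boat-sail, water-liquid)"}),
    ((12, 7, 5, 3, 9, 9, 0, 4, 1, 2), {"move-on(boat-nosail, water-liquid)"}),
    ((4, 19, 3, 8, 9, 1, 6, 4, 4, 2), {"not-move-on(boat-sail, water-frozen)"}),
]
index = build_prefix_index(labelled)

unseen_code = (12, 7, 8, 8, 2, 1, 0, 5, 4, 2)
tags = tag_unseen(unseen_code, index)
no_tags = tag_unseen((99, 99, 0, 0, 0, 0, 0, 0, 0, 0), index)
```

Using only the leading symbols treats the start of an EL sentence as a coarse category marker; a code whose prefix was never seen simply receives no propositions.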
We first utilized an LR-SCS threshold to make sure the candidates have similar distributions around the entity as the sample actions do. This process aims to separate the actions based on their relations, subject-verb or verb-object, with the entity in the text. For example, a horse can run and jog, and a person can also ride a horse; mixing these actions together introduces a large quantity of noise during the expansion. From the previous analysis, it was determined that the position of an action is quite flexible (i.e., it can occur at sibling, cousin, or second cousin positions). Therefore, in addition to the LR-SCS distribution, we built rules for action expansion using the same approach as in the section on category completion. Overall, we found 1082 new action candidates during the expansion, and 523 (48.3%) of them were correct. Out of the 1082, the highest number found was for the “horse” entity (127 new actions, 50.4% accuracy) and the lowest was for the “cube” entity (18 new actions, 83.3% accuracy). The performance of action expansion is the lowest when compared with entity expansion and attribute expansion. This is because (i) actions have a much more flexible syntactic structure than attributes and entities and (ii) grammar induction near actions is more error prone. Therefore, improvement in both the parsing process and the new action discovery method is needed to achieve stronger performance.
Fig. 11. (Left) AKB used for dialogue to learn that a boat can move on liquid water. (Right) Knowledge captured from the (bottom) dialogue interaction.
2.5 Lesson V: Behavior
Background. Armed with a representation of state in terms of a relational graph and an associated action space, we can now consider the possibility of behavior recognition. A purely machine learning approach can be taken, where sets of annotated behaviors (sequences of observed actions or changes of state) can be directly classified. RNNs and LSTMs have proven useful for such tasks; however, we consider a different approach that was developed in prior work [16]. Based on the idea of mirror neurons, it was argued that a byproduct of learning how to perform a behavior is the ability to recognize a behavior. In this previous work, a simulator with an articulated human-like agent is allowed to interact with various objects. The agent is given a set of tasks, and a form of reinforcement learning is used to discover policies that can accomplish these tasks. When exposed to real world imagery, an articulated model is fitted to an observed person, and recognition of a sequence of actions is evaluated by considering the probability that such a sequence could have been produced by a given policy. Shifting away from the topic of articulated motion, we have developed the Artificial Intentionality Engine (AIE) environment, which allows an agent to learn how to solve problems using elements of the relational graph and action space technologies developed during previous lessons. We argue that with increasingly complex state representations and action spaces, various System 1 mechanisms must be incorporated into the AIE formulation.
System 1 Mechanics. The AIE builds on the state and action representations of the previous lessons. We have argued that the grounding of a concept may in some ways be established based on connections between concepts. Such connections can be constructed via analogy, metaphor and memory. Another approach is to consider the utility or pragmatism of conceptual knowledge. From this point of view, each concept can be thought of as being decorated by various actions or options that can be executed by one or more agents. However, if all concepts are active simultaneously, planning and inference can quickly become overwhelmed. To this end, Daniel Kahneman [17] argues for a System 1 and System 2 architecture, where System 1 is responsible for activating the appropriate concepts based on observed context, allowing System 2 to perform planning and reasoning in an efficient manner.
An approach to such a System 1 mechanism is the idea of cascading concept activation, where environmental cues can activate concepts which in turn may activate other concepts as well as analogies and memories. In this paradigm, both analogies and memories can themselves be sources of concept activation. We argue that the AIE can be viewed as a form of contextual reasoning as well as the Conceptual Common Ground (CCG) needed for multiple agents to solve problems in a cooperative framework. Figure 12 illustrates the means by which concepts can be used to activate other concepts, analogies and memories. As illustrated, analogies and memories can in turn activate other concepts. Once a desire emerges, actions or options that are associated with active concepts can be used to satisfy desires using standard planning methods. By sampling from the concept lists associated with observed objects and currently active concepts, new concepts can be activated. The top panel of Fig. 13 illustrates how objects in the physical world can activate concepts resulting in a form of internal experience. Objects are associated with various modes of existence; they have attributes as a function of mode, and they have physical states such as location, temperature and velocity. The bottom panel of Fig. 13 illustrates the means by which actions associated with activated concepts can be used to satisfy desires. In this example, an agent must utilize various available actions to satisfy a desire to drink.
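The cascading activation described above can be illustrated with a small graph traversal. This is a simplified, deterministic sketch rather than the AIE mechanism itself: the text describes sampling from concept lists, whereas here every linked concept is activated, and the concept names, the `links` mapping and the optional depth limit are assumptions made for illustration.

```python
from collections import deque


def cascade(links, cues, max_depth=None):
    """Spread activation from environmental cues along concept-to-concept links.

    links maps a concept to the concepts, memories or analogies it activates.
    max_depth optionally bounds how many hops the cascade may take.
    """
    active = set(cues)
    frontier = deque((cue, 0) for cue in cues)
    while frontier:
        concept, depth = frontier.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for target in links.get(concept, ()):
            if target not in active:
                active.add(target)
                frontier.append((target, depth + 1))
    return active


# Hypothetical links for an agent that observes a cup.
links = {
    "cup": ["drink"],
    "drink": ["water", "memory:thirst"],
    "water": ["faucet"],
}

everything = cascade(links, {"cup"})
one_hop = cascade(links, {"cup"}, max_depth=1)
```

Bounding the depth is one simple way to keep the set of active concepts small, so that a downstream planner only considers actions attached to concepts that the current context makes relevant.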
Fig. 12. Visualization of the conceptual activation paradigm. Note that the P symbols represent possible actions or options.
We have developed the capacity to automatically incorporate a given object ontology into the AIE. We assume a list of objects and associated modes (e.g., tree: living, logs, fire). We also consider three basic types of actions: Travel, Transport and Transform. Various relations can be defined, such as an agent holding an object (holding(a,o)) or two objects being near each other (near(o1,o2)). An action is defined by a set of relations that must be true in order to perform the action, as well as a set of results that will be true once the action has been performed. In addition, an action maintains a set of instructions that are performed upon execution of the action. Within the AIE environment, sample planning can be used to construct a sequence of actions that transforms an arbitrary initial state into a desired goal state. Given the ability to produce such plans, we then shift our attention to the concept of policy generation. For a given goal state, a policy π(x) is a distribution over the action space as a function of the current state x, such that if the policy is followed, the agent will generally arrive at the desired goal state. In the AIE environment, multiple goal states were selected; for each goal state, Q-learning was used to produce a policy. Sample sequences or trajectories for each behavior were produced by randomly selecting an initial state and then applying the policy. Taking a more direct approach, given access to the behavior-generating policies themselves, we can reduce the recognition process to simply asking “what is the probability that the observed sequence could have been produced by following a given policy π(x)?” Thus, given the ability to produce both plans and hence policies, the act of recognition and annotation becomes tractable. However, as the environment becomes increasingly complex, the curse of dimensionality quickly hinders the agent’s ability to construct plans and policies.
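The recognition question posed above can be sketched as a likelihood comparison across policies. The example below is an assumption-laden illustration, not the authors' implementation: each policy is a table mapping a state to a distribution over actions, state transitions are assumed not to depend on which policy is followed (so only the action probabilities matter), and the state names, action names and the probability floor are hypothetical.

```python
import math


def sequence_log_likelihood(policy, trajectory, floor=1e-6):
    """Sum of log pi(action | state) over an observed (state, action) sequence.

    The floor avoids log(0) for actions that the policy never selects.
    """
    total = 0.0
    for state, action in trajectory:
        probability = policy.get(state, {}).get(action, 0.0)
        total += math.log(max(probability, floor))
    return total


def recognize(policies, trajectory):
    """Name of the policy most likely to have produced the trajectory."""
    return max(
        policies,
        key=lambda name: sequence_log_likelihood(policies[name], trajectory),
    )


# Two hypothetical behaviors, each given as a table-based policy.
policies = {
    "drink": {
        "thirsty": {"grab_cup": 0.9, "walk": 0.1},
        "holding_cup": {"fill_cup": 0.8, "walk": 0.2},
    },
    "travel": {
        "thirsty": {"grab_cup": 0.1, "walk": 0.9},
        "holding_cup": {"fill_cup": 0.1, "walk": 0.9},
    },
}

observed = [("thirsty", "grab_cup"), ("holding_cup", "fill_cup")]
label = recognize(policies, observed)
```

Working in log space keeps long trajectories numerically stable, since products of many small probabilities would otherwise underflow.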
As previously argued, the AIE must incorporate various System 1 technologies in order to address the issue of complexity.
Fig. 13. (Top) How objects in the environment can be used to activate concepts, which can then activate other concepts, memories and analogies. (Bottom) How actions associated with active concepts can be used to satisfy desires.
Attention. Given a large set of trajectories through the state space, we argue that such data can be used for the purpose of developing attention modules, which can be leveraged to address issues associated with high-dimensional state spaces. Attention modules have had a strong impact on the task of translation. Given an input sentence in English (X), the goal of machine translation is to generate a corresponding sequence in French (Y). Many original approaches involved an LSTM formulation, which can be viewed as a form of sequence completion. It was then shown that as Y was formed, an attention module could be used to emphasize certain elements of the input sequence and that this attention module could be learned while developing the general translation capability. We argue that while learning how to perform actions in order to achieve goals, transformations of state can be viewed as a form of translation. Thus, the task of predicting future states can be used to construct an attention model: given a goal and the current state, which state variables are the most important? Given such a distribution, planning processes such as sample planning can be biased towards actions that directly affect the state variables that are the current focus of attention.
Genetic Algorithms. In order to optimize the values (0 or 1) associated with a concept activation matrix (i.e., a matrix describing linkages between concepts), a genetic algorithm was developed. The concept activation matrix can then be considered in terms of genetic material or DNA. In order to assess the merit of a given trial that is associated with the concept activation matrix, ten tasks/desires were established. Each task requires that, at some point in time, the appropriate concepts are active simultaneously. During a given lifetime, the number of accomplished tasks multiplied by 100 is the merit given to an individual trial.
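A genetic search over a binary concept activation matrix, with merit defined as 100 times the number of accomplished tasks, can be sketched as follows. This is a toy version under several assumptions that the text does not specify: a task counts as accomplished when all of its required concepts are reachable from its cue concepts through links set to 1, selection is merit-proportional with one elite individual carried over, and the initial link density, mutation rate and example tasks are hypothetical. The merit used here places no cost on over-activation, so a densely connected matrix scores well by construction.

```python
import random
import statistics


def active_set(matrix, cues):
    """Concepts reachable from the cue concepts through links set to 1."""
    active, stack = set(cues), list(cues)
    while stack:
        i = stack.pop()
        for j, bit in enumerate(matrix[i]):
            if bit and j not in active:
                active.add(j)
                stack.append(j)
    return active


def merit(matrix, tasks):
    """100 points for every task whose required concepts are all active at once."""
    return 100 * sum(required <= active_set(matrix, cues) for cues, required in tasks)


def evolve(n, tasks, generations=20, population=100, density=0.1,
           mutation_rate=0.01, seed=0):
    """Return (max merit, median merit) per generation for an n-by-n bit matrix."""
    rng = random.Random(seed)
    size = n * n

    def score(genome):
        rows = [genome[r * n:(r + 1) * n] for r in range(n)]
        return merit(rows, tasks)

    pop = [[int(rng.random() < density) for _ in range(size)] for _ in range(population)]
    history = []
    for _ in range(generations):
        scores = [score(genome) for genome in pop]
        history.append((max(scores), statistics.median(scores)))
        best = pop[scores.index(max(scores))]
        weights = [s + 1 for s in scores]      # +1 keeps zero-merit genomes selectable
        children = [best[:]]                   # elitism (an assumption of this sketch)
        while len(children) < population:
            mother, father = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, size)       # one-point cross-over
            child = mother[:cut] + father[cut:]
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = children
    return history


# Three hypothetical tasks over six concepts: (cue concepts, required concepts).
tasks = [
    (frozenset({0}), frozenset({1, 2})),
    (frozenset({3}), frozenset({4})),
    (frozenset({5}), frozenset({0, 3})),
]
history = evolve(6, tasks, generations=5, population=20)
```

Because the best individual is copied unchanged into the next generation, the maximum merit in `history` cannot decrease, while the median merit is free to fluctuate.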
Fig. 14. Results from Genetic Algorithms Experiment.
Fig. 15. Child as Programmer (CAP) Pipeline.
For the purpose of experimentation, a genetic algorithm consisting of 20 generations and 100 individuals was constructed. Merit-based reproduction consisted of genetic cross-over as well as random mutation. Figure 14 shows the median merit as well as the maximum merit achieved during each generation.
Metaphors. In addition to activation based on genetic algorithms, we also consider the process of activation based on metaphors. In this section we focus on concepts associated with specific objects. To this end, we consider the Child as Programmer paradigm. The Child as Programmer (CAP) module was inspired by the work of Michael Tomasello [18], who argued that the use of the human body for signaling and pantomime was a precursor for spoken language. Accordingly, we have constructed the means for producing a metaphor between an arbitrary object and the human body. The CAP approach assumes that a metaphor can be constructed by an expert using a set of image processing modules with the goal of segmenting an image into parts and labelling those parts with respect to a known parts library (see Fig. 15). By providing a set of image/program pairs, machine learning methods similar to those used for image captioning can then be used to automatically produce a metaphor-generating program for an arbitrary image. Given the ability to produce metaphors between an object class and the human body, metaphors or analogies between object classes can be constructed using what can be thought of as a form of graph isomorphism between the part labels assigned to each object class.
3 Conclusions
Across five lessons, we have utilized emergent language systems to address some of the key issues underlying the symbol grounding problem. Specifically, we have explored languages associated with static images, image compositions, and images demonstrating affordances and actions. Across all of these experiments, we found strong performance in the EL pipeline: encoding networks converged in referential game training, and EL symbols corresponded with semantic attributes in input data streams to a high degree. This is evidently due to information encoded in learned grammars. Taken together, our efforts provide evidence that symbolic representations are effective in capturing image semantics and demonstrate compositional characteristics that are critical for current efforts in AI explainability.
In our work we have consistently found that the EL codes cluster on related entities and attributes. Additionally, we have found that branches in our AKB trees shine a light on entirely new semantic concepts. The information captured by the EL codes and organized/visualized in AKBs allows us to engage in sensical dialogue with an Expert. This dialogue enables our discovery of new attributes, affordances, events and actions. Each step in the learning process has provided insights into the concept space of the EL. That is, we learn which EL code sequences are representative of different affordances, events, etc. Therefore, when novel images are observed, we can extend prior learnings encoded in the AKB to generate a conceptual understanding based on visual features.
We have also demonstrated how an unsupervised machine learning technique can be used to generate syntactic parse trees based on an NL corpus. Our PDS module then captures the distributions of attributes/actions occurring around entities, in addition to contextual patterns. This allows the agent to generalize attributes/actions across entities and also discover new attributes, actions, and entities. In the first round of PDS analysis, we found 2745 new attributes with 75.3% accuracy, 1637 new actions with 34.5% accuracy, and 932 new entities with 76.8% accuracy. Our efforts show how Experts can identify and address errors through a visualization tool, resulting in new training data that can be presented to the EL and MT modules.
One key premise behind the ALAS system is that a byproduct of learning how to perform a behavior is the ability to recognize instances of that behavior. However, with increasingly complex domains, the curse of dimensionality quickly renders traditional planning and policy learning methods intractable. To this end, we have hypothesized that a form of cascading concept activation, which can be viewed as a System 1 capability, allows for the necessary context-driven dimensionality reduction. Methods such as genetic algorithms, metaphor generation via the Child as Programmer paradigm and attention learning mechanisms support this vision.
References
1. Chowdhury, A., Kubricht, J.R., Sood, A., Tu, P., Santamaria-Pang, A.: Escell: emergent symbolic cellular language. In: IEEE ISBI, pp. 1604–1607 (2020)
2. Devaraj, C., Chowdhury, A., Jain, A., Kubricht, J.R., Tu, P., Santamaria-Pang, A.: From symbols to signals: symbolic variational autoencoders. In: IEEE ICASSP, pp. 3317–3321 (2020)
3. Havrylov, S., Titov, I.: Emergence of language with multi-agent games: learning to communicate with sequences of symbols. arXiv preprint arXiv:1705.11192 (2017)
4. Kubricht, J.R., Santamaria-Pang, A., Devaraj, C., Chowdhury, A., Tu, P.: Emergent languages from pretrained embeddings characterize latent concepts in dynamic imagery. Int. J. Seman. Comput. 14(03), 357–373 (2020)
5. Lazaridou, A., Peysakhovich, A., Baroni, M.: Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182 (2016)
6. Santamaria-Pang, A., Kubricht, J.R., Chowdhury, A., Bhushan, C., Tu, P.: Towards emergent language symbolic semantic segmentation and model interpretability. In: MICCAI, pp. 326–334 (2020)
7. Santamaria-Pang, A., Kubricht, J.R., Devaraj, C., Chowdhury, A., Tu, P.: Towards semantic action analysis via emergent language. In: IEEE AIVR (2019)
8. Tellex, S., et al.: Approaching the symbol grounding problem with probabilistic graphical models. AI Mag. 32(4), 64–76 (2011)
9. Liu, Y., Lapata, M.: Learning structured text representations. In: Proceedings of TAC (2017)
10. Kim, Y., Rush, A.M., Yu, L., Kuncoro, A., Dyer, C., Melis, G.: Unsupervised recurrent neural network grammars. In: Proceedings of NAACL (2019)
11. Liu, T., Strzalkowski, T.: Bootstrapping events and relations from text. In: Proceedings of EACL, pp. 296–305 (2012)
12. Jin, L., Song, L., Zhang, Y., Xu, K., Ma, W.Y., Yu, D.: Relation extraction exploiting full dependency forests. Proc. AAAI Conf. Artif. Intell. 34(05), 8034–8041 (2020)
13. Li, F.-F., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006)
14. Lake, B., Salakhutdinov, R., Gross, J., Tenenbaum, J.: One shot learning of simple visual concepts. In: Proceedings of the Annual Meeting of the Cognitive Science Society (2011)
15. Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810 (2017)
16. Tu, P., Sebastian, T., Gao, D.: Action recognition from experience. In: 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, pp. 124–129 (2012)
17. Kahneman, D.: Thinking, Fast and Slow. Macmillan, Basingstoke (2011)
18. Tomasello, M.: Origins of Human Communication. MIT Press, Cambridge (2010)
Text-Based Speaker Identification for Video Game Dialogues
Dušan Radisavljević1(B), Bojan Batalo2, Rafal Rzepka1, and Kenji Araki1
1 Faculty of Information Science and Technology, Hokkaido University, Sapporo, Japan
http://arakilab.media.eng.hokudai.ac.jp/
2 Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba, Japan
Abstract. Speaker identification presents a challenging task in the field of natural language processing and is believed to be an important step in building believable human-like systems. Although most of the existing work focuses on utilizing acoustic features for the task, in this paper we propose a text-dependent transformer-based machine learning approach. Our research shows promising results, improving on the existing work that utilized textual features for the task, while simultaneously introducing a new dataset acquired from a commercial video game. To our knowledge, this is the first time text-dependent speaker identification has been performed in the domain of video games.
Keywords: Speaker identification · Video games · Dialogue · BERT
1 Introduction
Speaker identification is the task of connecting an utterance to the speaker that produced it. It can be used in the development of dialogue systems, with some authors citing it as a pivotal step in the development of believable human agents ([6], 2003). Given a set of possible speakers, speaker identification can be formulated as a simple classification problem in which each utterance is assigned to a class, that is, to a speaker. Humans tend to perform speaker identification unconsciously, gathering audiovisual information, such as the direction from which the sound came or even vocal timbre, in order to connect utterances to a speaker. In modern days, when using textual communication through chat applications or text messages, information like emojis and stickers can sometimes help us identify the sender just from the content of the message. Additionally, characteristics such as style of speech or common phrases are also used to determine the speaker. It should be noted that when analyzing daily conversations, we do require a large amount of information to identify a certain speaker from a typical phrase, e.g. to establish their style of speech. However, storytelling mediums, like movies, TV shows, books or video games, especially those set in the realm of fantasy, often accentuate these idiosyncrasies in order to increase the immersion experience and make the story more believable.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 44–54, 2022. https://doi.org/10.1007/978-3-030-82199-9_4
Here we present our work, which consists of performing two different subtasks of speaker identification: the first is the classification of utterances to a single speaker, focusing on character-specific traits, and the second is the mapping of utterances to a class of speakers, in turn detecting class-specific traits. For this we used two datasets, one that is publicly available as part of the LIGHT research platform ([16], 2019), and another one gathered from the commercial video game Dragon Age: Origins using publicly available tools. Due to the large number of utterances that are available per speaker, we decided to use the dataset acquired from Dragon Age: Origins to identify a single speaker, and the LIGHT dataset to identify the class of the speaker. Our approach showed an accuracy of 90.27% when identifying the class of the speaker and of 74.86% on a character identification level, outperforming the methods used as a baseline. The contributions of our work are listed as follows:
– We proposed a simple yet effective transformer-based approach for speaker identification that utilizes textual data only.
– To our knowledge, our work is the first of its kind to examine the possibility of using video games as a source of data for the speaker identification task, further opening the possibility of using interactive storytelling mediums for the said task.
– Although there has been some previous work on creating video game-related corpora for NLP tasks from commercial video games ([1], 2020), we created the first dataset centered around dialogues in this domain, which can be found at the following GitHub link: https://github.com/dradisavljevic/DAODataset.
– The proposed method is transferable to transcripts of real life conversations.
The paper is structured as follows. Section 2 presents previous work done in the area of text-based speaker identification. In Sect. 3 we give an overview of the datasets used for our evaluation. Section 4 presents the methods used in this paper, while Sect. 5 describes the experiments and lists the results. Section 6 discusses both our findings and the results listed in Sect. 5. Finally, Sect. 7 concludes this paper and provides our plans for future work.
2 Related Works
Most of the work in the area of speaker identification is centered around speech or signal processing ([14], 2019); however, the work of [10] (2007) has shown that textual features can also reflect the personality of the speaker, suggesting that a text-dependent approach can be beneficial to speaker identification. To the best of our knowledge, experiments that utilize only textual features for the task of identification are few in number ([12], 2015; [8], 2012; [9], 2017). [12] proposed using logistic regression and recurrent neural networks to detect a change in speaker through turn taking in movie dialogues.
D. Radisavljević et al.
The task itself is different from ours, as it focuses on detecting turn changes rather than identifying actual speakers. Another work that focuses on movie dialogues is that of [8], which was used as a starting point for our experiments. They suggested using various classification algorithms, such as Naive Bayes (NB), K-Nearest Neighbors (KNN) and Conditional Random Fields (CRF), on vectors formed out of certain stylistic features, such as the number of adverbs or adjectives per word in each utterance, in order to identify the speaker on a dataset acquired from the movie script database¹. The work of [9] used these results as a baseline and conducted additional experiments with Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) on transcripts from the TV show Friends, reporting improved results with CNN. However, to the best of our knowledge, no research has been conducted on text-dependent speaker identification in the domain of video games.
3 Datasets
For the task of speaker identification, two video game-related datasets are used:
1. A dataset composed of textual dialogues extracted from the Dragon Age: Origins² video game developed by BioWare.
2. Dialogue data extracted from a large-scale, crowdsourced text adventure game dataset, developed as part of the LIGHT research platform introduced in [16].
3.1 Dragon Age: Origins Dialogue Dataset
Dragon Age: Origins is an action role-playing video game developed by BioWare in 2009. It features a large amount of dialogue between the player and NPCs³ that have different personalities, which makes it suitable for the task of speaker identification. Text data was obtained using two resources: the Dragon Age Wiki⁴, a fan website that contains information related to the video game series, and the Dragon Age Toolset⁵, software released by the developers of the game to allow modification of in-game content. The Dragon Age Toolset comes with a Microsoft SQL Server relational database containing all the video game resources, including dialogue utterances in textual form. Utterances were extracted from the database, preprocessed and stored in a dialogue file. Each utterance has a speaker and a main listener (the persona to whom the utterance is spoken), as well as an identifier of the setting under which the utterance occurs (part of a quest, a random encounter, etc.). All the missing information was filled in manually by referencing the Dragon Age Wiki website.
¹ http://www.imsdb.com/.
² https://www.ea.com/games/dragon-age/dragon-age-origins.
³ Non-Playable Character.
⁴ https://dragonage.fandom.com/.
⁵ https://dragonage.fandom.com/wiki/Toolset.
Text-Based Speaker Identification for Video Game Dialogues
We separated and labeled utterances that are spoken by the player and ten of their NPC companions. Utterances that belong to remaining characters in the game are all labeled as ‘Others’, as they are too few in number per character. Figure 1 gives a graphical representation of the dialogue distribution within the dataset.
Fig. 1. Number of utterances obtained from Dragon Age: Origins per character
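The labelling step described above (the player and their companions keep their own label, everyone else becomes 'Others') can be illustrated with a short sketch. It is not the authors' extraction code: the record layout, the helper names and the abbreviated speaker set are assumptions made only for this example, while the 'speaker' attribute and the 'Others' label come from the text.

```python
from collections import Counter

# Abbreviated, illustrative set of main speakers (assumption: the real
# dataset keeps the player and ten companions).
MAIN_SPEAKERS = {"Player", "Alistair", "Leliana", "Dog"}


def label_speaker(speaker, main_speakers=MAIN_SPEAKERS):
    """Keep the label of a main speaker; fold every other character into 'Others'."""
    return speaker if speaker in main_speakers else "Others"


def label_distribution(utterances, main_speakers=MAIN_SPEAKERS):
    """Count utterances per label, i.e. the kind of data plotted in Fig. 1."""
    return Counter(label_speaker(u["speaker"], main_speakers) for u in utterances)


# Toy records; the texts and the 'Innkeeper' and 'Guard' speakers are invented.
utterances = [
    {"speaker": "Alistair", "text": "You know, one good thing about the Blight..."},
    {"speaker": "Innkeeper", "text": "Welcome, traveller."},
    {"speaker": "Dog", "text": "(Happy bark!)"},
    {"speaker": "Guard", "text": "Move along."},
]

print(label_distribution(utterances))
```

Counting after relabelling makes the class imbalance visible early, before any classifier is trained.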
3.2 LIGHT Research Platform Dataset
LIGHT is a fantasy text adventure game research platform designed for studying grounded dialogue. The dataset contains a large set of crowdsourced interactions (about 11,000 episodes) made up of actions and dialogue utterances. The data is publicly available through the ParlAI⁶ platform. For this work we only considered the episodes that are part of the original LIGHT dataset. All the utterances from the episodes were extracted, preprocessed and stored in a dialogue file. The LIGHT dataset contains a large number of characters (940 characters with utterances) and, unlike the other dataset we utilized, lacks a clear main character, as it does not follow a linear story. Additionally, most of the characters in the dataset have a character class assigned to them, while about 20% of them belong to an undefined class. We manually assigned each unlabeled character one of the three existing classes (object, creature or person) based on its persona description, which is also given in the dataset. For these reasons, we turned to predicting the character class from an utterance, rather than a single character, for this dataset. Figure 2 shows the utterance distribution per character class in the LIGHT dataset.
⁶ http://parl.ai/projects/light.
Fig. 2. Number of utterances obtained from the LIGHT project dataset per character class
4 Methods
We have evaluated four algorithms for the task of speaker identification. The methods proposed in [9] and [8] were used as baselines for our approach, which is based on Bidirectional Encoder Representations from Transformers (BERT) ([4], 2019).
4.1 K-Nearest Neighbors
In the work of [9], KNN was used as a baseline approach for speaker identification, while in the work of [8] it was the best performing algorithm for the same task. Although other approaches proposed by [9] outperform the KNN method, we still decided to use it as a baseline because of its stability and simple implementation. The algorithm was implemented in Python using the Faiss⁷ library for speed and efficiency. The implementation followed the specifications given by [8], using the same stylistic features in combination with cosine similarity as the similarity measure.
4.2 Convolutional Neural Network
[9] reported satisfactory results using the CNN model proposed by [7] (2014) with minor modifications. This approach is selected as another baseline for our experiments. The model consists of a single convolution layer, followed by a global max-pooling layer and a fully-connected layer with a softmax function used to normalize the output. For the sake of simplicity, we decided not to enforce the Euclidean (L2) norm constraint, as the work of [17] (2016) found that it has little effect on the final result. Unlike the previous work, we added a trainable embedding layer instead of using pretrained word embeddings, because of the large amount of colloquialisms and fantasy-related vocabulary present in the data, as well as the size of the data itself. This made the training phase slower than in previous work, as the embeddings are trained as well. To avoid possible overfitting, we used early stopping ([11], 1998).
⁷ https://github.com/facebookresearch/faiss.
4.3 Convolutional Neural Network with Utterance Concatenation
The final baseline method is also based on the proposal of [9]. It uses a CNN while grouping the utterances of a single character within a single scene, in chronological order. [9] reported that this approach achieves improved F1 and accuracy scores on the speaker identification task. We replicated the process by using a single episode from the LIGHT dataset as the equivalent of one TV series scene. For the dataset that comes from the Dragon Age: Origins video game, we grouped utterances that have the same ‘speaker’ and ‘settingname’ attributes. Utterances that share the same value for the ‘settingname’ attribute belong to the same dialogue tree⁸, which makes them the equivalent of a TV series scene. [9] also improved the method by restricting the prediction search space to the set of speakers present in the scene. However, since the majority of video game dialogues are limited to only 2 participants, we decided not to follow this approach.
4.4 BERT
BERT is a transformer-based model that was published in 2018 and has since been used for a variety of NLP tasks, achieving considerably better results than other approaches, especially for classification problems ([13], 2019; [5], 2020). The work of [2] (2019) reported satisfactory results for multi-label text classification on documents, which inspired us to test BERT at the level of a single utterance. For both speaker and speaker class identification, we used the uncased BERT base model, optimized with the Adam algorithm with weight decay. The experiments were run for 5 epochs on both datasets, both at the single utterance level and with concatenated texts. The dropout was set to 0.1 and the learning rate to 1e-5. Due to memory constraints, we used a batch size of 3.
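The utterance concatenation used by the CNN-Concatenation and BERT-Concatenation variants can be sketched as a grouping step. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the records are assumed to be dictionaries that are already in chronological order, the function name is hypothetical, and only the 'speaker' and 'settingname' attribute names are taken from the text.

```python
from collections import OrderedDict


def concatenate_utterances(utterances, scene_key="settingname"):
    """Join all utterances of one speaker within one scene into a single text.

    Assumes `utterances` is already sorted chronologically, so the joined
    text preserves the order in which the lines were spoken.
    """
    groups = OrderedDict()
    for utt in utterances:
        key = (utt["speaker"], utt[scene_key])
        groups.setdefault(key, []).append(utt["text"])
    return [
        {"speaker": speaker, scene_key: scene, "text": " ".join(texts)}
        for (speaker, scene), texts in groups.items()
    ]


# Toy dialogue; the scene names and lines are invented for illustration.
dialogue = [
    {"speaker": "Leliana", "settingname": "camp_talk", "text": "I had a dream."},
    {"speaker": "Player", "settingname": "camp_talk", "text": "Tell me about it."},
    {"speaker": "Leliana", "settingname": "camp_talk", "text": "There was darkness."},
    {"speaker": "Leliana", "settingname": "chantry", "text": "This place is peaceful."},
]

for sample in concatenate_utterances(dialogue):
    print(sample)
```

For LIGHT, the same function could be called with an episode identifier as `scene_key`, since a single episode is treated as the equivalent of a scene.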
5 Experiments
For our experiments, all the utterances extracted from both datasets are first cleaned and tokenized. With the KNN-based and CNN-based methods we used a custom tokenizer that is a minor modification of the one used in the work of [7], while for the BERT model we used a BERT tokenizer. For the KNN method, 70% of the data was taken as a training set and 30% as an evaluation set. For the other approaches, 70% of the data was used for training, 15% for validation and the remaining 15% for evaluation. Due to the nature of the data split, experiments were conducted over 10 trials, taking the mean value as the result. We established experimentally that, for the KNN approach, results are best for k=13 on the Dragon Age: Origins dataset and for k=15 on the LIGHT dataset. Figures 1 and 2 show that there is a severe class imbalance in both datasets. This led us to take an oversampling approach ([3], 2002) for the KNN-based and CNN-based methods, in order to prevent classes with a large number of examples from skewing the classifier output. However, since BERT has been shown to perform well even on an imbalanced dataset ([15], 2019), in this work we report the results it achieved without any data augmentation techniques. For the training phase of the CNN we used a batch size of 100 and early stopping to determine the number of epochs needed to train without overfitting. For the BERT approach, we used 5 epochs. Table 1 displays our results, with the upper half showing results on the Dragon Age: Origins dataset for the speaker identification task and the lower half reporting results on the LIGHT dataset for the speaker class identification task.
⁸ https://en.wikipedia.org/wiki/Dialogue_tree.
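As an illustration of the KNN baseline, the sketch below classifies a feature vector by majority vote among its k most cosine-similar training vectors. It is a pure-Python brute-force stand-in rather than the Faiss-based implementation used in the experiments, and the two-dimensional feature vectors are invented; in the paper the vectors hold stylistic features and k is 13 or 15.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_predict(query, train_vectors, train_labels, k):
    """Return the majority label among the k training vectors most similar to `query`."""
    ranked = sorted(
        ((cosine_similarity(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data: two hypothetical stylistic features per utterance.
train_vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
train_labels = ["Alistair", "Alistair", "Leliana", "Leliana"]

print(knn_predict((1.0, 0.05), train_vectors, train_labels, k=3))
```

A brute-force scan like this is quadratic in the dataset size, which is one reason an indexed library is the more practical choice at scale.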
6 Discussion
Table 1. Performance per model. MF1 stands for macro F1 score and WF1 for F1 score weighted by number of true labels per class.

Model               Precision  Recall  Accuracy  MF1    WF1
Dragon Age: Origins dataset
KNN                 3.60       10.33   20.05     4.61   8.07
CNN                 43.66      36.25   39.66     36.79  37.30
CNN-Concatenation   36.33      27.96   45.67     24.85  34.78
BERT                46.43      52.76   74.86     49.06  74.23
BERT-Concatenation  60.58      64.42   70.21     50.55  69.3
LIGHT dataset
KNN                 38.58      33.74   76.33     30.12  67.00
CNN                 76.02      52.68   79.91     49.85  72.48
CNN-Concatenation   53.17      49.34   83.29     50.68  62.12
BERT                48.83      63.67   83.42     50.88  82.59
BERT-Concatenation  70.67      73.02   90.27     71.71  90.19

In the work of [8] KNN was the best performing algorithm on a movie script dataset, with an accuracy of 30.39%, while other metrics were not reported. The experiments of [9] had the most success using CNN in combination with
utterance concatenation while also restricting the set of prediction labels to speakers present in the scene on the dialogues acquired from the transcripts of the TV show Friends. They reported an accuracy of 46.48% and an F1 score of 44.19%. Using KNN and CNN on the Dragon Age: Origins dataset proved to be less successful, with CNN yielding the best results out of all the baseline methods. It should however be noted that, surprisingly, concatenated utterances led to an increase in the number of false positives and false negatives (lower recall and precision values) rather than improving the basic CNN approach. After looking at the results, we believe that this is due to concatenated utterances being more difficult to discern one from another, as they share from a common pool of words and topics. When it comes to the experiments we conducted on the LIGHT dataset, the results achieved are better than those using the same methods on the Dragon Age: Origins dataset. This is to be expected due to the smaller number of labels present for this subtask, making it easier to connect an utterance to character class than to a single character. We would like to note that, even though using CNN with concatenated utterances led to higher accuracy, it also led to an increase in both false positives and false negatives (reduction in precision and recall). Although this was surprising, since we tried to balance out the classes for the experiment, it is our understanding that synthetic examples led to generalization which in turn resulted in a slightly worse than expected performance. According to Table 1, BERT outperforms baseline methods without any data augmentation approaches. However, BERT still produces a considerable amount of false positive and false negative predictions per class, as indicated by the low precision and recall. We believe that this can be improved by using data augmentation techniques in the future alongside the BERT model. 
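The gap between accuracy and the macro-averaged scores discussed above is typical of imbalanced data. The sketch below, with invented labels, computes macro F1 (MF1) and support-weighted F1 (WF1) from scratch to show how a classifier that ignores a minority class can still look good on accuracy and WF1. The function names are chosen for this example only.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 score for every label that occurs in the true or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (MF1)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Per-class F1 weighted by the number of true examples per class (WF1)."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Imbalanced toy example: the minority class is never predicted.
y_true = ["person"] * 8 + ["creature"] * 2
y_pred = ["person"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

In this toy case accuracy is 0.8 while MF1 is 4/9, because the missed class contributes an F1 of zero to the unweighted mean.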
Figure 3 is a confusion matrix for characters present in the Dragon Age: Origins dataset. We can see from the confusion matrix that despite them originating from the class with fewest utterances, predictions for the ‘Dog’ character label achieve an accuracy of almost 100%, due to these utterances being specific. We believe that this demonstrates good performance of the model on sets of utterances that are less generic in nature. It is interesting to note that characters that this model tends to confuse the most are those that express some shared personality traits. For example, both the ‘Leliana’ and ‘Alistair’ are stereotypical good characters that disapprove of ‘Player’ character acting in an evil way, and the model has proportionally shown the second highest confusion in prediction between the two of them. The highest confusion for the model is between the ‘Player’ and ‘Others’ character label. We believe that this is due to a wide variety of utterances that are available as dialogue options for the ‘Player’ character. Some of them are mutually exclusive, meaning that if a player was to select one utterance as a dialogue option, other utterances that appeared as options would not come up in the game. This gives the players possibility of reflecting different personality traits through dialogue. However, since these utterances were not separated in
the dataset for the experiments performed in this work, the ‘Player’ character most likely expressed multiple personality traits, some even conflicting. This most likely led the model to confusing the ‘Player’ and ‘Others’ category, as the characters in the ‘Others’ category come from different backgrounds and themselves exhibit a wide variety of personality traits.
Fig. 3. Confusion matrix for the Dragon Age: Origins dataset on a single utterance level
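The reading of the confusion matrix described above (per-label accuracy and the pair of labels confused most often, proportionally) can be reproduced with a few lines of code. The following is a hedged sketch on invented predictions; the helper names are hypothetical and the numbers are unrelated to Fig. 3.

```python
from collections import defaultdict


def confusion_matrix(y_true, y_pred):
    """Nested dict: matrix[true_label][predicted_label] = count."""
    matrix = defaultdict(lambda: defaultdict(int))
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_accuracy(matrix):
    """Share of each true label's utterances that were predicted correctly."""
    return {t: row.get(t, 0) / sum(row.values()) for t, row in matrix.items()}


def most_confused_pair(matrix):
    """(rate, true_label, predicted_label) with the highest off-diagonal share."""
    best = None
    for t, row in matrix.items():
        total = sum(row.values())
        for p, count in row.items():
            if p != t and count:
                rate = count / total
                if best is None or rate > best[0]:
                    best = (rate, t, p)
    return best


# Invented predictions, for illustration only.
y_true = ["Dog", "Dog", "Player", "Player", "Player", "Others"]
y_pred = ["Dog", "Dog", "Player", "Others", "Others", "Others"]

matrix = confusion_matrix(y_true, y_pred)
print(per_class_accuracy(matrix))
print(most_confused_pair(matrix))
```

Normalising each row by its total is what allows a small class such as 'Dog' to be compared fairly with much larger ones.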
7 Conclusion and Future Work
In this paper we have proposed a new, transformer-based, machine learning approach for the speaker identification task. Even though most of the works for the task deal with audio and signal processing, for our work we opted to use textual data only. In order to determine how well transformer-based approach compares to methods described in other studies that have utilized textual data, we have replicated experiments described in them and used them as a baseline for our method. Our experiments have confirmed the findings from previous works by other authors [9] that contextual information is important for speaker identification, even in the domain of video game dialogues. Comparing the performance of different approaches we have seen an increase in most of the metrics when using utterance concatenation in order to provide more context. The corpora proved to be a good starting point for the task. However, additional steps can be taken in the future to further improve the results. For example an increase in size should lead to improved results, but combining different
commercial video games into a single corpus might lead to some level of confusion for the model. This is because certain character tropes are present across multiple different works of fiction (stereotypical villain phrases, for example). Additionally, the character classes present in the LIGHT dialogue dataset could be made more fine-grained. This might make it harder for the model to predict the correct class, as the classes would increase in number, but it would be interesting to ascertain any similarities between characters from the same background (e.g. elven race, warrior background, etc.). Results from our experiments suggest that it is possible to determine which utterance belongs to a certain character with high probability in the domain of video game dialogues. Because of the entertaining nature of the domain, certain traits tend to be exaggerated, which might help the transformer-based model to predict correctly. Observing the BERT-based method on transcripts of real-life conversations could also prove interesting. Additionally, because the dialogue dataset used for our experiments comes from a commercial video game that does not rely on text as its sole medium of storytelling, we believe that this approach can also yield good results on games that are more text-driven in nature. Although the combination of audio and visual data might be a more natural source for speaker identification, we have shown that textual data can be used with some success. It is our hope that this will lead to increased usage of textual data for the same task, and inspire more research on datasets that come from interactive storytelling mediums.
References
1. Bergsma, T., van Stegeren, J., Theune, M.: Creating a sentiment lexicon with game-specific words for analyzing NPC dialogue in The Elder Scrolls V: Skyrim. In: Workshop on Games and Natural Language Processing, Marseille, France, pp. 1–9. European Language Resources Association (2020)
2. Chalkidis, I., Fergadiotis, E., Malakasiotis, P., Androutsopoulos, I.: Large-scale multi-label text classification on EU legislation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6314–6322. Association for Computational Linguistics (2019)
3. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Minnesota, vol. 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics (2019)
5. Gordeev, D., Lykova, O.: BERT of all trades, master of some. In: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, Marseille, France, pp. 93–98. European Language Resources Association (ELRA) (2020)
6. Hazen, T., Jones, D., Park, A., Kukolich, L., Reynolds, D.: Integration of speaker recognition into conversational spoken dialogue systems (2003)
7. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1746–1751. Association for Computational Linguistics (2014)
8. Kundu, A., Das, D., Bandyopadhyay, S.: Speaker identification from film dialogues, pp. 1–4 (2012)
9. Ma, K., Xiao, C., Choi, J.: Text-based speaker identification on multiparty dialogues using multi-document convolutional neural networks, pp. 49–55 (2017)
10. Mairesse, F., Walker, M., Mehl, M., Moore, R.: Using linguistic cues for the automatic recognition of personality in conversation and text. J. Artif. Intell. Res. (JAIR) 30, 457–500 (2007)
11. Prechelt, L.: Early stopping - but when? In: Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 1524, pp. 55–69. Springer, Heidelberg (1998). https://doi.org/10.1007/3-540-49430-8_3
12. Serban, I.V., Pineau, J.: Text-based speaker identification for multi-participant open-domain dialogue systems (2015)
13. Sun, C., Myers, A., Vondrick, C., Murphy, K., Schmid, C.: VideoBERT: a joint model for video and language representation learning. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7463–7472 (2019)
14. Sztahó, D., Szaszák, G., Beke, A.: Deep Learning Methods in Speaker Recognition: a Review (2019)
15. Madabushi, H.T., Kochkina, E., Castelle, M.: Cost-sensitive BERT for generalisable sentence classification on imbalanced data. In: Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda, Hong Kong, China, pp. 125–134. Association for Computational Linguistics (2019)
16. Urbanek, J., et al.: Learning to speak and act in a fantasy text adventure game. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 673–683. Association for Computational Linguistics (2019)
17. Zhang, Y., Wallace, B.: A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. In: Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan, vol. 1: Long Papers, pp. 253–263. Asian Federation of Natural Language Processing (2017)
Automatic Monitoring and Analysis of Brands Using Data Extracted from Twitter in Romanian
Lucian Istrati¹,²(B) and Alexandra Ciobotaru³
¹ Cognience SRL, Bucharest, Romania [email protected]
² Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
³ Advanced Technologies Institute, Bucharest, Romania [email protected]
Abstract. We present a novel framework for brand monitoring and analysis, based on the available Twitter data in Romanian. The framework uses a sentiment analysis text classifier that labels Twitter posts as positive or negative, and which was trained and tested on a novel dataset of tweets in Romanian, labelled by the authors. We created and compared four adapted preprocessing pipelines, which generated four sets of data, on which we trained several machine learning models. Based on the evaluation metrics, a neural network using fastText has the best F1-score and accuracy, so this model was used in our proposed framework for brand monitoring and analysis. Our application computes various reputation scores, based on which it generates three kinds of reports: a reputation report for a single brand, a reputation report for an industry and a comparative reputation report for two companies, in a desired time frame.
Keywords: Microblogging · Artificial intelligence · Machine learning · Sentiment analysis · Brand monitoring · Twitter · Romanian · FastText
1 Introduction
Online reputation management deals with influencing and controlling the reputation of an organization or an individual by actively monitoring brand mentions on websites and social media. Online reputation (e-reputation) is perhaps the most valuable asset for an organization, be it a business, a public institution or a non-profit organization. A negative reputation can have undesirable effects, just as a positive reputation demonstrably increases corporate worth and provides sustained competitive advantage; further benefits are described in [34]. Twitter is currently an important social network, ranking 15th among the most used social networks worldwide as of July 2020, according to the statistics portal Statista¹. In Romania, the Twitter microblogging platform is not very popular: in July 2020 it registered approximately 77 thousand active users, compared to Facebook, which registered 18.6 million active users in the same month². However, our interest was to conduct an online reputation study based on techniques of Natural Language Processing and Understanding (NLP & NLU), using machine learning algorithms, which is why we chose the Twitter platform for our analysis, given that Twitter users mainly use text to formulate their opinions. Currently, there is no open-source tool to analyze the e-reputation of a company based on online Romanian texts that mention that company. NLP resources for the Romanian language are rather scarce and hard to aggregate and use, which is why this paper aims to address some of these needs without the use of translation engines from a language rich in open-source NLP tools, such as English. The creation of a brand analysis and monitoring tool involved two stages: the first was to find the best sentiment analysis algorithm on the dataset we created specifically for this task, and the second was to create an API (Application Programming Interface) that generates reputation scores for companies and for industries, and makes comparisons between two companies. No labelled dataset of Romanian tweets is openly available, so we created our own by manually labelling roughly 2000 Romanian tweets as positive or negative. We also created a hybrid preprocessing pipeline that uses custom lists to generate understandable content from emoticons and Romanian internet slang. The manually labelled corpus was passed through this pipeline and several machine learning classifiers were trained on the data. In order to validate the usefulness of the pipeline, we generated three additional datasets with various degrees of preprocessing. Experiments show that our proposed initial pipeline is valid, and the algorithm based on fastText word embeddings [19] proved to have the best results, so it was chosen for the sentiment analysis that lies at the base of our brand monitoring and analysis tool. Further, we created a Flask API where we deployed our brand monitoring and analysis application. With such a brand-monitoring tool, which analyses texts in Romanian, local as well as international companies can view their e-reputation score, understand what caused certain reactions among users, and take timely measures to adjust and enhance their e-reputation when necessary.
The original version of this chapter was revised: Affiliations of the authors “Alexandra Ciobotaru” and “Lucian Istrati” have been corrected. The correction to this chapter is available at https://doi.org/10.1007/978-3-030-82199-9_61
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022, corrected publication 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 55–75, 2022. https://doi.org/10.1007/978-3-030-82199-9_5
¹ https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/.
² https://gs.statcounter.com/social-media-stats/all/romania.
2 Recent Works
The main tool used to analyse a company’s reputation is usually sentiment analysis. As described generally in [3,12] and in more detail in [24], there are several approaches to the sentiment analysis problem, which mainly fall into three categories: lexicon-based approaches, machine learning-based techniques and hybrid approaches [33]. Lexicon-based approaches mainly use dictionaries like SentiWordnet [5] or SO-CAL [32] and are rule based [35], while machine learning techniques use labelled or unlabelled data to train a classifier. When the data is unlabelled, the technique is unsupervised and the primary algorithm that could be used is K-means (see [26]), while with labelled data the process is called supervised and is generally more reliable, provided the data is labelled responsibly. For English, state-of-the-art results were achieved by Baziotis et al. [6], who applied a deep Bi-LSTM augmented with two kinds of attention mechanisms, and by Mathieu Cliche [9], who used an ensemble of LSTMs and CNNs. Both used the SemEval-2017 Twitter dataset for task 4, subtask A of the SemEval-2017 competition, which is described by Rosenthal et al. [28]. Besides the state-of-the-art results, other works include classical machine learning algorithms based on decision trees, like Adaboost [15], Random Forest [4], or simple Decision Trees [31]. A Support Vector Machine-based approach was used by Ahmad et al. [1], and a Naive Bayes one in [18]. Another important method, which we experimented with and which achieved good results, is a neural network with fastText embeddings [7], previously used in [2] to classify tweets as flu-related or not. A comparative study between traditional machine learning and deep learning approaches for text classification is given by Kamath et al. [20]. For Romanian, Buzea et al. [8] describe a “three word level” approach which analyzes the polarity of a text with three different subsystems, at sentence level, document level and aspect level.
Other works include that of Marcu and Danubianu [25], who describe an application of the “Ekman” and “Plutchik” models to a dataset of educational system reviews given by Romanian high school students, and that of Bădica et al. [11], who describe an algorithm based on a dependency parsing approach for sentiment analysis. Regarding brand analysis, we found in the literature a tool for generating e-reputation scores, explained in [30], where Algerian tweets are preprocessed and then classified as positive, neutral or negative with Logistic Regression and Support Vector Machine algorithms trained on manually labelled data. According to this paper, e-reputation is the perception that Internet users have of the company, brand or people who collaborate (managers, employees), and which is potentially visible on many media on the net. Also, 66% of consumers seek advice before buying a product, and 96% of the consumers who seek advice are influenced by the e-reputation of a brand during a purchase. Younas and Owda [37] present a lexicon-based method for analysing the sentiment of tweets regarding BBC news articles. Dutot and Castellano [14] suggested the need to analyse a brand considering “social media”, as well as other components such as “brand characteristic” and the quality of the services or of the website of that brand. For our case study, we opted for the analysis of brands based on the Twitter social media platform. Another article adjacent to the field of brand analysis is the work of Hroux et al. [17], who measure innovation by using web-based indicators of four core concepts (R&D, IP protection, collaboration, and external financing) on data gathered from 79 corporate websites of Canadian nanotechnology and advanced materials firms, using keyword frequency analysis.
3 Methodology
First, we extracted tweets in Romanian from Twitter and manually annotated them to create a dataset for training a sentiment analysis model. Then we passed these tweets through the preprocessing pipeline and trained a number of sentiment analysis models in order to find the best one for the task. The steps involved in this process are represented in Fig. 1. After choosing the best model for the sentiment analysis task, the next step was to create an API for brand monitoring and analysis that uses this model as a base for computing reputation scores.
Fig. 1. Process of creating the sentiment analysis model
3.1 Data Collection
Initially, data collection was performed using the Twitter API. However, the Twitter API proved to have a significant drawback: it would not allow the extraction of tweets older than two weeks. Therefore, we searched for another tool that could scrape longer periods of time, and the solution found was the Python package GetOldTweets3 (footnote 3), which could scrape older tweets, even from a couple of years ago. The ratio between the number of tweets obtained with the Twitter API and the number of tweets obtained with the GetOldTweets3 package was approximately 1:9. Further, we joined the scraped tweets and passed them to a language detection module (footnote 4) in order to filter out non-Romanian messages.

3. https://pypi.org/project/GetOldTweets3/
4. https://pypi.org/project/langdetect/
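The language filtering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not necessarily the code used in this work: the function name `filter_romanian`, the sample tweets and the stand-in detector `toy_detect` are hypothetical, and the detector is passed in as a parameter so that the sketch runs without third-party packages.

```python
from typing import Callable, Iterable, List


def filter_romanian(tweets: Iterable[str], detect: Callable[[str], str]) -> List[str]:
    """Keep only the tweets whose detected language code is 'ro'."""
    kept = []
    for text in tweets:
        try:
            language = detect(text)
        except Exception:
            # Detectors can fail on texts without usable features (e.g. only a URL).
            continue
        if language == "ro":
            kept.append(text)
    return kept


def toy_detect(text: str) -> str:
    """Hypothetical stand-in detector, used only to keep the sketch self-contained."""
    romanian_markers = {"este", "nu", "la", "si"}
    return "ro" if romanian_markers & set(text.lower().split()) else "en"


# Tweets from both sources are joined first, then filtered by language.
api_tweets = ["Cafeaua este buna", "Great coffee today"]
scraped_tweets = ["Nu mai merg la ei"]
romanian_tweets = filter_romanian(api_tweets + scraped_tweets, toy_detect)
```

With the langdetect package installed, its `detect` function could be passed as the `detect` argument instead of the toy detector.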
Automatic Monitoring and Analysis of Brands
3.2 Data Labelling
The creation of a sentiment analysis tool for Romanian also involved preparing a dataset for this purpose, as we did not find an open-source dataset of annotated Romanian tweets for sentiment analysis. We also conducted a round of experiments involving transfer learning from a model trained on a dataset of product and movie reviews in Romanian; however, we found that the structure of tweets differs from that of reviews (see Sect. 4.2). We observed that in previous sentiment analysis tasks most of the labels used for annotation were binary (positive/negative), while the final classification made use of the neutral label as well. The neutral category can be derived from the positive/negative categories using thresholds, as Salhi et al. explain in [30]. The data collected from Twitter, as described in Sect. 3.1, contained approximately 9200 tweets, which were manually annotated with positive/negative labels by the authors, based on the rule:

$$
Tweet = \begin{cases}
0, & \text{if tweet is negative} \\
1, & \text{if tweet is positive} \\
\text{eliminated}, & \text{otherwise}
\end{cases} \tag{1}
$$

After the labelling process ended, 2116 tweets remained, labelled as positive or negative. After eliminating duplicates as well, fewer tweets remained, which needed to be further balanced in order to have an approximately equal number of positive and negative tweets in the final dataset. In Table 1 we show a sample of labelled tweets from the Consumer services industry.

3.3 Balancing Data and Creating the Final Set of Training Data
After the labelling process was done, the dataset contained roughly 900 negative tweets and 1250 positive tweets. To prevent the model from being biased towards the majority class, we needed to balance the number of positive and negative tweets, and in order to do so we applied the following steps:
– Scraped a few more tweets and added only the negative ones to the dataset.
– Re-evaluated the already labelled positive tweets and removed the ones that conveyed less useful information for the model than other, more relevant tweets.
– Eliminated further positive tweets that had a pairwise cosine similarity of 0.65 or higher, as many of them expressed joy about similar things, like beautiful mornings or good coffee.
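The similarity-based filtering in the last step can be sketched as follows. The text does not state how tweets were vectorised, so this sketch assumes plain bag-of-words count vectors; the function names and sample tweets are hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors (an assumed encoding)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[token] * vb[token] for token in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(tweets: List[str], threshold: float = 0.65) -> List[str]:
    """Keep a tweet only if it is below the threshold against every tweet already kept."""
    kept: List[str] = []
    for tweet in tweets:
        if all(cosine_similarity(tweet, other) < threshold for other in kept):
            kept.append(tweet)
    return kept


positives = [
    "ce dimineata frumoasa cu cafea buna",
    "ce dimineata frumoasa cu cafea",   # near-duplicate of the first tweet
    "oferta excelenta la magazin",
]
filtered = drop_near_duplicates(positives)
```

Here the second tweet shares five of its tokens with the first one (similarity of about 0.91) and is dropped, while the third tweet has no tokens in common with the others and is kept.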
3.4 Data Preprocessing
The preprocessing pipeline created consists of three main parts: noise removal, normalisation and finally, tokenization and stemming. Each element of the
scheme presented in Fig. 2 represents a text operation applied to the extracted tweets. The order in which these operations are performed is important. Table 1 shows the tweets as obtained right after scraping, with their English translations, and the same tweets after exiting the preprocessing pipeline.
Fig. 2. Text preprocessing pipeline: a. Noise removal; b. Normalisation; c. Tokenizing & stemming.
Noise Removal. Noise removal is about removing special characters, digits, and generally any piece of text that can interfere with the text analysis task at hand, in our case sentiment analysis. We applied the following noise removal steps:
– Eliminating every tweet that was not in Romanian;
– Eliminating URLs, digits, usernames and HTML tags, and replacing “#” with a space, as these elements do not provide information regarding the polarity of a tweet. This step was done using regular expressions. Replacing “#” with a space, rather than simply deleting it, is important because of the frequent use of combined hashtags like “#First#Second”. Most of the time hashtags are words that can add value to the sentiment analysis task, and for this reason it was important to retain the information they carry;
– Replacing slang words and abbreviations with proper words. This step was done with the help of a custom list containing common abbreviations and slang words in both English and Romanian, together with the standard form that best describes their meaning. This custom list was built with the help of the lists found in [27] and contains 284 definitions. A small set of examples, along with English translations where necessary, is presented in Table 2.
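The noise removal steps above can be sketched with regular expressions as follows. The patterns, the function name `remove_noise` and the sample tweet are assumptions made for illustration; the slang dictionary contains only a few of the entries listed in Table 2, not the full list of 284 definitions.

```python
import re

# A few entries taken from Table 2; the full custom list is not reproduced here.
SLANG = {"pt": "pentru ca", "scz": "scuze", "cf": "ce faci", "nice": "dragut"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
USERNAME_RE = re.compile(r"@\w+")
DIGIT_RE = re.compile(r"\d+")


def remove_noise(tweet: str) -> str:
    """Strip URLs, HTML tags, usernames and digits, split hashtags, expand slang."""
    text = URL_RE.sub(" ", tweet)        # URLs first, since they contain digits
    text = HTML_TAG_RE.sub(" ", text)
    text = USERNAME_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = text.replace("#", " ")        # "#First#Second" -> " First Second"
    words = [SLANG.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


cleaned = remove_noise("scz @user123 <b>oferta</b> #First#Second 50 https://example.com/x")
```

Replacing "#" with a space rather than deleting it is what keeps "First" and "Second" as two separate words in the result.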
Table 1. Tweets before and after the preprocessing stage, with English translations. Label “0” means “negative”; label “1” means “positive”.

| Tweet | Translated tweet | Processed tweet | Label |
|---|---|---|---|
| Mâncareeee (@Company A in București) https://www.swarmapp.com/c/3CQj6tuTkIJ | Fooooood (@Company A in București) https://www.swarmapp.com/c/3CQj6tuTkIJ | [‘manc’] | 1 |
| Daca esti in #cluj in perioada #untold poti sa te bucuri de ofertele speciale de poveste de la Company A :D... https://instagram.com/p/ 51yqPhi0K9 | If you are in #cluj during #untold you can enjoy the Company A’s legendary special offers :D... https://instagram.com/p/ 51yqPhi0K9 | [‘dac’, ‘perioad’, ‘untold’, ‘bucur’, ‘ofert’, ‘special’, ‘povest’] | 1 |
| Company C revine la AFI, la intrarea Galaxy. Mișto localizarea noului spațiu. @AFI Palace Cotroceni https://instagram.com/p/ 3lR2 BA qMj/ | Company C comes back to AFI, at Galaxy entrance. It’s really cool the way they organized their space. @AFI Palace Cotroceni https://instagram.com/p/ 3lR2 BA qMj/ | [‘revin’, ‘afi’, ‘intrar’, ‘galaxy’, ‘misto’, ‘localiz’, ‘noul’, ‘spatiu’, ‘palac’, ‘cotrocen’] | 1 |
| Indiferență. @Company C Unirii https://www.instagram.com/p/BQ ASem3BVt7/ | Indifference. @Company C Unirii https://www.instagram.com/p/BQ ASem3BVt7/ | [‘indiferent’, ‘unir’] | 0 |
| Pe Company D sunetul e decalat de imagine | On Company D the sound is out of sync with the image | [‘sunet’, ‘decal’, ‘imagin’] | 0 |
| Traducere la Company D: to find the killer a fost tradus cu sa ucidem criminalul. Gg | Translation on Company D: “to find a killer” was translated to “let’s kill the criminal”. Gg | [‘traduc’, ‘Company D’, ‘find’, ‘the’, ‘kiler’, ‘tradus’, ‘ucid’, ‘criminal’] | 0 |
Table 2. Some of the abbreviations and replacements done in the noise removal stage, with English translations (in brackets).

| Abbreviation or slang word | Replacement (English translation) |
|---|---|
| pt | pentru ca (because) |
| fain | bine (fine) |
| kil | kilogram |
| pct | punct (point) |
| cf | ce faci (how are you) |
| npt | noapte (night) |
| really | chiar |
| scz | scuze (sorry) |
| nice | dragut (nice) |
| csf | ce sa faci (what to do) |
| zenviciuri | sandviciuri (sandwiches) |
| cute | dragut (nice) |
| nui | nu-i (there isn’t) |
| c | ce (what) |
| sendviciuri | sandviciuri (sandwiches) |
| faqt | facut (done) |
Table 3. Samples of emojis and emoticons labelled “bun”/“rau” (meaning “good”/“bad”).

| Category | Samples |
|---|---|
| Emoji samples labelled as “bun” (meaning “good”) | [emoji glyphs not preserved] |
| Emoticon samples labelled as “bun” (meaning “good”) | :) ;) :-) ;-) =) :’) :–) ::) |
| Emoji samples labelled as “rau” (meaning “bad”) | [emoji glyphs not preserved] |
| Emoticon samples labelled as “rau” (meaning “bad”) | :( ;( :-( ;-( =( :’( :–( ::( :=( |
Normalisation. We normalized the dataset by automatically:
– Eliminating diacritics;
– Normalizing words that contain repeated letters and bringing them to a grammatically correct form (a method also presented in [16]). Doubled consonants were not taken into account, because Romanians emphasise text by prolonging only vowels. In order to decide which vowels, and how many of them, were to be eliminated, a backtracking procedure was used to generate all possible subsets of positions that contained vowels and to eliminate every such subset of letters, until a proper word was found;
– Normalising words written in CamelCase, by extracting the component words from this style of writing, which is mainly used in hashtags. For instance, after normalisation, “OneWordCombined” becomes “One”, “Word”, “Combined”;
– Converting all words to lowercase;
– Changing the emojis/emoticons to one of the labels “bun” (meaning “good”) or “rau” (meaning “bad”). In order to do this, we created a dictionary containing 510 emojis and emoticons, based on a dataset described in [36]. The inspiration came from the work of Joshi and Deshpande [22], as well as from the work of Khan et al. [21], who present a Twitter opinion mining framework that takes emoticons into consideration at the preprocessing level. The translating word for an emoji or emoticon was selected using the following principle: if the number of tweets in which an emoji/emoticon was used in a positive context was larger than the number of tweets in which it was used in a negative/neutral context, the emoji/emoticon was given the label “bun” (“good”). Otherwise, it was considered “rau” (“bad”). Table 3 provides samples of emojis and emoticons translated as “bun”/“rau” in the dataset;
– Deleting all remaining punctuation marks, as well as special characters, using regular expressions. Punctuation marks were not eliminated up to this point because some of them could have been part of an emoticon;
– Eliminating names of geographical areas and companies. These proper nouns do not carry any sentiment information and thus we eliminate them, so that they do not bias the training process. In order to do so, we created lists containing the names of cities and geographical regions that appeared in the tweets, as well as names of companies;
– Eliminating stop-words. We removed a custom list of stop-words obtained by combining the stop-word lists of the Python packages spaCy (footnote 5) and nltk (footnote 6), deciding for each word separately whether it brings value to the sentiment analysis task at hand or not, inspired by the research conducted in [29].

Tokenization and Stemming. The final step of the preprocessing stage consisted in splitting strings into tokens and performing stemming using nltk’s SnowballStemmer (footnote 7) for Romanian.
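Three of the normalisation steps listed above (removing diacritics, splitting CamelCase, and reducing prolonged vowels by backtracking) can be sketched as follows. This is an illustrative sketch rather than the implementation used in this work: the function names are hypothetical, and the `vocabulary` argument stands in for whatever word list decides that “a proper word was found”, which the text does not specify.

```python
import re
import unicodedata
from typing import List, Optional, Set

VOWELS = set("aeiou")


def strip_diacritics(text: str) -> str:
    """Decompose characters and drop the combining marks (e.g. 'ș' -> 's')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_camel_case(word: str) -> List[str]:
    """'OneWordCombined' -> ['One', 'Word', 'Combined']."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)


def reduce_elongation(word: str, vocabulary: Set[str]) -> str:
    """Backtrack over subsets of repeated-vowel positions until a known word appears."""
    if word in vocabulary:
        return word
    # Only vowels that repeat the previous character are candidates for deletion.
    candidates = [i for i in range(1, len(word))
                  if word[i] in VOWELS and word[i] == word[i - 1]]

    def backtrack(index: int, removed: Set[int]) -> Optional[str]:
        if index == len(candidates):
            attempt = "".join(ch for i, ch in enumerate(word) if i not in removed)
            return attempt if attempt in vocabulary else None
        removed.add(candidates[index])           # branch 1: delete this vowel
        found = backtrack(index + 1, removed)
        removed.discard(candidates[index])
        if found is not None:
            return found
        return backtrack(index + 1, removed)     # branch 2: keep this vowel

    result = backtrack(0, set())
    return result if result is not None else word  # unchanged if nothing matches


plain = strip_diacritics("Mâncare în București")
parts = split_camel_case("OneWordCombined")
fixed = reduce_elongation("mancareeee", {"mancare"})
```

The search is exponential in the number of repeated vowels, which is acceptable for single words; doubled consonants are never touched, in line with the rule stated above.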
4 Experiments and Results
First we tried to use K-means clustering on the scraped and preprocessed data, but the results proved not to be relevant for sentiment analysis. Then we tried transfer learning, as explained in Sect. 4.2, and lastly we prepared a novel dataset by manually annotating tweets in Romanian for sentiment analysis, as explained in Sect. 3.2. Using our novel dataset, we analysed and compared several classification models, which fall into three categories: traditional machine learning models, neural networks, and neural networks with contextual word embeddings. In the search for the best sentiment analysis model, the experimental setup also varied the amount of preprocessing in the pipeline described in Fig. 2, in order to test the preprocessing techniques as well. We created four datasets:
– Dataset A was made without replacing slang and abbreviations and also without labelling emojis/emoticons;
– Dataset B was made with full preprocessing;
– Dataset C was made without labelling emojis/emoticons;
– Dataset D was made without replacing slang and abbreviations.
5. https://github.com/explosion/spaCy/blob/master/spacy/lang/ro/stop_words.py
6. https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip
7. https://www.nltk.org/_modules/nltk/stem/snowball.html
Fig. 3. Distribution of tokens in tweets, with underlying smoothness.
4.1 Datasets Exploration
The distributions of tokens per tweet, by label, for each dataset described above are shown in Fig. 3. In Table 4 we report common descriptive statistics for each subset of the datasets created. Both the statistics in Table 4 and the graphs in Fig. 3 were computed and plotted in R. Table 4 shows that the mean is larger than the median for every subset, which indicates that the distributions are asymmetric (right-skewed). There are also differences at the peaks, the distributions of the “negative” subsets being flatter than the “positive” ones. The typical tweet length also differs between labels: in Table 4 both the median and the mean number of tokens are higher for the “negative” subsets than for the “positive” ones, suggesting that users tend to use more words when expressing bad opinions than when speaking favourably.
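The statistics reported in Table 4 were computed in R; an equivalent computation can be sketched in Python as follows. The function name `describe` and the token counts are hypothetical and serve only to illustrate the computation and the mean-versus-median comparison used above.

```python
import statistics
from typing import Dict, List


def describe(token_counts: List[int]) -> Dict[str, float]:
    """Descriptive statistics of the number of tokens per tweet."""
    return {
        "min": min(token_counts),
        "median": statistics.median(token_counts),
        "mean": statistics.mean(token_counts),
        "max": max(token_counts),
        # Sample variance (n - 1 denominator), the same convention as R's var().
        "variance": statistics.variance(token_counts),
        "std": statistics.stdev(token_counts),
    }


# Hypothetical token counts for four preprocessed tweets.
token_counts = [1, 2, 3, 10]
stats = describe(token_counts)

# A mean above the median points to a right-skewed distribution.
right_skewed = stats["mean"] > stats["median"]
```

For these counts the median is 2.5 and the mean is 4, so the single long tweet pulls the mean above the median, which is the same pattern observed for every subset in Table 4.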
4.2 Transfer Learning Experiment
In order to label the collected data we first tried to use transfer learning with a Romanian sentiment analysis engine (footnote 8). In this project, the sentiment analysis classifier was trained on a dataset of product and movie reviews written in Romanian. The main problem was that the structure of microblogging data differs from that of a product or movie review. This was observed after looking at what the engine predicted for our dataset: 17.78% of the positive tweets were misclassified as negative, while 52.44% of the negative tweets were misclassified as positive.

8. https://github.com/katakonst/sentiment-analysis-tensorflow

Table 4. Descriptive statistics of token distributions in sentences.

| Dataset | Minimum | Median | Mean | Max | Variance | Std. deviation |
|---|---|---|---|---|---|---|
| Dataset A, ‘negative’ labels | 1.000 | 8.000 | 9.135 | 28.000 | 30.87228 | 5.556283 |
| Dataset A, ‘positive’ labels | 1.000 | 7.000 | 7.834 | 32.000 | 23.20206 | 4.816852 |
| Dataset B, ‘negative’ labels | 1.000 | 8.000 | 9.207 | 28.000 | 30.27687 | 5.502442 |
| Dataset B, ‘positive’ labels | 1.000 | 7.000 | 8.039 | 32.000 | 22.76176 | 4.770929 |
| Dataset C, ‘negative’ labels | 1.000 | 8.000 | 9.138 | 28.000 | 30.60687 | 5.532347 |
| Dataset C, ‘positive’ labels | 1.000 | 7.000 | 7.843 | 28.000 | 22.82466 | 4.777516 |
| Dataset D, ‘negative’ labels | 1.000 | 8.000 | 9.190 | 28.000 | 30.55641 | 5.527785 |
| Dataset D, ‘positive’ labels | 1.000 | 7.000 | 8.037 | 32.000 | 23.21163 | 4.817845 |

4.3 Training Supervised Machine Learning Models
The train/test split ratio used for all models was 1/5. For the traditional algorithms we used CountVectorizer encoding, an n-gram range of (1, 2) and off-the-shelf sklearn (footnote 9) classifiers. Some models were optimised by varying hyperparameters. For the Support Vector Machines model, the best results were found using the sigmoid function as kernel and gamma set to ’scale’. For the Perceptron model, the best results were found with l2 regularization and a maximum of 100 epochs. All the other parameters, as well as all the parameters of the other traditional models tried, were left at their default values. For the neural networks, we used the Keras API for TensorFlow, with a Keras embedding layer, Softmax activation function, categorical cross-entropy loss function and the Adam optimizer. For sentiment classification with BERT word embeddings, we loaded the Romanian BERT model [13] from Huggingface (footnote 10). We fine-tuned the BERT model by adding a classifier for our binary sentiment analysis task, comprising a drop-out layer for regularization and a fully connected output layer with Softmax activation. Training was done using the cross-entropy loss function and the AdamW optimizer. For our last sentiment classifier, the one using fastText word embeddings [7], we used the supervised training method from the fastText Python library. We tuned the model by modifying the learning rate and the number of epochs. The best results were achieved when training a model with 10000 epochs and a learning rate of 0.01. The number of word n-grams was set to 2, for a fair comparison with the other trained models.

9. https://scikit-learn.org/stable/
10. https://huggingface.co/dumitrescustefan/bert-base-romanian-cased-v1

Table 5. Efficiency metrics computed for the tried models.
Confusion matrices are written row by row, with the two rows separated by “/”.

| Model | Confusion matrix | Precision | Recall | F1-score | Accuracy |
|---|---|---|---|---|---|
| AdaBoost | 160 29 / 88 115 | 79.86 | 56.65 | 66.28 | 70.15 |
| Decision tree | 159 30 / 86 117 | 79.59 | 57.63 | 66.85 | 70.40 |
| k-NN | 43 146 / 31 172 | 54.08 | 84.72 | 66.02 | 54.84 |
| Logistic regression | 153 36 / 42 161 | 81.72 | 79.31 | 80.50 | 80.10 |
| Bernoulli Naive Bayes | 141 48 / 30 173 | 74.60 | 82.46 | 78.33 | 80.10 |
| Multinomial Naive Bayes | 161 28 / 47 156 | 85.19 | 77.40 | 81.11 | 80.87 |
| Perceptron | 155 34 / 56 147 | 82.01 | 73.46 | 77.50 | 77.04 |
| Random forest | 110 79 / 28 175 | 58.20 | 79.71 | 67.28 | 72.70 |
| Stochastic gradient descent | 149 40 / 40 163 | 78.84 | 78.84 | 78.84 | 79.59 |
| Support vector machines | 141 48 / 34 169 | 74.60 | 80.57 | 77.47 | 79.09 |
| XGBoost | 128 61 / 44 159 | 67.72 | 74.42 | 70.91 | 73.21 |
| ANN | 154 35 / 45 158 | 81.86 | 77.83 | 79.79 | 79.59 |
| CNN | 95 94 / 31 172 | 64.66 | 84.72 | 73.34 | 68.11 |
| LSTM | 128 61 / 53 150 | 71.09 | 73.89 | 72.46 | 70.91 |
| NN with BERT | 173 59 / 26 141 | 74.57 | 86.93 | 80.23 | 79.95 |
| NN with fastText | 159 44 / 26 163 | 78.33 | 85.95 | 81.96 | 82.14 |

4.4 Comparison Between Models on Different Preprocessing Pipelines
At first we trained the models using Dataset B. For each created model we computed: confusion matrix, precision, recall, F1-score, accuracy. The results are shown in Table 5.
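The relation between a confusion matrix and the four metrics can be sketched as follows. The sketch assumes the scikit-learn layout, in which the first row holds the true negatives and false positives and the second row the false negatives and true positives; this reading is an assumption, chosen because it reproduces the AdaBoost row of Table 5. The function name is hypothetical.

```python
from typing import Dict


def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Precision, recall, F1-score and accuracy, as percentages rounded to 2 decimals."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    metrics = {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
    return {name: round(100 * value, 2) for name, value in metrics.items()}


# AdaBoost confusion matrix from Table 5: 160 29 / 88 115.
adaboost = classification_metrics(tn=160, fp=29, fn=88, tp=115)
```

For the AdaBoost matrix this gives a precision of 79.86, a recall of 56.65, an F1-score of 66.28 and an accuracy of 70.15, matching the values listed in Table 5.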
Further, we chose the algorithms that had the highest accuracy in Table 5 to train three more sets of models, on Datasets A, C and D, in order to evaluate the preprocessing pipeline described in Fig. 2. The results in Table 6 show that the neural network with fastText word embeddings still has its best performance on Dataset B, while the neural network with BERT contextual word embeddings performs better with less preprocessing. It can also be observed that Dataset B has the highest number of best results out of all four datasets created (results written in bold). For the fastText model, accuracy was lowest on Dataset A (76.14%), with no preprocessing for emojis and slang words; adding either of these two preprocessing methods separately brought some improvement (76.34% on Dataset C and 79.34% on Dataset D), while combining both gave the best result on Dataset B (82.14%), which suggests that the more information we extract from tweets, the better the results. As the best results were generally achieved on Dataset B, we chose the fastText-based model trained on Dataset B as the best for our sentiment analysis task, because it achieved the highest accuracy and the highest F1-score on this dataset. This model was further used in our proposed brand monitoring and analysis application.
5 Brand Analysis Framework
After choosing the optimal model for the sentiment analysis task, the next step was to create the brand monitoring application that makes use of this model, which also makes it possible to see how well the model predicts on previously unseen data. The whole functional process of the application is described in Fig. 4.
Fig. 4. Brand reputation API framework
First, on the homepage of the application, the user can choose what to analyze: a company, an industry, a comparison of two companies, or even the polarity of a single tweet. The user has the option to create an analysis report which provides useful information about the way companies are perceived, how the online sentiment regarding the brand evolved on a monthly basis, how the number of tweets evolved, how the brand is doing compared to its peers in the industry, and what the main topics of discussion are in tweets that mention the brand.
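Table 6. Efficiency metrics computed for the best models for datasets with different types of preprocessing. Confusion matrices are written row by row, with the two rows separated by “/”.

| Model | Metric | Dataset A | Dataset B | Dataset C | Dataset D |
|---|---|---|---|---|---|
| Logistic regression | Confusion matrix | 122 59 / 36 177 | 153 36 / 42 161 | 131 74 / 23 165 | 134 61 / 29 168 |
| Logistic regression | Precision | 75.00 | 81.72 | 69.03 | 73.36 |
| Logistic regression | Recall | 83.09 | 79.31 | 87.76 | 85.27 |
| Logistic regression | F1-score | 75.88 | 80.50 | 75.31 | 77.04 |
| Logistic regression | Accuracy | 78.84 | 80.10 | 77.28 | 78.87 |
| Bernoulli Naive Bayes | Confusion matrix | 127 54 / 27 186 | 141 48 / 30 173 | 126 79 / 17 171 | 136 59 / 22 175 |
| Bernoulli Naive Bayes | Precision | 77.50 | 74.60 | 68.40 | 74.78 |
| Bernoulli Naive Bayes | Recall | 87.32 | 82.46 | 90.95 | 88.83 |
| Bernoulli Naive Bayes | F1-score | 79.44 | 78.33 | 75.57 | 79.33 |
| Bernoulli Naive Bayes | Accuracy | 82.11 | 80.10 | 78.08 | 81.20 |
| Multinomial Naive Bayes | Confusion matrix | 149 32 / 47 166 | 161 28 / 47 156 | 151 54 / 36 152 | 157 38 / 29 168 |
| Multinomial Naive Bayes | Precision | 83.83 | 85.19 | 73.78 | 81.55 |
| Multinomial Naive Bayes | Recall | 77.93 | 77.40 | 80.85 | 85.27 |
| Multinomial Naive Bayes | F1-score | 79.94 | 81.11 | 77.09 | 82.90 |
| Multinomial Naive Bayes | Accuracy | 80.77 | 80.87 | 77.15 | 83.37 |
| NN with BERT | Confusion matrix | 163 50 / 15 166 | 173 59 / 26 141 | 156 32 / 42 163 | 163 34 / 24 168 |
| NN with BERT | Precision | 76.53 | 74.57 | 82.98 | 82.74 |
| NN with BERT | Recall | 91.57 | 86.93 | 78.79 | 87.17 |
| NN with BERT | F1-score | 83.38 | 80.23 | 80.83 | 84.90 |
| NN with BERT | Accuracy | 83.50 | 79.95 | 81.17 | 85.09 |
| NN with fastText | Confusion matrix | 172 41 / 53 128 | 159 44 / 26 163 | 157 31 / 62 143 | 161 36 / 45 150 |
| NN with fastText | Precision | 80.75 | 78.33 | 83.51 | 81.73 |
| NN with fastText | Recall | 76.44 | 85.95 | 71.69 | 78.16 |
| NN with fastText | F1-score | 78.54 | 81.96 | 77.15 | 79.90 |
| NN with fastText | Accuracy | 76.14 | 82.14 | 76.34 | 79.34 |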
Based on the user’s selection, the parameters for the analysis are set: company name, industry, the period for which the analysis is desired, etc. These parameters are then used in the scraping process. After gathering all tweets from Twitter for the desired period and the specific company/companies/industry, along with their numbers of likes and retweets, the tweets enter the preprocessing pipeline described in Fig. 2. Next, the output of the preprocessing pipeline goes through the inference stage, where the trained model chosen in Sect. 4.4 is used to predict the sentiment of each tweet.
Tweets are automatically labelled positive/neutral/negative, based on Eq. 2:

$$
Tweet = \begin{cases}
\text{positive}, & \text{if model score} > 66\% \\
\text{negative}, & \text{if model score} < 66\% \\
\text{neutral}, & \text{otherwise}
\end{cases} \tag{2}
$$

Also, inspired by the work of Colhon et al. [10], we took negations into account by shifting the polarity of the sentiment when the word “nu” (meaning “no”) was found at the beginning of a tweet.

5.1 eReputationScore Calculation
After determining the sentiment of the tweets involving a certain company/industry, we need to compute some metrics in order to generate reputation reports. First and foremost, we introduce a concept called “Influence Score”. This is defined as the degree of significance a tweet about a brand has when compared to the rest of the tweets regarding the same brand. The following formula was used to calculate the influence score of each tweet:

$$
IS(t) = 1 + \frac{3 \cdot R(t) + L(t)}{4} \tag{3}
$$

where:
R(t) = number of retweets of tweet t;
L(t) = number of likes of tweet t.

A tweet carries important additional data, such as the number of likes and retweets. Socially speaking, and as described in [23], when somebody retweets a tweet it is considered that the person agrees with its content, whereas giving a like does not imply the same level of commitment. This is the reason why, in Eq. 3, the number of retweets has a higher weight than the number of likes. The +1 is added in order to prevent the influence score of a tweet from having the value 0, which would make it erroneously irrelevant when calculating the whole E-Reputation Score. The Influence Score has no upper bound. We define a tweet as $t_{i,j,k}$, where i is the type of tweet (1 negative, 2 neutral, 3 positive), j is the month in which the tweet was sent, and k is the index of the tweet. Let A be a 3×N matrix, the ScoresMatrix, where N is the number of months we are interested in, with its elements defined by Eq. 4:

$$
a_{i,j} = \frac{\sum_{k=1}^{n_{i,j}} IS(t_{i,j,k})}{\sum_{\alpha=1}^{3} \sum_{k=1}^{n_{\alpha,j}} IS(t_{\alpha,j,k})} \tag{4}
$$

where:
$IS(t_{i,j,k})$ is the Influence Score calculated with Eq. 3;
$n_{i,j}$ is the number of tweets from category i for month j, so that:
$n_{1,j}$ is the number of negative tweets for month j;
$n_{2,j}$ is the number of neutral tweets for month j;
$n_{3,j}$ is the number of positive tweets for month j.

Next, we compute the ScoresList using Eq. 5, by averaging each row of matrix A, resulting in a vector L of shape [3,1]:

$$
L = \frac{1}{N} \begin{pmatrix} \sum_{j=1}^{N} a_{1,j} \\ \sum_{j=1}^{N} a_{2,j} \\ \sum_{j=1}^{N} a_{3,j} \end{pmatrix} \tag{5}
$$

Lastly, we compute the reputation score by subtracting the first element of this vector from the third (zero-based indices are used below):

$$
EReputationScore = L[2] - L[0] \tag{6}
$$
The whole process is represented in Fig. 5.
Fig. 5. EReputationScore computation.
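The computation summarised in Fig. 5 can be sketched end to end as follows, following Eqs. 3 to 6. The data layout (tuples of category, month, retweets, likes), the function names and the sample tweets are assumptions made for illustration; months without any tweets are left at 0, which is also an assumption, since this case is not covered by Eq. 4.

```python
from typing import Iterable, List, Sequence, Tuple

NEGATIVE, NEUTRAL, POSITIVE = 0, 1, 2  # zero-based rows of the ScoresMatrix

Tweet = Tuple[int, str, int, int]  # (category, month, retweets, likes)


def influence_score(retweets: int, likes: int) -> float:
    """Eq. 3: retweets weigh three times as much as likes; the minimum is 1."""
    return 1 + (3 * retweets + likes) / 4


def scores_matrix(tweets: Iterable[Tweet], months: Sequence[str]) -> List[List[float]]:
    """Eq. 4: share of the monthly influence held by each sentiment category."""
    column = {month: j for j, month in enumerate(months)}
    totals = [[0.0] * len(months) for _ in range(3)]
    for category, month, retweets, likes in tweets:
        totals[category][column[month]] += influence_score(retweets, likes)
    matrix = [[0.0] * len(months) for _ in range(3)]
    for j in range(len(months)):
        month_total = sum(totals[i][j] for i in range(3))
        if month_total > 0:  # assumption: months without tweets stay at 0
            for i in range(3):
                matrix[i][j] = totals[i][j] / month_total
    return matrix


def scores_list(matrix: List[List[float]]) -> List[float]:
    """Eq. 5: average of each row of the ScoresMatrix over the months."""
    n_months = len(matrix[0])
    return [sum(row) / n_months for row in matrix]


def e_reputation_score(matrix: List[List[float]]) -> float:
    """Eq. 6: positive share minus negative share, a value in [-1, 1]."""
    averaged = scores_list(matrix)
    return averaged[POSITIVE] - averaged[NEGATIVE]


months = ["2020-05", "2020-06"]
tweets = [
    (NEGATIVE, "2020-05", 4, 8),  # influence score 6
    (POSITIVE, "2020-05", 0, 4),  # influence score 2
    (NEGATIVE, "2020-06", 0, 0),  # influence score 1
    (NEUTRAL, "2020-06", 0, 0),   # influence score 1
    (POSITIVE, "2020-06", 0, 4),  # influence score 2
]
matrix = scores_matrix(tweets, months)
score = e_reputation_score(matrix)
```

In this toy example the negative tweets hold 75% and 25% of the monthly influence and the positive tweets 25% and 50%, so the ScoresList is [0.5, 0.125, 0.375] and the resulting score is -0.125.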
5.2 Analysis Reports
The analysis reports are created using the eReputationScore (Eq. 6), together with the ScoresList (Eq. 5) and the ScoresMatrix (Eq. 4). In order to demonstrate the end-to-end functionality of the API, we created one report that analyses the transportation industry. This industry includes three companies, which will be further referred to as Company A, Company B and Company C. The selected period of time was May 2020 to September 2020. As can be seen from the pie chart (Fig. 6), tweets mentioning the selected companies in the transportation industry are predominantly negative in the analysed time frame, with 66.1% of the tweets labelled as negative, 10.9% as neutral and just 23% as positive. In the industry top (Fig. 7) we can also observe that Company A is labelled as “bad” (an eReputationScore between –0.6 and –0.2), while Company B and Company C are labelled as “very bad” (an eReputationScore between –1 and –0.6). The industry top is calculated using the eReputationScore (Eq. 6), which is a floating point number in the [–1, 1] interval. The closer it is to –1, the worse the public image of the brand, while the closer it is to 1, the better.
Fig. 6. Transportation industry piechart for company A, company B, company C
Fig. 7. Transportation industry top
Fig. 8. Transportation industry treemap for company A, company B, company C
The insights seen in the industry top are replicated on the treemap (Fig. 8). The Company A rectangle is green because its perception is quite good relative to its peers in the industry, while the rectangles of Company B and Company C are different shades of red because the perception of them is worse than that of Company A. Another aspect we can infer from the treemap is the prevalence of a company on Twitter, as the size of its rectangle is directly proportional to the number of tweets about that company found in the scraping process. From the line chart (Fig. 9) we can see that the image of the transportation industry generally got worse during the summer, as the number of negative tweets increased in the summer months and the number of positive tweets decreased.
Fig. 9. Transportation industry linechart for company A, company B, company C
6 Discussions and Future Works
From our study, we found that it is imperative to automatically differentiate between tweets that actually talk about a company and tweets about another subject that merely use the brand name as a word. By using Named Entity Recognition (NER) we would prevent irrelevant tweets from being taken into account in the brand analysis of a company. Another feature that would bring value is word segmentation of merged hashtags. In the microblogging world, it is very common to find hashtags of the form “#abc”, where a, b and c are all different valid words, hence the need to identify the boundaries of each one, eliminate the “#” and introduce spaces between a, b and c in order to store them as three different words and not as a single, unintelligible long word. The crucial subproblem of the brand analysis process is sentiment analysis, where there is plenty of room for improvement in the future: for instance, labelling neutral tweets as well or, even better, creating a dataset for sentiment analysis on Romanian tweets that labels a broader range of sentiments, like happiness, fear, disgust, etc. Other approaches could include an initial classification between subjective and objective tweets, to separate news from people’s opinions, followed by a sentiment classification. Creating more classes of sentiments for the emojis/emoticons, rather than just the positive/negative ones, could also bring value to our analysis.
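As an illustration of the hashtag segmentation mentioned above, one possible approach is dictionary-based word breaking with dynamic programming. This sketch was not part of the experiments reported here: the function name, the vocabulary and the choice to prefer the segmentation with the fewest words are all assumptions.

```python
from typing import List, Optional, Set


def segment_hashtag(hashtag: str, vocabulary: Set[str]) -> List[str]:
    """Split a merged hashtag into known words, preferring the fewest words."""
    text = hashtag.lstrip("#").lower()
    # best[i] holds the best segmentation of text[:i], or None if there is none.
    best: List[Optional[List[str]]] = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            prefix = best[start]
            if prefix is not None and piece in vocabulary:
                candidate = prefix + [piece]
                current = best[end]
                if current is None or len(candidate) < len(current):
                    best[end] = candidate
    result = best[len(text)]
    # Fall back to the unsplit text when no full segmentation exists.
    return result if result is not None else [text]


vocabulary = {"cafe", "cafea", "buna", "dimineata"}
words = segment_hashtag("#cafeabunadimineata", vocabulary)
```

With this vocabulary the hashtag is stored as the three words "cafea", "buna" and "dimineata"; the shorter entry "cafe" is rejected because the remainder of the hashtag cannot be segmented after it.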
7 Conclusions
Our study aimed to find the best sentiment analysis approach for Twitter data in Romanian, in order to develop a model that can be used in practice through an API that extracts meaningful information about how a certain brand is perceived by consumers, either alone or compared with another brand or with its peers in the industry. To achieve this objective, we showed that transfer learning from a model trained on reviews is not a solution, so we created a novel sentiment analysis dataset of manually labelled Romanian tweets. We passed the labelled tweets through a custom preprocessing pipeline, validated by our experiments, and trained several sentiment analysis models in order to choose the optimal one for our Romanian sentiment analysis task. A neural network with fastText word embeddings proved to have the best results, so this model was further used in our proposed brand analysis and monitoring application, which quantifies the public perception of a brand using the previously defined e-reputation scores. Social media will become more and more influential in the future, as more people start using it and as the amount of data available online increases. Brands must take this into account in order to keep their competitive advantage, and use a brand monitoring and analysis tool to track the evolution of their e-reputation.

Acknowledgments. The work presented in this article is the result of an internship which took place at the Advanced Technologies Institute during June-September 2020. Alexandra Ciobotaru: supervision, guidance, datasets exploration & visualisation, graphical & equation editing; Lucian Istrati: data collection, data preprocessing, API development, reports visualisation; Both: data labelling, conceptualization & research, writing & editing, training, fine-tuning & model evaluation. We would like to thank Ligia Maria Batrînca from Institute of Advanced Technologies, as well as Liviu Dinu and Sergiu Nisioi from University of Bucharest, for useful discussions and valuable feedback. Also, we would like to thank Cognience SRL for funding the participation at Intellisys 2021 conference.
References
1. Ahmad, M., Aftab, S., Ali, I.: Sentiment analysis of tweets using SVM. Int. J. Comput. Appl. 177, 975–8887 (2017)
2. Alessa, A., Faezipour, M., Alhassan, Z.: Text classification of flu-related tweets using fastText with sentiment and keyword features, pp. 366–367 (2018)
3. Alshammari, N., AlMansour, A.: State-of-the-art review on Twitter sentiment analysis, pp. 1–8 (2019)
4. Bahrawi, B.: Sentiment analysis using random forest algorithm online social media based, vol. 2, pp. h.29-33 (2019)
5. Banea, C., Mihalcea, R., Wiebe, J.: Multilingual subjectivity: are more languages better?, vol. 2, pp. 28–36 (2010)
6. Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 747–754. Association for Computational Linguistics (2017)
7. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 07 (2016)
8. Buzea, M.C., Trausan-Matu, S., Rebedea, T.: A three word-level approach used in machine learning for Romanian sentiment analysis, pp. 1–6 (2019)
9. Cliche, M.: BB_twtr at SemEval-2017 task 4: Twitter sentiment analysis with CNNs and LSTMs. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 573–580. Association for Computational Linguistics (2017)
10. Colhon, M., Cerban, M., Becheru, A., Teodorescu, M.: Polarity shifting for romanian sentiment classification. In: 2016 International Symposium on INnovations in Intelligent SysTems and Applications (INISTA), pp. 1–6 (2016) 11. Bădică, A.S.C., Colhon, M.: Sentiment analysis of tourist reviews: data preparation and preliminary results. In: Proceedings of the 10th International Conference “Linguistic Resources and Tools for Processing the Romanian Language”, pp. 135–142 (2014) 12. D’Andrea, A., Ferri, F., Grifoni, P., Guzzo, T.: Approaches, tools and applications for sentiment analysis implementation. Int. J. Comput. Appl. 125, 26–33 (2015) 13. Dumitrescu, S., Avram, A.M., Pyysalo, S.: The birth of Romanian BERT. In: Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4324–4328. Association for Computational Linguistics (2020) 14. Dutot, V., Castellano, S.: Designing a measurement scale for e-reputation. Corp. Reput. Rev. 18, 294–313 (2015) 15. Felix, N., Hruschka, E., Hruschka, E.: Biocom usp: tweet sentiment analysis with adaptive boosting ensemble (2014) 16. Gavilanes, M.F., Álvarez López, T., Juncal-Martínez, J., Costa-Montenegro, E., González-Castaño, F.: Gti: an unsupervised approach for sentiment analysis in twitter, pp. 533–538 (2015) 17. Héroux-Vaillancourt, M., Beaudry, C., Rietsch, C.: Using web content analysis to create innovation indicators what do we really measure? In: Quantitative Science Studies, pp. 1–37 (2020) 18. Joshi, S., Deshpande, D.: Twitter sentiment analysis system. Int. J. Comput. Appl. 180, 35–39 (2018) 19. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification, pp. 427–431 (2017) 20. Kamath, C., Bukhari, S., Dengel, A.: Comparative study between traditional machine learning and deep learning approaches for text classification, pp. 1–11 (2018) 21. Khan, F., Bashir, S., Qamar, U.: Tom: twitter opinion mining framework using hybrid classification scheme. Decis. Supp. Syst. 57, 245–257 (2014)
22. Novak, P.K., Smailović, J., Sluban, B., Mozetič, I.: Emoji sentiment ranking 1.0. Slovenian language resource repository CLARIN.SI (2015) 23. Lahuerta-Otero, E., Cordero-Gutiérrez, R., De La Prieta, F.: Retweet or like? that is the question. Online Inf. Rev. 42, 562–578 (2018) 24. Liu, B.: Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers, San Francisco (2012) 25. Marcu, D., Danubianu, M.: Sentiment analysis from students’ feedback: a romanian high school case study, pp. 204–209 (2020) 26. Orkphol, K., Yang, W.: Sentiment analysis on microblogging with k-means clustering and artificial bee colony. Int. J. Comput. Intell. Appl. 18, 07 (2019) 27. Pitiriciu, S.: De la abrevieri la conversațiile pe internet. Studia Universitatis “Petru Maior” Philologia 9, 66–73 (2010) 28. Rosenthal, S., Farra, N., Nakov, P.: Semeval-2017 task 4: sentiment analysis in twitter, pp. 502–518 (2017) 29. Saif, H., Fernandez, M., Alani, H.: On stopwords, filtering and data sparsity for sentiment analysis of twitter. In: Proceedings of the 9th International Language Resources and Evaluation Conference (LREC’14), pp. 810–817 (2014) 30. Salhi, D.E., Tari, A., Kechadi, T.: Sentiment analysis application on twitter for e-reputation (2019)
31. Shoeb, M., Ahmed, J.: Sentiment analysis and classification of tweets using data mining. Int. Res. J. Eng. Technol. (IRJET) 04(12) (2017) 32. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods for sentiment analysis. Comput. Linguist. 37, 267–307 (2011) 33. Thakkar, H., Patel, D.: Approaches for sentiment analysis on twitter: a state-of-art study (2015) 34. Vartiak, L.: Benefits of online reputation management for organizations operating in various industries (2015) 35. Vashishtha, S., Susan, S.: Fuzzy rule based unsupervised sentiment analysis from social media posts. Exp. Syst. Appl. 138, 112834 (2019) 36. Wolny, W.: Emotion analysis of twitter data that use emoticons and emoji ideograms. In: ISD (2016) 37. Younas, F., Owda, M.: Spatial sentiment and perception analysis of BBC news articles using twitter posts mining, pp. 335–346 (2021)
Natural Language Processing in the Support of Business Organization Management
Leszek Ziora
Czestochowa University of Technology, Dabrowskiego 69, Czestochowa, Poland
Abstract. The aim of the paper is to present the role of Natural Language Processing (NLP) in the support of contemporary business organization management, focusing especially on decision-making processes. The article emphasizes the characteristics of NLP, including its evolution, key technological components, and development trends. It also presents research on the advantages and disadvantages of applying NLP in today's enterprises and reviews selected case studies, reports, and practical examples.
Keywords: Natural language processing · Decision making support · Business analytics · Machine learning · Artificial intelligence · Big data
1 Introduction
Natural Language Processing solutions are widely applied in many sectors, such as finance, banking, telecommunications, e-commerce, medicine, and healthcare. The field is under constant development and improves every year. One example of this progress is the latest GPT-3 language generator by OpenAI, which can be applied to, e.g., text generation and completion, conversations with humans, and document analysis. The paper addresses the applicability of NLP in the support of contemporary business organization management and underlines the benefits of applying it in the management process, e.g. improvement of the decision-making process. It seems that learning from experience and improving the performance of existing solutions might be among the key elements in the further development of NLP and AI [19].
2 Characteristics of Natural Language Processing
Natural Language Processing (NLP) solutions are increasingly implemented in the management of contemporary business organizations. L. Deng and Y. Liu underline the interdisciplinary nature of NLP and state that it “combines computational linguistics, computing science, cognitive science and artificial intelligence” [4]; the same authors define NLP in general terms as “the use of computers to process or to understand human (i.e., natural) languages for the purpose of performing useful tasks” [4]. D. Khurana et al. present a broad classification of NLP, dividing it into Natural Language Generation, with Natural Language Text, and Natural Language Understanding (Linguistics), subdivided into Phonology, Morphology, Pragmatics, Syntax, and Semantics [9]. This paper proposes a holistic view of NLP that combines the following application areas: sentiment analysis, machine translation, auto-correction with grammar verification, voice recognition, chatbots and assistants, and text mining (e.g. text extraction, classification, summarization) (Fig. 1). NLP solutions support multiple business domains, such as customer service, advertising campaigns, and market research, e.g. by facilitating various analyses and reports. M.Z. Kurdi perceives NLP as a “discipline found at the intersection of Computer Science, Artificial Intelligence and Cognitive Psychology” [11].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 76–83, 2022. https://doi.org/10.1007/978-3-030-82199-9_6
Fig. 1. A holistic view of natural language processing, comprising: machine translation with auto-correction and grammar verification; sentiment analysis (supervised and unsupervised learning); chatbots and assistants; voice/speech recognition; and text mining (extraction, classification, summarization). Source: Author’s study
A broad range of NLP applications in text mining is presented by Kulkarni and Shivananda, who focus on data extraction (e.g. collecting data from different text files), exploring and processing data (e.g. text standardization, tokenizing, stemming, lemmatizing, text data exploration), converting text to features, clustering documents, and more. Also worth noting are the deep learning applications of NLP in information retrieval, text classification, and next-word prediction [10]. The implementation of machine learning algorithms for NLP is thoroughly presented by T. Beysolow, who reviews popular machine learning and deep learning packages such as the open-source TensorFlow, which uses “data flow graphs for computational processes”; Keras, developed for prototyping applications and production engineering; and Theano, which includes different computational functions in comparison with TensorFlow [2]. NLP solutions are based on machine learning and may play a crucial role in improving the quality of business processes in an enterprise, as well as in increasing the business value of the entire decision-making process. The scope of NLP applicability is also constantly growing to embrace more sectors. Business organizations apply supervised and unsupervised solutions to support and improve the decision-making process at every stage and to accelerate everyday business operations [21].
One of the most significant applications of NLP is sentiment analysis, or opinion mining, which identifies the positive, neutral, or negative polarity of a comment posted, e.g., on social media or web portals, and can be conducted at the level of the whole document, the sentence, or a selected aspect. The general classification of sentiment analysis in the literature distinguishes two main approaches. The first is machine learning, which is divided into supervised learning (decision trees, linear classifiers such as neural networks or support vector machines) and unsupervised learning. The second is the lexicon-based approach, comprising dictionary-based and corpus-based methods [8]. C.A. Iglesias and A. Moreno state that “sentiment analysis technologies which enable the automatic analysis of the information distributed through social media to identify the polarity of posted opinions have been extended to analyze other aspects such as users’ emotions, combining text analytics with other inputs, including multimedia analysis or social networks analysis” [5]. Data sources for such analysis include traditional media (TV, radio, press), social media, discussion forum posts, blogs and microblogs, YouTube comments, and book and film review sites [6]. Qualitative and quantitative data, as well as big data, play a significant role as data sources for NLP analyses [7]. B.
Liu also states that “nowadays, individuals, organizations and government agencies are increasingly using the content in social media for decision making” and confirms the growing number of business organizations applying this solution in industries and domains such as financial services, social events, health care, tourism, and hospitality [13]. A.R. Pathak, B. Agarwal, M. Pandey, and S. Rautaray confirm that sentiment analysis “helps end users and business industries in decision-making for purchasing products, launching new products, assessing the industry reputation” [15]. The same authors provide an overview of deep learning applications and present a graphical taxonomy of sentiment analysis based on traits such as levels, polarity, output, language support, applicability to domain, text representation, evaluation metrics, benchmarked datasets, and tools [15].
As far as chatbots are concerned, their popularity is growing in different business domains, especially in customer care at banking, financial, telecommunications, and healthcare institutions. The need for chatbots, especially from the business perspective, is underlined by S. Raj, who treats them as a marketing tool and lists their advantages: accessibility (every consumer may ask any question instantaneously), efficiency, availability (24/7), scalability, cost, and insights. The same author describes how chatbots bring revenue to the organizations that use them in daily business activity and presents selected success stories: Sephora, which “increased makeover appointments by 11% via Facebook Messenger chatbot”; Nitro Café, which “increased sales by 20% with their Messenger chatbot” that allows ordering and payments; and Sun’s Soccer, where chatbots “drove nearly 50% of its users back to their site throughout specific soccer coverage” [18]. Another application of NLP might be the creation of ontologies used, e.g., in the support of process management, where “the idea of corporate ontology is supposed to help organize the terminology database used in a specific environment” [17].
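The dictionary-based variant of the lexicon-based approach mentioned in this section can be illustrated with a minimal sketch. It is not taken from any of the cited works: the toy lexicon, its weights, the negation handling, and the function name `polarity` are all hypothetical and serve only to show how a comment might be assigned a positive, neutral, or negative polarity at the document level.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight (illustrative values only).
LEXICON = {
    "good": 1, "great": 2, "excellent": 2, "helpful": 1,
    "bad": -1, "poor": -1, "terrible": -2, "slow": -1,
}
# Hypothetical negation words that flip the next lexicon word.
NEGATORS = {"not", "no", "never"}


def polarity(text: str) -> str:
    """Classify a text as positive, neutral, or negative by summing lexicon weights."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True  # flip the sign of the next word found in the lexicon
            continue
        weight = LEXICON.get(tok, 0)
        if weight:
            score += -weight if negate else weight
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for comment in (
        "The service was great and the staff were helpful",
        "Delivery was not good",
        "The parcel arrived on Tuesday",
    ):
        print(comment, "->", polarity(comment))
```

A production system would rely on a curated lexicon or a trained classifier; the sketch only makes the dictionary-lookup idea concrete.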
3 Selected Areas of NLP Business Applications: Review of Research and Practical Examples
Natural Language Processing finds application in multiple areas connected to the management of a business organization. In logistics, and specifically supply chain management, it is worth mentioning the benefits listed by Blume Global, such as “reduction or removal of language barriers for consistent communications and better supplier relationships, improvement of information received from supply chain stakeholders through chatbot and adaptive interview technology, optimization of the supply chain through querying complex datasets using natural language, enhancement of customer satisfaction through smart, automated customer service, understanding and mitigating potential risks with suppliers, manufacturers and other supply chain stakeholders through analyzing reports, industry news, social media, and other areas” [3]. In the broad area of finance, A. LaPlante and T.F. Coleman mention that “NLP can be used to significantly reduce the manual processing required to retrieve corporate data from sources including the Security Exchange Commission’s” and also state that NLP can be used to verify the consistency between company reports and financial statements and provides an efficient means of monitoring consumer and investor sentiment. Sentiment analysis in finance embraces the analysis of customers’ opinions on specific financial products or services, fraud detection, credit risk scoring and assessment, targeting of product offerings, and automation of algorithms connected to trade [12].
Sentiment analysis tools can also be applied to the broadly understood management of local government entities, as shown by research conducted in 46 municipalities in the northern part of the Silesian voivodeship in Poland on the possibility of using SA tools for municipal management. This research showed that social media, comments posted on local portals, and the content of correspondence with commune offices are considered the main sources of data for sentiment analysis. The research also indicated that automated testing and analysis of the nature of utterances and statements for the needs of municipal management remains a largely undeveloped area. Sentiment analysis tools are not widely known among municipal boards; however, most of the surveyed communes (about 80%) admit that they try to follow and analyze opinions expressed by residents.
The Accenture report estimates the size of the Natural Language Processing market in 2021 at 16 billion dollars and presents the top NLP applications in business. The report states that, besides structured data such as spreadsheets and tables, almost 80% of data is unstructured and embraces, e.g., videos, voice recordings, and social media posts. The report also points out that AI solutions play a crucial role in “processing and analysis of large volumes of unstructured data and NLP is one of three AI-driven capabilities that enterprises harness to create business value and competitive advantage” [1]. The report lists many benefits of NLP applications, including “improved understanding of user queries and enterprise content”; in the case of chatbots, improvement of business processes, reduction of support costs, and improvement of brand reputation; in the case of intelligent document analysis, improved compliance and risk management, enhancement of business processes, and better internal operational efficiency; in the case of document search and match, reduced time to fill a job position, increased revenue, and cost reduction; and in the case of sentiment analysis, provision of marketing and competitive intelligence, enhanced product development, and improved customer retention [1].
Ch. Pimm et al. described incident and accident reporting at Air France with the help of NLP solutions which are “language-independent and are using categorization”. The authors combined a similarity analysis system with an automatic classification and category suggestion system to filter certain dimensions of similarity [16]. Another solution worth mentioning is empathiq.io, a “patients experience platform that improves health care system through automated feedback analysis using natural language processing”. This platform uses NLP to assess the “sentiment of opinions depending on predefined categories like quality of medical care, quality of service. It also gives the possibility to change the opinion by reaching out to people who wrote bad reviews and the key features include data collection mechanism, evaluation system, administrative and client panel, integration of the systems and different brand sharing options” [20]. In healthcare, two main use cases are worth noting: “comprehending human speech and extracting its meaning and unlocking unstructured data in databases and documents by mapping out essential concepts and values and to use this information for decision making and analytics”. Applications in this sector include speech recognition (using e.g. Hidden Markov models), data mining research, computer-assisted coding, automated registry reporting, clinical decision support, risk adjustment, and clinical trial matching [14]. The reports and case studies clearly indicate the progress in the field of NLP as well as the flexibility and utility of these solutions as applied in many contemporary companies.
4 Survey Research Results
As far as research methodology is concerned, the respondents were 50 students of intramural and extramural studies in the management field of study at the Czestochowa University of Technology. They were asked to indicate the benefits resulting from the application of Natural Language Processing solutions in business organizations, the perspectives for its development, and which solutions they had already used. The survey was completed in electronic form, and the answers were then analyzed. Many of the respondents indicated the following benefits: machine translation, e.g. with Google Translate or DeepL (especially real-time translation) (90%); chatbots, which enable better and faster customer service (86%); document classification that recognizes semantic patterns and syntax (80%); spam classification in e-mail (78%); sentiment analysis (76%); extraction of information about products and services with real-time access to that information (74%); searching for different analyses, e.g. Semantic Scholar, and question answering, e.g. the Wolfram Alpha platform (72%); speech recognition, including transcription of audio and video files, speech to text, and text to speech (70%); and speech synthesis (spoken dialogue systems), e.g. as applied in smart homes (68%) and, subsequently, in office and smart city solutions. The main benefit here is the acceleration and automation of routine daily tasks. These results are presented in Fig. 2. It seems that the greater popularity of, e.g., machine translation over sentiment analysis is due to the specificity of the research environment, as the research was conducted from the perspective of users of NLP solutions.
Fig. 2. NLP application areas – survey research results: machine translations 90%, chatbots usage 86%, document classification and pattern recognition 80%, spam classification 78%, sentiment analysis 76%, extraction of information 74%, speech recognition 72%, speech synthesis 70%, smart home 68%. Source: Author’s Study
As far as the future of NLP solutions is concerned, most of the respondents pointed to searching for information with voice assistants on a much greater and wider scale than today, better developed human-computer interaction, the ability to talk to, e.g., robots, etc. As far as solutions used on a daily basis are concerned, the most frequently used NLP applications were machine translation (94%), voice assistants such as Google Assistant, Alexa, Cortana, or Siri (86%), and chatbots and virtual assistants (possibly in avatar form) that help present offers of products and services, e.g. insurance policies or telecommunications offers (70%). A limitation of the study should also be highlighted: the research sample was selected in an academic environment.
5 Conclusion
Natural language processing provides multiple benefits to contemporary business organizations. Such solutions facilitate communication between humans and computers, make it possible to conduct sentiment analysis, enable translation between languages, and support e-mail filtering, information extraction, and voice recognition. Its business benefits embrace acceleration of the decision-making process and improvement of its quality, reduction of operating costs, creation of business value, and increased profits thanks to the application of different business analyses using tools such as managerial dashboards. All this may lead to competitive advantage in the worldwide market and to greater customer and employee satisfaction. NLP is implemented in almost every industry, from retail sales to healthcare. With the increase in computational power and the wider use of deep learning, an increase in the quality, efficacy, and efficiency of NLP solutions can also be observed.
References
1. Accenture report: top natural language processing applications in business. Unlocking value from unstructured data (2019). https://www.accenture.com/_acnmedia/PDF-106/AccentureUnlocking-Value-Unstructured-Data.pdf
2. Beysolow, T.: Applied natural language processing with python. In: Implementing Machine Learning and Deep Learning Algorithms for Natural Language Processing, pp. 4–9, Apress, San Francisco (2018)
3. Blumeglobal: https://www.blumeglobal.com/learning/natural-language-processing/
4. Deng, L., Liu, Y. (eds.): Deep Learning in Natural Language Processing, Springer Nature Singapore, p. 1 (2018)
5. Iglesias, C.A., Moreno, A.: Sentiment Analysis for Social Media, MDPI, Basel, Switzerland, p. 1 (2020)
6. Jelonek, D., Stępniak, C., Turek, T., Ziora, L.: Planning city development directions with the application of sentiment analysis. Prague Economic Papers 29, 274–290 (2020). https://pep.vse.cz/corproof.php?tartkey=pep-000000-0643
7. Jelonek, D., Stępniak, C., Ziora, L.: The meaning of big data in the support of managerial decisions in contemporary organizations: review of selected research. In: Proceedings of Future of information and Communication Conference (FiCC 2018), Singapore, pp. 195–198 (2018)
8. Jelonek, D., Mesjasz-Lech, A., Stępniak, C., Turek, T., Ziora, L.: Potential data sources for sentiment analysis tools for municipal management based on empirical research. In: Arai, K., Bhatia, R. (eds.) FICC 2019. LNNS, vol. 69, pp. 708–724. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-12388-8_49
9. Khurana, D., Koli, A., Khatter, K., Singh, S.: Natural Language Processing: State Of The Art, Current Trends and Challenges (August 2017). https://arxiv.org/abs/1708.05148
10. Kulkarni, A., Shivananda, A.: Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning using Python. Apress, Bangalore, Karnataka, India (2019)
11. Kurdi, M.Z.: Natural Language Processing and Computational Linguistics 1. Speech, Morphology, Syntax, Wiley (2016)
12. LaPlante, A., Coleman, T.F.: Teaching computers to understand human language: how natural language processing is reshaping the world of finance (2017). https://globalriskinstitute.org/publications/natural-language-processing-reshaping-world-finance/
13. Liu, B.: Sentiment analysis. Second edition. Mining Opinions, Sentiments, and Emotions, p. 6. Cambridge University Press, New York (2020)
14. Marutitech: Top 12 use cases of Natural Language Processing in Healthcare. https://marutitech.com/use-cases-of-natural-language-processing-in-healthcare/
15. Pathak, A.R., Agarwal, B., Pandey, M., Rautaray, S. (eds.): Deep Learning-Based Approaches for Sentiment Analysis. Springer Nature Singapore (2020)
16. Pimm, Ch., Raynal, C., Tulechki, N., Hermann, E., Caudy, G., et al.: Natural Language Processing (NLP) tools for the analysis of incident and accident reports. In: International Conference on Human-Computer Interaction in Aerospace (HCI-Aero), Brussels, Belgium (2012). https://www.researchgate.net/publication/280751863_Natural_Language_Processing_NLP_tools_for_the_analysis_of_incident_and_accident_reports
17. Stępniak, C., Turek, T., Ziora, L.: The role of corporate ontology in IT support in processes management. In: Bi, Y., Bhatia, R., Kapoor, S. (eds.) Intelligent Systems and Applications 2019, pp. 1285–1297. Intellisys, Springer, Cham (2020)
18. Raj, S.: Building Chatbots with Python. Using Natural Language Processing and Machine Learning, pp. 5–7. Apress, Bangalore (2019)
19. Russell, S.: Human compatible artificial intelligence and the problem of control, p. 91. Viking (2019)
20. Zięba, B.: Natural language processing: empathiq.io Case Study (2018). https://inovatica.com/blog/natural-language-processing-empathiq-cs/
21. Ziora, L.: Machine learning solutions in the management of a contemporary business organisation. J. Decis. Syst. Taylor Francis (2020). https://doi.org/10.1080/12460125.2020.1848378
Discovering Influence of Yelp Reviews Using Hawkes Point Processes
Yichen Jiang1 and Michael Porter1,2
1 Department of Engineering Systems and Environment, University of Virginia, Charlottesville, VA 22903, USA
{yj3us,mdp2u}@virginia.edu
2 School of Data Science, University of Virginia, Charlottesville, VA 22903, USA
Abstract. With the development of technology, social media and online forums are becoming popular platforms for people to share opinions and information. A major question is how much influence these have on other users’ behavior. In this paper, we focus on Yelp, an online platform where customers share their experiences of visiting restaurants, to explore possible relationships between past and future reviews of a restaurant through multiple aspects such as star-ratings, user features, and sentiment features. Using a lasso regression model with review features processed through a Hawkes process model, and with B-spline basis functions modeling the restaurant's baseline performance, we found that average star-ratings, low star-ratings, and sentiment features of past reviews have a significant influence on future reviews. Because the dataset is limited, we simulated restaurant reviews using multinomial logistic regression and rebuilt the model. Finally, a verification step was performed using logistic regression. The simulation and verification results support the prior findings, indicating that influence between past and future reviews does exist and can be revealed in multiple aspects.
Keywords: Yelp reviews · Hawkes Point Process · Data mining · Sentiment features
1 Introduction
With the development of technology, users can seek and receive information remotely from online platforms, and generating and sharing information has become more convenient and inexpensive. As a result, the internet is full of redundant, unconfirmed information, which requires receivers to exercise their own judgment and also tests the ability of online platforms to filter out unwanted information. Through online interaction, users may be affected to some degree by the information they receive, and such potential influence will be reflected and revealed in their subsequent behavior, online or offline, in the short or long term. Moreover, users’ opinions can be artificially guided and steered in a direction that benefits the platforms or other related profiteers without the users noticing. Tracking or mining such influence is therefore of interest to us.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 84–104, 2022. https://doi.org/10.1007/978-3-030-82199-9_7
Among the various online platforms, Yelp drew our attention and became the target of our research. Yelp is the most popular online forum for information sharing between restaurant customers in the US. By hosting user reviews, Yelp provides an interactive platform where customers can display and share pictures and opinions about local businesses. Potential customers thus receive a general impression of a business from the reviews and can decide whether they are still interested in visiting, which is convenient for them. Specifically, Yelp users can post reviews of a restaurant describing their experience during the visit, including but not limited to meal quality, environment, and service. More importantly, each review is accompanied by a star-rating of the restaurant, ranging from 1 to 5, which represents the customer's overall feeling about the visit. An average star-rating computed from all reviews is displayed on the main page of each restaurant, representing the general feeling of the majority of customers. The average star-rating gives potential customers their first impression and is therefore critical and influential: a relatively high proportion of reviews with high star-ratings makes a restaurant more likely to attract customers, while lower star-ratings have the opposite effect. These types of user-generated content may affect customers’ opinions of a business.
However, this may not proceed in the direction that the restaurant intends: user may generate the review content based on their subjective feelings which may deviate from the reality, intentionally or unintentionally attracting users; In addition, competitors of one business may employ water army to post false content to discredit the business maliciously. With the development of online forum and social media, an increasing number of customers prefer looking at the reviews and recommendations before visiting the business, such that the unreliable content may mislead the customers, affect the users’ first impression towards the business. These negative effects will snowball and may eventually give rise to revenue loss to the business. Therefore, it is important to explore such dependency among reviews to investigate how a past review will influence the subsequent reviews. Many of the prior studies have already brought this topic to people’s attention by investigating the influence of user-generated content towards the information receivers, and such influence could be reflected on the enhance of users’ purchase intention, and the benefits the businesses could receive from. One study of Yelp review persuasiveness has been performed which indicates the higher trustworthiness of positive reviews other than negative reviews and two-sided reviews with respective to user attitude and purchase intentions [1]. A study of investigating the motives of reading and articulating the reviews on Yelp has found that people are likely to perceive benefit from other reviewers’ experience and share their experience with others, and such likelihood is positively related
Y. Jiang and M. Porter
to users’ income [2]. Businesses can also benefit from such review platforms and from the reviews their users generate. More specifically, a one-star increase in Yelp star-rating has been found to lead to a five to nine percent increase in business revenue [3]. The impact of restaurant reviews on a Yelp user is intuitively clear and can be observed in the growing user population of restaurant review websites. However, the impact of a prior review on future reviews has hardly been investigated, which makes this study challenging. One study of sequential dependencies in Yelp review ratings suggests that both within-reviewer and within-business ratings are influenced by their previous ratings [4], which reveals the dependency/influence and subjectivity of reviews. However, this is the only study we have found that focuses on review dependency/influence, and such influence between reviews is believed to make this form of information delivery unreliable. If such influence does exist, any information fragment of a review has a certain probability of being received by other reviewers and duplicated in the reviews they subsequently write. This process repeats, and the information fragment is transmitted continuously through the review chain, so that anything that deviates from reality (in a positive or negative way) is preserved in the review chain over time without intervention and can potentially affect the business’s benefit. Therefore, in this paper, we address three research questions regarding the aforementioned review influence: 1) Does such dependency/influence exist between reviews of a business? 2) If so, where does it exist? 3) How can such impact be modeled?
To answer these three questions, we work in reverse order: we first build a regression model of the review chain with variables (detailed in Sect. 3) associated with multiple aspects of a review. By modeling the review chain, any variable that shows statistical significance is considered an aspect that reveals influence among reviews, hence indicating the existence of review influence for a business; if no variable is found to be statistically significant, then such influence among reviews remains to be discovered. However, from what we have learned from prior studies, the existing approaches to these questions still have limitations. For instance, the study mentioned above [4], which focused on exploring review dependencies, revealed their existence in review star-ratings between pairs of nearby reviews. Many other aspects that may play a role were overlooked, and they are addressed in the current study. These aspects can be summarized as the following research gaps:
Discovering Influence of Yelp Reviews Using Hawkes Point Processes
1) More influential aspects should be considered. The aforementioned research concentrated on revealing review influence by exploring possible relationships between review star-ratings, so the influence it discovered exists in star-ratings only. However, given the nature of such information delivery, one can infer that the influence of a prior review on its subsequent reviews is not limited to a single aspect; many aspects may be affected, such as content, votes received from Yelp users, and user information [5]. Therefore, more factors should be considered, and features associated with the corresponding aspects should be extracted for further analysis.
2) A review will be influenced by more than one review. The prior research that explored review influence by analyzing sequential dependencies among the reviews of a business treated the problem pairwise [4]; that is, every analysis paired the current review with a single prior review. However, people read more than one past review in order to get as complete a picture of the business as possible before deciding whether to visit. Therefore, the aggregated influence of past reviews on future reviews should be considered.
3) Reviews posted in the far past will still take effect. The aforementioned research [4] only considered prior reviews close to the current review and ignored possible influence from the far past. However, reviews with many votes are recommended and presented on the main page of a business by default through Yelp’s own sorting function, so Yelp users will easily read those reviews without sorting them themselves and may be affected by them. In addition, people can read reviews posted in the far past through keyword searching, which greatly raises the possibility of these reviews being read by users.
4) Avoid the influence from businesses. The impact between reviews can be captured by analyzing the similarity between reviews. One approach, taken in [4], is to check whether the current review behaves in the same way when the star-rating of a prior review deviates from the average rating of the business. The average star-rating of a business is a baseline that must be kept in view, since similarities between reviews may come either from review influence or from the baseline itself, that is, from users having similar experiences when visiting the same business. Research should therefore analyze the ‘pure’ review influence on top of a model of the baseline, to avoid bias from the background.
In order to answer the research questions and address the above research gaps, we proposed a novel hybrid model which incorporates Lasso
Regression model with Hawkes Point Process, where the Hawkes Point Process is used to capture and accumulate the potential impact that a prior review may deliver to its subsequent reviews. The review chain of any business on Yelp can be considered a Hawkes Point Process: the potential impact a review receives from each of its prior reviews is computed with the Hawkes Process model and added onto the variables of that review, thus accumulating and delivering the impact through the review chain. All the variables computed by the Hawkes Point Process are used in the Lasso Regression model to predict the star-rating of each review of a business, where the L1 regularization of the Lasso Regression Model provides shrinkage and thus selects the significant variables for each business. Instead of a constant intercept, we implemented B-Spline basis functions for a varying intercept, which models the basic case of a business and can be considered the baseline. To verify the findings obtained from the Lasso Regression model, we also performed a simulation, in which we implemented Multinomial Logistic Regression to generate fake review ratings based on the distribution of true review ratings for each business. In addition, we shuffled the reviews and matched them with the generated review ratings, and then performed Lasso Regression on these generated fake reviews to verify the findings from the original Lasso Regression model. The detailed methodology is presented in Sect. 4. The paper is organized as follows: we describe the data used and the variables processed in this study in Sect. 3; we then present our proposed model in detail in Sect. 4. The modeling results are presented in Sect. 5. Finally, we conclude our study and discuss its limitations and possible future work in Sect. 6.
2
Literature Review
Our study aims at detecting possible influence between the reviews of a business. We achieve this by predicting review star-ratings from multiple extracted features that contribute to the final result to varying degrees; the differences in contribution expose where review influence is located. The prediction of review star-ratings has become one of the biggest challenges for researchers who are interested in exploring the Yelp platform. Prior studies have adopted various approaches to this goal: Asghar [6] treated it as a classification problem with five classes (corresponding to star-ratings of 1 to 5) and combined four different approaches to text feature extraction: unigrams, bigrams, trigrams, and latent semantic indexing. Moreover, Machine Learning algorithms combined with sentiment analysis have been applied to review rating prediction [7]. Regression models have been used frequently in star-rating prediction: Lasso Regression and Vector Auto-Regression have been developed for long-term and imminent future popularity/rating predictions of Yelp reviews, with the implementation of text
features such as extracted positive and negative unigrams [8]. Aside from review rating prediction, Fan and Khademi applied Linear Regression, Support Vector Regression and Decision Tree Regression, combined with Term Frequency (TF) and Part-of-Speech features, to predict the star-rating of a business. Machine Learning algorithms including Lasso Regression, Random Forest and several other models have been used to discover the factors that affect the preferences of romantic partners in their choice of businesses, with respect to business characteristics, review ratings, and romance-related language used in the reviews [9]; there, Lasso Regression performed better than many other approaches. To capture the possible influence between the reviews of a business, one study explored cognitive sequential dependencies by comparing, across different review distances, how much the current review deviates from the mean [4]. In the current paper, we implemented a Hawkes Point Process Model to capture and aggregate the impact of a prior review on each of its subsequent reviews. The Hawkes Point Process was introduced by A. G. Hawkes [10] and has been widely used to model the occurrence of event series such as earthquakes [11]. Recently, it has frequently been used to model and predict cascades in streaming social media. One study used a Hawkes Point Process to predict the popularity of Twitter cascades with respect to the expected number of future events [12]. Pinto et al. [13] developed a Hawkes-based information diffusion model for topic trend detection in social networks that takes user-topic interactions into consideration; a bi-directional relationship between users and items (user-to-item and item-to-user) was introduced in combination with a temporal point process model to investigate the latent features beneath the networks [14].
Moreover, a Multivariate Hawkes Point Process Model has been applied to Yelp reviews: it was developed to capture the effect of the star-ratings given by users on the star-ratings of subsequent reviews of a business [15]. To the best of our knowledge, all relevant prior studies have investigated review influence/dependency through review star-ratings only. However, multiple aspects of future reviews may be affected and exhibit such effects through information sharing. Moreover, any future review may be affected by multiple prior reviews, which accords with users’ browsing habits, yet this has not been addressed in some of the prior research. To address these gaps, we proposed a hybrid model in which Lasso Regression is adopted for review rating prediction, with features processed through the Hawkes Point Process Model as the independent variables. Building on these prior studies, text and sentiment features extracted from Yelp reviews and user features of the reviewers are used in this study.
3
Data Description
3.1
Raw Data
In order to verify the existence of review influence/dependency among business reviews, one must explore the aspects in which review influence may be located. Past research that investigated review influence among business reviews revealed its existence in review star-ratings. Motivated by this, one may reasonably deduce that other aspects besides star-ratings could also reflect such review influence. Hence, it is necessary to expand the scope of exploration. Specifically, business data (e.g. business star-rating, business review count), user data (e.g. user review count, user fan count), and review data (e.g. review star-rating, review text) were studied and used for building the model. To ensure that the data matches our research goal, we narrowed down the set of qualifying businesses by two criteria: 1) businesses with over 500 reviews; 2) businesses with over 100 reviews posted per year. Thriving businesses easily attract numerous customers who visit and post reviews, thus providing us with abundant corpus data. A dataset filtered by these criteria preserves a relatively complete causal chain of review influence among reviews, if such influence exists. All the filtered businesses were matched with their corresponding reviews and user information.
3.2
Features
Based on the aforementioned data, pre-processing was performed to standardize the data into formats suitable for further modeling and application. In this study, we applied four types of features: star-rating features, user features, text features, and interactions between features.
Star-Rating Features. For star-rating features, we converted the star-rating of each review using one-hot encoding. For instance, the star-rating of a 4-star review is converted into:
– 1-star rating: 0;
– 2-star rating: 0;
– 3-star rating: 0;
– 4-star rating: 1;
– 5-star rating: 0.
Therefore, we obtained five different features for each review. In addition, the average star-rating was calculated as a feature: with the reviews in sequential order, it is the average star-rating of all past reviews up to the current review.
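The star-rating features described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy rating chain are hypothetical, and the sketch assumes that the running average excludes the current review (the text leaves open whether it is included), so the first review has no value.

```python
def one_hot_stars(stars, n_levels=5):
    """Return a 0/1 list with a single 1 at the position of the star-rating."""
    if not 1 <= stars <= n_levels:
        raise ValueError("star-rating must be between 1 and n_levels")
    return [1 if level == stars else 0 for level in range(1, n_levels + 1)]


def running_average(ratings):
    """Average star-rating of the reviews posted before each review.

    Assumption: the current review is excluded, so the first review,
    which has no past reviews, gets None.
    """
    averages, total = [], 0
    for n_past, stars in enumerate(ratings):
        averages.append(total / n_past if n_past else None)
        total += stars
    return averages


# Hypothetical review chain of one business, ordered by posting time.
ratings = [4, 5, 3, 4]
encoded = [one_hot_stars(s) for s in ratings]
past_avg = running_average(ratings)
```

Here `encoded[0]` is the one-hot vector of the 4-star example in the text, and `past_avg` holds one running average per review.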
User Features. For user features, we collected the review count, yelping since, number of fans (since the number of fans is highly correlated with the number of friends, we kept the number of fans rather than both), and the aggregated voting count over three categories (useful, cool and funny) of the poster of the current review. In particular, yelping since measures the length of time between the registration of the user account and the posting of the current review. This length was converted into a numeric value consisting of an integer number of days from the registration day to the posting date of the review, plus a fraction for the hours and minutes of the posting time.
Text Features. To extract information thoroughly from the raw data and obtain better model performance, we extracted text features from Yelp reviews. Initially, six features were considered for extraction from the review texts; however, due to the high Pearson correlation between some of them, three features were included in our model: Average Word Probability, polarity and subjectivity.
1) Average Word Probability. Average Word Probability was obtained from two components: Term Frequency (TF) and Total Term Frequency (TTF). Here we defined a review as a document and all reviews of one business, rather than the reviews of all businesses, as the corpus, since different restaurant-related terms are mentioned for different businesses. The necessary processing steps were performed for calculating TF and TTF, including document tokenization, stemming and stopword removal. Furthermore, the Document Frequency (DF) of each term was calculated, representing the number of documents that contain that term. A controlled vocabulary was constructed from a list of terms filtered by reasonable DF thresholds, which preserved the important text information and reduced the processing cost of the model.
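Unigram features were extracted in this study. The Average Word Probability of each review was then obtained through the following equations:

\begin{align}
\mathrm{AverageWordProbability}(d) &= \operatorname{mean}\big(\mathrm{Prob}(t,d)\big) \tag{1}\\
&= \operatorname{mean}\left(\frac{\mathrm{TF}(t,d)}{\mathrm{TTF}(t)}\right) \tag{2}
\end{align}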
where TF(t,d) refers to the Term Frequency of a term within a document, TTF(t) refers to the Total Term Frequency of a term among all the documents of a business, and Prob(t,d) refers to the probability measurement of a term. In this study, all terms in the controlled vocabulary were included in the above calculation, meaning that terms not appearing in the current document obtain a score of 0 on Prob(t,d); the average for each document was therefore taken over the Prob(t,d) of all terms in the vocabulary.
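As a worked illustration of Eqs. (1) and (2), the following sketch computes the Average Word Probability for a toy corpus. It assumes that the documents are already tokenized, stemmed and stripped of stopwords; the function name, the three toy reviews and the vocabulary are hypothetical and not taken from the Yelp data.

```python
from collections import Counter


def average_word_probability(docs, vocabulary):
    """Mean of TF(t, d) / TTF(t) over all terms t of the controlled vocabulary.

    docs: list of token lists, one per review of a single business.
    vocabulary: list of terms kept after the DF filtering step.
    """
    # TTF(t): total count of term t over all documents of the business.
    ttf = Counter(tok for doc in docs for tok in doc if tok in vocabulary)
    scores = []
    for doc in docs:
        tf = Counter(doc)  # TF(t, d); terms absent from the document count 0
        probs = [tf[t] / ttf[t] if ttf[t] else 0.0 for t in vocabulary]
        scores.append(sum(probs) / len(vocabulary))
    return scores


# Hypothetical pre-processed reviews of one business.
docs = [["pizza", "great", "pizza"], ["great", "service"], ["pizza", "cold"]]
vocabulary = ["pizza", "great", "service"]
scores = average_word_probability(docs, vocabulary)
```

Because every vocabulary term that occurs in the corpus distributes a total probability of 1 across the documents, the scores of all reviews of a business sum to the share of vocabulary terms that occur at all (here, to 1).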
2) Sentiment feature: Polarity. Polarity is a sentiment feature that measures the sentiment polarity of a text; it ranges from -1.0 to 1.0, corresponding to extremely negative to extremely positive sentiment. This feature is extracted with a Python package called “TextBlob” [18] from the raw review text (without any text processing).
3) Sentiment feature: Subjectivity. Another sentiment feature is subjectivity, which measures the subjectivity of a text; it ranges from 0 to 1.0, corresponding to very objective to very subjective. This feature is extracted with the Python package “TextBlob” as well.
Interactions Between Features. Considering the influence of star-ratings and the sentiment features, we implemented interactions between star-ratings and sentiment features, including: 1) interaction between star-rating and Polarity; 2) interaction between star-rating and Subjectivity; 3) interaction between star-rating and Average Word Probability. Each interaction is calculated as the product of the star-rating and the value of the corresponding feature.
3.3
Variables
Based on the aforementioned extraction process, 17 features were extracted from the raw data, covering various aspects of a review. To incorporate the features into the Hawkes Point Process Model, every feature was passed through the Hawkes Point Process Model to aggregate all the possible impacts received from prior reviews. To further model how quickly the impact of different features decays, five values of the decay parameter were selected, so that a total of 17×5=85 variables (also called Hawkes features) were created, with each feature fitted with five decay values. This is discussed further in the next section.
4
Methodology
Discovering and understanding the inner relationships between business reviews with respect to various review aspects is a complicated but meaningful problem. This paper aims at building an appropriate model to reveal the existence of review influence among business reviews and to investigate how reviews affect one another with respect to multiple influential aspects. From the prior studies that addressed these research questions, it is not hard to identify limitations and gaps in the methodology they applied: 1) Reviews are influenced in not just a single aspect; multiple aspects are influenced and reveal such influence; 2) Any review would possibly
be influenced not by a single other review but by all prior reviews together; 3) Reviews posted in the far past would still affect the current review; 4) Review influence should be investigated after similarities derived from the background (business) have been removed. All the aforementioned points are addressed in the proposed method, as reflected in the modeling part presented next.
4.1
Lasso Regression Model
We implemented Lasso Regression to calculate and optimize the coefficient of each variable as the event magnitude αj of that variable. Lasso Regression is a special type of Linear Regression with L1 regularization as shrinkage. The objective of the regression is to minimize the penalized sum of squares:

\begin{equation}
\sum_{i=1}^{n}\Big(y_i-\sum_{j}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^{p}|\beta_j| \tag{3}
\end{equation}

where λ is the tuning parameter that controls the L1 penalty, and generally the smallest λ will be chosen.
4.2
Lasso Regression Model with Hawkes Features (Variables)
To address all the aforementioned problems, we proposed a novel method that integrates the Lasso Regression Model with the Hawkes Point Process Model [10]. The model takes the features extracted from the raw data, processes them through the Hawkes Point Process Model to capture the influence received from all prior reviews (including those posted in the far past), and predicts review star-ratings through the Lasso Regression Model. The model format is as follows:

\begin{align}
y_i &= \beta x_t + C \tag{4}\\
&= \beta\,[\lambda(t)] + C \tag{5}
\end{align}

where β represents the coefficients of the variables and λ(t) represents the variables (Hawkes features) we extracted from the Yelp dataset. The Hawkes Process Model applied for extracting features can be expanded as follows:

\begin{align}
\lambda(t) &= b(t) + \sum_{i:t>t_i} \alpha\, g(t-t_i) \tag{6}\\
&= \sum_{i:t>t_i} \alpha\, g(t-t_i) \tag{7}\\
&= \sum_{i:t>t_i} \alpha\,\delta\, e^{-\delta(t-t_i)} \tag{8}
\end{align}
Since the review series of a business contains only one immigrant event, namely the first review of that business, it can be modeled by a Marked Hawkes Process whose intensity function contains only its summation term, which is why the background term b(t) of Eq. (6) no longer appears in Eq. (7). Here α is the branching factor of the process, which controls the number of events the process may generate; g(t−ti) is the kernel function, and we applied the exponential distribution as the kernel to model the information diffusion process of Yelp reviews, with decay parameter δ. The Lasso Regression Model can then be expressed as follows:

\begin{equation}
y_i = \sum_{j}\sum_{k}\beta_{jk}\Big[\sum_{i:t>t_i}\alpha_j\,\delta_k\,e^{-\delta_k(t-t_i)}\Big] + \sum_{q}\theta_q B_{qi} \tag{9}
\end{equation}

The above model is built on the Lasso Regression Model, where β is the coefficient obtained from the Lasso Regression Model for each variable. The term within the brackets, $\sum_{i:t>t_i}\alpha_j\delta_k e^{-\delta_k(t-t_i)}$, is the variable used in the prediction of review star-ratings, that is, the Hawkes feature: a feature extracted from the raw data and processed through the Hawkes Point Process Model. The summation term $\sum_q \theta_q B_q$ is the intercept term of the Lasso Regression Model, which is replaced by B-Spline basis functions that model the baseline of the business, where q indexes the knots of the B-Spline functions. In particular, we set the year breaks of the review posting times as the knots of the B-Spline functions, and the spline order was set to 3. Moreover, the review posting times, which are accurate to seconds in order to differentiate reviews posted on the same day, are set to be the control points of the B-Spline functions. More information on the B-Spline basis functions is given in the Appendix. Specifically, αj is the branching factor of the j-th feature: αj = 1 for the star-rating of a review, and αj = feature value for all other features of a review (e.g. review count, sentiment score). The term δe^{−δ(t−ti)} is the exponential kernel of the Hawkes Point Process Model, which determines the decay of an information diffusion process. The term ti is the occurrence time of the i-th event, in our case the posting time of the i-th review of a business. The term δk is the k-th value of the decay parameter δ; in our case, five values were selected for the decay parameter, δ ∈ {0.005, 0.05, 0.1, 1, 5}, representing five different decay speeds of the impact received from prior reviews.
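To make the bracketed term of Eq. (9) concrete, the sketch below computes one Hawkes feature for a short review chain with the exponential kernel and the five decay values listed above. It is a minimal, unoptimized illustration: the function name, the toy posting times and the polarity values are hypothetical, and the time unit (days here) is an assumption, since the unit used for t − ti is not stated in this section.

```python
import math

DECAYS = [0.005, 0.05, 0.1, 1, 5]  # decay values delta_k listed in the text


def hawkes_feature(times, marks, delta):
    """Aggregate the decayed impact of all prior reviews on each review.

    times: posting times in increasing order (unit assumed to be days).
    marks: alpha_j contributed by each review for one raw feature
           (1 for a star-rating indicator, the feature value otherwise).
    Returns, for every review at time t, the sum over earlier reviews of
    alpha_j * delta * exp(-delta * (t - t_i)).
    """
    out = []
    for n, t in enumerate(times):
        total = 0.0
        for t_i, alpha in zip(times[:n], marks[:n]):
            if t_i < t:  # only strictly earlier reviews (i : t > t_i)
                total += alpha * delta * math.exp(-delta * (t - t_i))
        out.append(total)
    return out


# Hypothetical chain of three reviews and their polarity scores.
times = [0.0, 1.0, 3.0]
polarity = [0.5, -0.2, 0.8]
features = {delta: hawkes_feature(times, polarity, delta) for delta in DECAYS}
```

Each raw feature yields one such column per decay value, so a small δ keeps far-past reviews influential while a large δ concentrates the weight on the most recent ones.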
Based on the aforementioned review aspects, we defined and extracted three types of raw features from the Yelp review dataset: star-rating features, user features and text features, 17 features in total. Each feature was matched with five different values of the decay parameter when calculating the aggregated influence received from past reviews through the Hawkes Point Process Model; hence, a total of 17×5=85 variables were created and entered into the Lasso Regression Model. Owing to the shrinkage property of the Lasso Regression Model, the variable with the most appropriate value of the decay parameter (the one that fits the real decay speed) will be selected as a significant feature.
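The shrinkage behaviour referred to above can be illustrated with a small solver for the objective in Eq. (3). The paper does not state which solver was used; the sketch below swaps in cyclic coordinate descent with soft-thresholding, a standard algorithm for the Lasso, and omits the B-Spline intercept for brevity. The function names and the toy data are hypothetical.

```python
def soft_threshold(rho, gamma):
    """Shrink rho toward zero by gamma; return 0 inside [-gamma, gamma]."""
    if rho > gamma:
        return rho - gamma
    if rho < -gamma:
        return rho + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize sum_i (y_i - sum_j x_ij b_j)^2 + lam * sum_j |b_j|.

    For this unscaled objective the per-coordinate threshold is lam / 2.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            z = sum(X[i][j] ** 2 for i in range(n))
            if z == 0.0:
                continue  # an all-zero column keeps a zero coefficient
            # Correlation of column j with the residual that ignores column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            )
            beta[j] = soft_threshold(rho, lam / 2.0) / z
    return beta


# Toy data: y depends on the first column only; the second is irrelevant.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.1], [4.0, -0.2]]
y = [2.0, 4.0, 6.0, 8.0]
beta = lasso_coordinate_descent(X, y, lam=1.0)
```

On this toy input the coefficient of the irrelevant column is set exactly to zero while the relevant one is only slightly shrunk, which is the selection effect the model relies on to pick one decay value per feature.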
4.3
Simulation Using Multinomial Logistic Regression Model
Considering that we only have a limited number of businesses with over 500 reviews in total, we decided to perform a simulation based on the Multinomial Logistic Regression Model, which remedies the situation of limited data and presents a general view of the businesses as well. During the simulation, we implemented the Multinomial Logistic Regression Model with the star-ratings as the dependent variable and the B-Spline basis elements alone as the independent variables. Multinomial Logistic Regression is the extension of ordinary Logistic Regression to categorical variables with multiple categories. The predicted probability that a review receives a k-star rating (k ∈ {1, ..., 5}) can be expressed as follows:

\begin{equation}
P(Y_i = k) = \frac{e^{\beta_k\cdot X_i}}{\sum_{j=1}^{K} e^{\beta_j\cdot X_i}} \tag{10}
\end{equation}

where

\begin{align}
\beta_k\cdot X_i &= \beta_0^{(k)} + \beta_1^{(k)}x_1 + \beta_2^{(k)}x_2 + \cdots + \beta_p^{(k)}x_p \tag{11}\\
&= \beta_0^{(k)} + \sum_{i=1}^{p}\beta_i^{(k)}x_i \tag{12}\\
&= \beta_0^{(k)} + \sum_{i=1}^{p}\beta_i^{(k)}(\theta_i B_i) \tag{13}
\end{align}
where i denotes the variable number, ranging from 0 to p, and j denotes the star-rating, ranging from 1 to K = 5. This model was implemented for simulation in the current paper and returned the probabilities of the five star-ratings for each review of a business. We can thus generate a new star-rating for each review based on the cumulative probabilities computed from the results. A total of 100 simulations were performed with Multinomial Logistic Regression to generate fake (simulated) star-ratings. As the number of simulations increases for a business, we obtain a set of generated star-ratings for each review (e.g. 100 star-ratings are returned for each review if 100 simulation runs are conducted), and when the number of simulations is sufficiently large, the resulting star-ratings across all reviews of a business follow the distribution given by the cumulative probabilities (obtained from the Multinomial Logistic Regression) from which they were generated.
4.4
Lasso Regression Modeling on Simulated Data
After the simulation procedure, we performed Lasso Regression modeling on the simulated data. Since the simulation for each business was performed 100 times, returning 100 × (number of reviews) simulated star-ratings per business, a Lasso Regression model was built for each simulation run with the simulated star-ratings as the dependent variable. Particularly, for each
run of the simulation, the business reviews were shuffled and assigned new positions in the sequence, matched with the simulated star-rating at the same position, and input into the model as independent variables.
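The simulation and shuffling steps of Sects. 4.3 and 4.4 can be sketched as follows. This is one possible reading of the procedure rather than the authors' code: the linear predictors are hypothetical placeholders for the fitted B-Spline terms of Eq. (13), the function names are illustrative, and a fixed random seed is used only to make the example reproducible.

```python
import math
import random


def softmax(scores):
    """Turn K linear predictors into K class probabilities, as in Eq. (10)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_star(probs, rng):
    """Draw a star-rating in 1..K by inverting the cumulative probabilities."""
    u, cumulative = rng.random(), 0.0
    for star, p in enumerate(probs, start=1):
        cumulative += p
        if u < cumulative:
            return star
    return len(probs)  # guard against rounding when u is very close to 1


def simulate_run(review_probs, reviews, rng):
    """One run: simulated ratings per position, paired with shuffled reviews."""
    simulated = [sample_star(p, rng) for p in review_probs]
    shuffled = reviews[:]
    rng.shuffle(shuffled)
    return list(zip(shuffled, simulated))


rng = random.Random(0)
# Hypothetical linear predictors (one row per review position, K = 5 columns).
linear_predictors = [[0.1, 0.0, 0.2, 0.9, 1.2], [1.0, 0.4, 0.1, 0.0, -0.3]]
review_probs = [softmax(s) for s in linear_predictors]
reviews = ["review_a", "review_b"]
runs = [simulate_run(review_probs, reviews, rng) for _ in range(100)]
```

Each element of `runs` is one simulated dataset on which the Hawkes features and the Lasso Regression would be recomputed; because the reviews are shuffled, any remaining significant variable cannot be attributed to the true ordering of the reviews.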
5
Results
This section presents the modeling and simulation results from multiple steps. The general procedure of this study includes: 1) Multivariate Hawkes Process and Lasso Regression modeling on processed Yelp data; 2) generating simulated data through Multinomial Logistic Regression; 3) Multivariate Hawkes Process and Lasso Regression modeling on simulated Yelp data; 4) verification through Logistic Regression. We introduce these steps in detail and discuss the findings obtained from the results.
5.1
Lasso Regression Modeling on Processed Yelp Data with Hawkes Features (Variables)
The hybrid model built on the Multivariate Hawkes Point Process and the Lasso Regression Model was implemented on the Yelp review data in Python and R. All the Yelp data was collected from the public Yelp Dataset Challenge [16] of 2019 and 2020, from which a total of 1715 unique businesses with more than 500 reviews were extracted, analyzed, and matched with the corresponding review and user information; text features and interactions between features were extracted and computed as well. All the raw features were processed through the Multivariate Hawkes Point Process to obtain the variables that aggregate the influence of prior reviews. With the pre-processing through the Multivariate Hawkes Point Process and the B-Spline basis functions, we could input all the data to build a Lasso Regression Model, make predictions, and check the significance of each variable, where the significant variables indicate influence gained from prior reviews. The results are summarized in Table 1 and Fig. 1. Lasso Regression modeling selects the most effective variables among similar variables, which reduces model complexity and the likelihood of over-fitting. Table 1 provides a general view of the significance of each variable: the count of a variable is the number of businesses that obtained a significant result on that variable, regardless of its decay parameter, and the proportion is the count divided by 1715, the total number of businesses in our dataset. From the proportions one can see that the average star-rating (proportion = 0.1569) has the most significant impact on future reviews, followed by 1-star review (proportion = 0.1230), 2-star review (proportion = 0.0746), and
Table 1. Variable significance of original dataset

Variable                            Count   Proportion
1 Star                              211     0.1230
2 Star                              128     0.0746
3 Star                              123     0.0717
4 Star                              94      0.0548
5 Star                              106     0.0618
Average star-ratings                269     0.1569
Votes                               96      0.0560
Elite count                         53      0.0309
Fan count                           56      0.0327
Review count                        70      0.0408
Yelping since                       53      0.0309
Average word probability            66      0.0385
Sentiment: polarity                 91      0.0531
Sentiment: subjectivity             128     0.0746
Stars × average word probability    49      0.0286
Stars × polarity                    79      0.0461
Stars × subjectivity                117     0.0682
Fig. 1. Coefficient result of lasso regression model built on original dataset
3-star review (proportion = 0.0717), which indicates that reviews with lower star-ratings are more likely to prompt Yelp users who saw them to post new reviews. Sentiment-related variables such as sentiment subjectivity (proportion = 0.0746), sentiment polarity (proportion = 0.0531) and the interaction between sentiment subjectivity and star-ratings (proportion = 0.0682) trigger new reviews as well, since reviews with strong personal feelings or with extreme star-ratings are more infectious. The dot plot shows the direction of influence of the variables. The dots in Fig. 1 represent the non-zero coefficients obtained from the Lasso Regression Model; the coefficients of each variable are plotted together regardless of the decay parameter, and the X-axis is limited to [-15, 15] to exclude outliers for visualization purposes. A positive coefficient indicates a positive influence of prior reviews on future reviews, while a negative coefficient indicates the opposite. We observe that the majority of variables do not show a consistent direction of influence on future reviews. However, average star-ratings are highly likely to have a positive influence on future reviews, which reflects the fact that a business with a higher rating keeps attracting customers who visit and post positive reviews on Yelp, and thus improves its rating or at least keeps it unchanged. Sentiment subjectivity and sentiment polarity show similar but less pronounced trends than the average star-rating, which indicates that reviews with subjective or positive sentiment are more likely to exert a positive influence on future reviews; in other words, reviews with high star-ratings would be triggered.
All findings obtained from the basic Lasso Regression Model indicate the existence of inner relationships between prior reviews and future reviews with respect to different review aspects (variables), which motivated us to perform further analysis.

5.2 Generate Simulated Data Through Multinomial Logistic Regression
Because the number of qualified businesses is limited (1715 businesses in total), we performed a simulation based on the Multinomial Logistic Regression Model for the businesses with significant Yelp variable(s) in the result of the Lasso Regression Model. The simulation models the basic standard of each business, which helps separate the influence of the reviews themselves from that of the business; it also compensates for the small dataset and gives a general view of the businesses. Furthermore, we concentrated on businesses whose reviews are posted at high frequency in order to detect inner relationships between reviews more accurately. In total, 152 businesses with at least one significant variable (a variable with a non-zero coefficient) and over 100 reviews posted per year were selected for simulation; their significant coefficients are scattered in Fig. 2. The majority of the significant coefficients of the selected businesses are clearly positive, which indicates that these frequently reviewed businesses are more attractive than others, such that prior reviews hold a positive influence on future reviews for all the variables we considered.
Fig. 2. Coefficient result of selected businesses
We implemented the Multinomial Logistic Regression Model for simulation with the star-ratings as the dependent variable and the B-Spline basis elements alone as the independent variables. The model returns the probabilities of receiving each of the five star-ratings for every review of the business, and we then generated a new star-rating for each review from the cumulative probabilities computed from these results. One hundred simulations were performed for each selected business. As the number of simulations for a business increases, we obtain a set of generated star-ratings for each review (e.g. 500 star-ratings for each review if 500 simulations have been conducted), and the resulting star-ratings across all reviews of a business follow the distribution given by the cumulative probabilities when the number of simulations is sufficiently large (since the star-ratings were generated from it). A total of 15200 simulations were performed for model re-building in the next step.

5.3 Lasso Regression Modeling on Simulated Review Star-Ratings and Hawkes Features (Variables)
In order to obtain the basic standard of the businesses, we re-computed the decay variables for all features using the Multivariate Hawkes Point Process Model, in which the influence of past events (reviews) is quantified and aggregated on the following events. In contrast to the earlier implementation, we made two changes to this model: 1) whereas the previous processing was performed only once for each business, we re-computed the decay variables once per simulation run for each business, replacing the true star-ratings with the simulated star-ratings in each run; 2) we shuffled the reviews such that different reviews were placed in the original order corresponding to the simulated star-ratings.

Following these changes, we re-built the Lasso Regression Model on the simulated data obtained from the simulated star-ratings and the shuffled Yelp reviews processed through the Multivariate Hawkes Point Process Model. The coefficients generated by the re-built Lasso Regression Model were compared to the observed coefficients obtained from the original Lasso Regression Model to check the significance of each variable. Specifically, for each business we have 100 simulations, and the Lasso Regression Model was re-built on each of them to compute the coefficient of each variable under each value of the decay parameter; hence we have 100 simulated coefficients for each variable with a given value of the decay parameter. Each simulated coefficient was compared to the observed coefficient from the original Lasso Regression Model: the number of simulated coefficients whose absolute values are larger than or equal to the corresponding observed coefficient is divided by the total number of simulations, and the result is used as the p-value that determines the significance of the current variable with that value of the decay parameter.
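The two simulation steps described so far, drawing a star-rating from the cumulative class probabilities and turning the simulated coefficients into a p-value, can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names and input values are hypothetical, the five class probabilities are assumed to come from an already fitted multinomial model, and the observed coefficient is compared by its absolute value, which is one reading of the text.

```python
import bisect
import itertools
import random


def sample_star_rating(probs, rng):
    """Draw one star-rating (1-5) from five class probabilities.

    The draw inverts the cumulative distribution: a uniform number is
    located among the cumulative probabilities.
    """
    cumulative = list(itertools.accumulate(probs))
    # Scale by the final cumulative value in case the probabilities
    # do not sum to exactly 1 because of rounding.
    u = rng.random() * cumulative[-1]
    index = bisect.bisect_right(cumulative, u)
    return min(index, len(probs) - 1) + 1


def empirical_p_value(simulated_coefs, observed_coef):
    """Share of simulated coefficients at least as extreme as the observed one.

    Assumption: both sides are compared in absolute value.
    """
    extreme = sum(1 for c in simulated_coefs if abs(c) >= abs(observed_coef))
    return extreme / len(simulated_coefs)


# Hypothetical inputs, for illustration only.
rng = random.Random(0)
probs = [0.10, 0.05, 0.15, 0.30, 0.40]  # P(1 star), ..., P(5 stars)
ratings = [sample_star_rating(probs, rng) for _ in range(100)]

simulated = [0.0, 0.2, -0.1, 0.0, 0.6, -0.7, 0.1, 0.0, 0.3, -0.2]
p_value = empirical_p_value(simulated, observed_coef=0.5)
```

In this toy input two of the ten simulated coefficients are at least as large in magnitude as the observed one, so the p-value is 0.2. With 100 simulations per variable, the smallest attainable non-zero p-value is 0.01.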
Since we have five values of the decay parameter, we obtain five p-values for each variable, one per decay value. We therefore determined the significance of the current variable, regardless of the decay parameter, by the lowest of the five p-values: if any one of the five p-values indicates significance, we conclude that the variable is significant regardless of the other four p-values. The simulated coefficients are plotted in Fig. 3, which presents the density of the non-zero coefficients of the simulated data compared to the corresponding observed coefficients regardless of the decay parameter, with the thresholds set to −15 and 15. Because the simulated star-ratings were generated from the B-Spline basis elements, which are considered to represent the basic standard of the businesses, the simulated coefficients are scattered around 0 (red region), as we expected for the business's standard, while the corresponding observed coefficients spread out with different degrees of variation for different variables.

Fig. 3. Result of simulated coefficient for selected businesses, obtained from re-built Lasso Regression Model

5.4 Verification Through Logistic Regression Model

We performed verification on the results obtained from the Lasso Regression Model built on the original dataset (with 1715 businesses). We created a binary label for each business: a business received a label of 1 if it has at least one non-zero coefficient in the modeling result, regardless of the variable type, and a label of 0 if no non-zero coefficient exists in the result. This label was set as the dependent variable of the Logistic Regression Model. Business attributes were extracted as the independent variables of the Logistic Regression Model and were selected through multiple rounds of modeling. They include:

– Stars: the average star-rating of the business. The range of star-ratings from 0 to 5 was divided into levels with an increment of 0.5 (e.g. 0, 0.5, 1.0), and each average star-rating was rounded to the closest level (e.g. an average star-rating of 3.82 is matched to the level 4.0);
– Review count: total number of reviews received from Yelp users;
– Year count: total number of years the business has been operating, extracted from the posting times of the earliest and latest reviews of the business;
– Price range: the dollar signs presented on the home page of the business on Yelp, representing the general cost of visiting it; more dollar signs indicate a possibly higher cost;
– State: the state where the business is located;
– Attire: casual or dressy;
– Breakfast, Brunch, Lunch, Dinner, Late-night, Dessert: Boolean variables indicating whether the business provides the corresponding service.
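The construction of the dependent variable and of the Stars levels can be sketched as below. The helper names are hypothetical, and the tie-breaking rule (exact midpoints are rounded up) is an assumption, since the text only says that a rating is matched to the closest level.

```python
import math


def star_level(avg_stars, step=0.5):
    """Map an average star-rating to the closest level on a 0.5 grid.

    Assumption: exact ties are rounded up; the text does not say how
    ties are resolved.
    """
    return math.floor(avg_stars / step + 0.5) * step


def influence_label(coefficients):
    """Return 1 if any Lasso coefficient of the business is non-zero, else 0."""
    return int(any(c != 0 for c in coefficients))


level = star_level(3.82)  # the example from the text, matched to 4.0
label_yes = influence_label([0.0, -1.3, 0.0])
label_no = influence_label([0.0, 0.0])
```

Using `floor(x + 0.5)` rather than Python's built-in `round` avoids banker's rounding, which would send some exact midpoints down and others up.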
Table 2. Result of logistic regression model

Variable        Coefficient   P-Value
Stars           −0.1969       0.000
Review Count    0.0004        0.000
Year Count      −0.0397       0.002
State           −0.0505       0.176
Attire          0.1021        0.111
Price Range     0.1546        0.035
Dessert         0.0628        0.555
Late night      0.1248        0.264
Lunch           −0.0965       0.397
Dinner          −0.1368       0.353
Brunch          −0.0912       0.453
Breakfast       0.2446        0.102
The result of the Logistic Regression Model is summarized in Table 2. The star-rating of the business is significant with a coefficient of −0.1969, which indicates that past reviews of a business with a lower average star-rating are more likely to influence its future reviews. Furthermore, a lower average star-rating results from accumulated reviews with low star-ratings, which agrees with the finding from Table 1 that 1-star, 2-star and 3-star ratings are the variables with significant influence on future reviews. Review count is significant, which indicates that a business that attracts relatively more customers to visit and post reviews holds influence on its future reviews. Year count is also significant, but with a negative coefficient, from which we could infer that businesses with relatively long-term operation have more reviews posted at the early stage of Yelp's development, which carry less influence due to the limited number of users at that time.
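To read the size of these effects, a logistic-regression coefficient can be converted to an odds ratio by exponentiating it. The sketch below does this for the three variables discussed above, using the values from Table 2. It assumes each coefficient is expressed per one unit of its variable (one star, one review, one year), which the text does not state explicitly.

```python
import math

# Coefficients copied from Table 2 (variables discussed in the text).
coefficients = {
    "Stars": -0.1969,
    "Review Count": 0.0004,
    "Year Count": -0.0397,
}

# exp(coefficient) is the multiplicative change in the odds that a
# business has at least one non-zero Lasso coefficient (label = 1).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.4f}")
```

Under this assumption, one additional star multiplies the odds of label 1 by roughly 0.82, that is, lowers them by about 18%, while each additional review raises them very slightly.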
6 Discussion
Before conducting the experiments, we assumed that there is influence/dependency between past and future reviews. To verify this, we extracted features through the Hawkes Process Model to aggregate the possible influence on each review from all its prior reviews, and built the Lasso Regression Model on these Hawkes features (variables). The results of our proposed method show that influence between reviews does exist. Furthermore, it is revealed through multiple aspects of a review, including low star-ratings (which was investigated and found by prior research), sentiment scores, and interactions between sentiment scores and star-ratings. These findings advance the existing methodology and open the possibility for further analysis.

The limitations of the current paper lie in several aspects. First, the businesses in the public Yelp dataset were partially selected such that most are located in Las Vegas and Phoenix, which may bias the results and make the conclusions inapplicable to businesses elsewhere. Second, the filtered businesses came from the 2019 and 2020 Yelp datasets, whose year ranges are inconsistent: businesses in the 2019 dataset lack the information for year 2020. In addition, the number of simulations was set to 100, which could be increased to reach a more accurate and reliable result. Based on these limitations, the research can be improved in multiple directions: applying a larger and more general business dataset; increasing the number of simulations; and, in particular, performing further analysis to explore specific relationships between prior reviews and future reviews with respect to different aspects (e.g. how a review feature would specifically affect other features).
7 Conclusion
In this study, we analyzed Yelp data to investigate the influence of prior reviews on future reviews of a restaurant. Review features from multiple aspects were extracted and processed through the Hawkes Process Model to aggregate influence from prior reviews, and were fed into the Lasso Regression Model along with B-Spline basis functions as the baseline of the restaurant. The basic results showed that such review influence does exist and can be found in multiple aspects of a review, such as sentiment score and star-ratings. These findings were also supported by the simulation, and were partially verified through the Logistic Regression Model.
8 Appendix

8.1 B-Spline Basis Function

The basic framework of the B-Spline curve was created by Schoenberg in 1946 [17], and it has since been adapted to different applications, such as modeling 3-D geometric shapes or interpolating fluctuating data points for smoothing purposes. A k-order B-Spline curve is a linear combination of control points $P_i$ and B-Spline basis functions denoted $N_{i,k}(t)$; each control point is associated with a basis function, and the basis functions satisfy the recurrence relation

$$
\begin{aligned}
N_{i,k}(t) &= N_{i,k-1}(t)\,\frac{t - t_i}{t_{i+k-1} - t_i} + N_{i+1,k-1}(t)\,\frac{t_{i+k} - t}{t_{i+k} - t_{i+1}} \\
N_{i,1}(t) &= \begin{cases} 1 & \text{if } t_i \le t \le t_{i+1} \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
\tag{14}
$$

The shape of the B-Spline basis functions is determined by the knot vector:

$$
T = (t_0, t_1, \ldots, t_{k-1}, t_k, t_{k+1}, \ldots, t_{n-1}, t_n, t_{n+1}, \ldots, t_{n+k})
\tag{15}
$$

The number of elements of the knot vector is the sum of the number of control points and the order of the B-Spline curve (n + k + 1).
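The recurrence in Eq. (14) translates directly into a short recursive function. The Python sketch below is illustrative only: the function name and the uniform knot vector are hypothetical, a term with a zero denominator is treated as zero (the usual convention for repeated knots), and the order-1 case uses the half-open interval t_i ≤ t < t_{i+1} rather than the closed interval written in Eq. (14), so that exactly one order-1 function is active at an interior knot.

```python
def bspline_basis(i, k, t, knots):
    """Value of the order-k basis function N_{i,k}(t) by the recurrence."""
    if k == 1:
        # Assumption: half-open interval, see the note above.
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0

    left_den = knots[i + k - 1] - knots[i]
    right_den = knots[i + k] - knots[i + 1]

    left = 0.0
    if left_den != 0:
        left = (t - knots[i]) / left_den * bspline_basis(i, k - 1, t, knots)

    right = 0.0
    if right_den != 0:
        right = (knots[i + k] - t) / right_den * bspline_basis(i + 1, k - 1, t, knots)

    return left + right


knots = [0, 1, 2, 3, 4, 5, 6]   # hypothetical uniform knot vector
order = 3                        # order 3, i.e. quadratic pieces
n_basis = len(knots) - order     # number of basis functions (control points)
values = [bspline_basis(i, order, 2.5, knots) for i in range(n_basis)]
```

Inside the valid parameter range the basis functions sum to 1, which is a convenient sanity check for any implementation; here the seven knots and order 3 give four basis functions, matching the count rule stated above.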
References

1. Pentina, I., Bailey, A.A., Zhang, L.: Exploring effects of source similarity, message valence, and receiver regulatory focus on Yelp review persuasiveness and purchase intentions. J. Mark. Commun. 24(2), 125–145 (2018)
2. Parikh, A., Behnke, C., Vorvoreanu, M., Almanza, B., Nelson, D.: Motives for reading and articulating user-generated restaurant reviews on Yelp.com. J. Hosp. Tour. Technol. (2014)
3. Luca, M.: Reviews, reputation, and revenue: the case of Yelp.com (March 15, 2016). Harvard Business School NOM Unit Working Paper (12-016) (2016)
4. Vinson, D.W., Dale, R., Jones, M.N.: Decision contamination in the wild: sequential dependencies in online review ratings. Behav. Res. Methods 51(4), 1477–1484 (2018). https://doi.org/10.3758/s13428-018-1175-8
5. Hu, N., Liu, L., Zhang, J.J.: Do online reviews affect product sales? The role of reviewer characteristics and temporal effects. Inf. Technol. Manag. 9(3), 201–214 (2008)
6. Asghar, N.: Yelp dataset challenge: review rating prediction. arXiv preprint arXiv:1605.05362 (2016)
7. Xu, Y., Wu, X., Wang, Q.: Sentiment Analysis of Yelp's Ratings Based on Text Reviews (2014)
8. Kc, S., Mukherjee, A.: On the temporal dynamics of opinion spamming: case studies on Yelp. In: Proceedings of the 25th International Conference on World Wide Web, pp. 369–379 (2016)
9. Rahimi, S., Andris, C., Liu, X.: Using Yelp to find romance in the city: a case of restaurants in four cities. In: Proceedings of the 3rd ACM SIGSPATIAL Workshop on Smart Cities and Urban Analytics, pp. 1–8 (2017)
10. Hawkes, A.G.: Spectra of some self-exciting and mutually exciting point processes. Biometrika 58(1), 83–90 (1971)
11. Freed, A.M.: Earthquake triggering by static, dynamic, and postseismic stress transfer. Ann. Rev. Earth Planet. Sci. 33, 335–367 (2005)
12. Mishra, S., Rizoiu, M.A., Xie, L.: Feature driven and point process approaches for popularity prediction. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1069–1078 (2016)
13. Pinto, J.C.L., Chahed, T., Altman, E.: Trend detection in social networks using Hawkes processes. In: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pp. 1441–1448 (2015)
14. Wang, Y., Du, N., Trivedi, R., Song, L.: Coevolutionary latent feature processes for continuous-time user-item interactions (2016)
15. Porter, M.: Multivariate Hawkes point process models for social systems. In: Proceedings of the 62nd World Statistics Congress of the International Statistical Institute (2017)
16. Yelp open dataset. https://www.yelp.com/dataset/, Accessed 20 Feb 2020
17. Schoenberg, I.J.: Contributions to the problem of approximation of equidistant data by analytic functions. In: IJ Schoenberg Selected Papers, Birkhäuser, Boston, MA, pp. 3–57 (1988)
18. TextBlob: Simplified Text Preprocessing. https://textblob.readthedocs.io/en/dev/, Accessed 23 Jan 2021
AIRM: A New AI Recruiting Model for the Saudi Arabia Labor Market

Monirah Ali Aleisa(B), Natalia Beloff, and Martin White

University of Sussex, Falmer, Brighton BN1 9RH, UK
{Ma989,n.beloff,m.white}@sussex.ac.uk
Abstract. One of the goals of Saudi Arabia's Vision 2030 is to keep the unemployment rate at the lowest level to empower the economy. Research has shown that a rise in unemployment has a negative effect on any country's gross domestic product. Artificial Intelligence is the fastest developing technology these days and has served many specialties. Recently, Artificial Intelligence technology has shone in the field of recruiting, and researchers are working to apply its capabilities in many applications that help speed up the recruiting process. However, having an open labor market without a coherent data center makes it hard to monitor, integrate, analyze, and build an evaluation matrix that helps reach the best match of job candidate to job vacancy. A recruiter's job is to assess a candidate's data to build metrics that help them choose a suitable candidate. Job seekers build their own metrics to compare job offers and choose the opportunity that best fits their preferences. This paper addresses how Artificial Intelligence techniques can be effectively exploited to improve the current Saudi labor market. It aims to decrease the gap between recruiters and job seekers. This paper analyzes the current Saudi labor market and then outlines an approach that proposes: 1) a new data storage technology approach, and 2) a new Artificial Intelligence architecture with three layers to extract relevant information from the data of both recruiters and job seekers by exploiting machine learning, in particular clustering algorithms to group data points, natural language processing to convert text to numerical representations, recurrent neural networks to produce matching keywords, and equations to generate a similarity score. We have completed the analysis of the current Saudi labor market, and a proposal for the Artificial Intelligence and data storage components is articulated in this paper. The proposed approach and technology will empower the Saudi government's immediate and strategic decisions by providing comprehensive insight into the labor market.

Keywords: Recruiting · Job seeker · Artificial Intelligence · Natural language processing · Clustering algorithms · Recurrent neural network
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 105–124, 2022. https://doi.org/10.1007/978-3-030-82199-9_8

1 Introduction

Recruiting is critical to any organization's success. It is a challenging task with a long process. Nowadays, recruiters must use an effective HR (human resources) system to select the most appropriate candidates from a pool of applicants, with a focus on employee retention as well [1]. There are several features that recruiters should consider when making a decision, and these features build matrices that can give an indicator value. Features include, for example, the number of vacant positions, the number of applications required, the level of education, knowledge, skills, years of experience, abilities, preferences, values, and more [2].

On the other hand, many job seekers join the labor market, from new graduates to those looking for better or different opportunities. As the market is frequently changing, job seekers are often looking for better positions. Job seeking is a self-regulated mission whose activities involve building a CV, finding a career path, finding the best place to apply to, preparing for an interview, comparing job offers, and more. All these activities contain features that build matrices that can give an indicator value. Most research focuses on helping recruiters find the best candidates, while much less attention has been given to the job seekers' side.

It is concluded from Okun's Law that the rise in unemployment in the US affects the gross domestic product (GDP). According to Okun's Law, a one percent increase in unemployment causes a 2% fall in GDP [3]. According to the General Authority for Statistics, the unemployment rate among Saudis in Saudi Arabia for the second quarter of 2019 was 12.3%. This is a high percentage for a rich, developing country whose population does not exceed 20 million. The importance of helping to reduce unemployment is clear here, and the Saudi Vision 2030 addresses this issue. Saudi Vision 2030 is a plan that was announced on April 25, 2016, which coincides with the date set for announcing the completion of the handover of 80 government projects. The plan was organized by the Council of Economic and Development Affairs and is jointly carried out by the public, private and non-profit sectors [4].

The Kingdom of Saudi Arabia is making tremendous efforts in this field. However, there are many obstacles in the Saudi labor market, each with different dimensions. It is not within the scope of this paper to discuss them all, but it is helpful to review some of them to get an overview. Examples include market regulations, the wage gap between national and expatriate workers, the concentration of employed nationals in the public sector, and the shortage of well-paying and productive jobs for the young and growing population; these are just some of the obstacles [5]. To overcome such obstacles to efficient recruitment, a national data center that stores and integrates data and incorporates Artificial Intelligence (AI) models to accurately analyze and predict trends could be very useful. To test this hypothesis, this paper proposes an AI model as a matching engine that serves both recruiters and job seekers, using labor market big data. The model will utilize text mining models, natural language processing, and clustering algorithms to extract and analyze relevant information that is integrated, from different sources, into the national data lake (DL). This approach will allow mapping between job supply and demand in the labor market, which is too complex to achieve manually.

This paper is organized as follows. Section 1 provides a brief description of Saudi Arabia's labor market. Section 2 highlights the DL (i.e. the data repository) and the AI model's techniques that will be used in the proposed solution: Clustering Algorithm, Natural Language Processing, and Recurrent Neural Network. Section 3 describes the solution proposed. Finally, Sect. 4 presents the conclusions of this study.
2 Background Research

A labor market can be defined as a mapping process that forms a mechanism to match demand with supply for employers and employees [6]. The term 'labor market' is criticized by the US Secretary of Labor, who thinks that this term regards employees as being bought and sold like any other product, whereas they are unique in several ways. The term labor market refers to supply and demand in this field. Employees sign clear contracts with specified tasks and responsibilities. The contract covers the employee's rights, salary, allowances, and detailed constraints binding both parties [7]. Historical transformations in the labor market have generally been due to changes in labor conditions, and some factors have dramatically changed the characteristics and nature of the Saudi labor market in the past few decades, as they have in both advanced and developing countries. Several jobs have disappeared while new jobs have become available; some are novel jobs that did not exist until a few years ago. Also, the quantity and quality of skills and qualifications in demand have changed dramatically because of this new labor market, which might influence its nature.

The research outlined in this paper first analyzes the Saudi labor market in depth, trying to understand the Saudi labor market, the national projects that currently support it, its growth, the current recruiting system, and the newly established Saudi data and AI authority. This research poses some interesting questions. With all the projects that the government has implemented, why did the situation not improve significantly? How can AI be effectively exploited to improve the current Saudi labor market? What is the best repository to store national data? Which machine learning techniques (i.e. clustering algorithms) can be used to build the model? Finally, how should the new approach be structured?

To try to answer these questions, we next discuss in more depth the Saudi labor market, Saudi national projects for the labor market, Saudi labor market growth, the current recruiting system in Saudi Arabia, and the Saudi data and AI authority. This leads us to propose a new data repository architecture and AI model for matching Saudi job seekers to the Saudi labor market, see Sect. 2 Data Lake and Algorithms.

Saudi Labor Market. The Kingdom of Saudi Arabia is currently witnessing an unprecedented economic transformation, which has affected all government activities and may have the effect of creating new jobs and bringing more Saudi women into the labor market, bearing in mind that one of the economic objectives of the Saudi Vision 2030 is to reduce foreign remittances [8]. The rapid revolution in the labor market might take an unexpected road, where changes in the skills needed for current jobs coincide with the emergence of new jobs; this requires an effective method for monitoring skills, which has not been available until now [9]. The Saudi labor market relies heavily on foreign workers, especially in the private sector, for two reasons. The first reason is the massive demand for workers in the oil sector. The second reason is the size of the Kingdom of Saudi Arabia, which needs large infrastructure projects that require temporary workers who work only for the duration of the project and therefore do not provide secure employment opportunities for Saudis. With so many such workers, Saudi Arabia has two distinct labor markets with different characteristics, one for Saudis and the other for foreign workers [10].
In general, Saudi Arabia's labor market is divided into a government sector, which follows the General Organization for the Retirement System, and a private sector, whose pension system is subject to the social security system. The International Labor Organization (ILO) has defined the underemployed as individuals who are searching for work and available to work more hours, but who worked fewer hours than their capacity and/or willingness to work [11]. Employment for Saudis is at the forefront of the discussion of the Kingdom's economic policies. However, much research is still needed to identify solutions that ensure adequate and sustainable jobs for Saudi citizens. Labor market developers aim to find jobs in the private sector for citizens, where the significant difference in both labor rights and the cost of employment between nationals and foreign workers means that employers always favor foreign workers. Stephen points out that the employer's perspective is often missing in the discussion of Saudization¹, and that it should be seriously analyzed to identify policies that work on the ground, rather than leading to employers' evasion of labor market regulations through 'delusional work' and other fraud techniques [12].

Abdul Hamid Al Omari, an economic specialist for a Saudi financial agency and one of the well-known economic writers in Saudi Arabia, reviewed an analytical study of the Saudi labor market and explained that there is no officially documented information from government-approved data sources. The conflicting information about Saudi Arabia's labor market is due solely to the multiplicity of semi-official and natural bodies dealing with the employment situation. Additionally, he summarized the most important characteristics of the labor market in Saudi Arabia, which were developed by the Labor Force Council, as follows:

• The lack of adequate data on the Saudi labor market, employment, and unemployment.
• The largest age group in the Saudi population is children and adolescents. This is reflected in the increase in the working-age population.
• Saudi women's contribution to the labor market is low, almost 6.0%. Although a large proportion of Saudi women applicants have university qualifications, there are limited opportunities available to Saudi women.
• The lack of relevance of the current education system to modern developments in Saudi society, and the imbalance in its structure and curricula, which have been revealed by monitoring its scientific level.

From the above points, the first and most important reason for unemployment among the population is the superficial education level. In addition, there is a lack of training, or only low levels of it, both before and during employment, which may cause staff to lose their jobs, thus contributing to the rise in unemployment. There are massive differences among foreign workers in terms of qualifications, skills, and professions: a large proportion of foreign labor is unskilled, and these workers hold low-paid occupations that do not require any scientific or technical skills. To illustrate, 79.6% of foreign workers fall into this category; only 20.4% are taking up professional occupations [13].

¹ Saudization is the newest policy of the Kingdom of Saudi Arabia implemented by its Ministry of Labor and Social Development, whereby Saudi companies and enterprises are required to fill up their workforce with Saudi nationals up to certain levels.
Saudi National Projects for the Labor Market. The problems of unemployment and of employment with inappropriate qualifications are not new. Solutions have always been temporary; although the government has implemented many programs to solve these problems, the situation has not improved significantly. The national programs are shown in Table 1. These programs include the HADAF program, one of the mechanisms that contribute to the provision of qualified Saudi cadres, whereby trained, educated young people of both sexes achieve strategic goals, which has social and economic benefits and provides security. Capacity services are focused on the functional linkage between job seekers and private sector employees. The needs of the labor market can be addressed through several channels, such as the site for e-recruitment and the re-training and employment centers, as well as a system to protect wages and achieve Saudization, the incentive program, the Human Resources program, and the National Labor Observatory Portal [8]. The National Labor Observatory portal is part of the initiatives to stimulate the private sector to expand Saudization. It is also one of the critical national initiatives contributing to improving and developing the market and supporting decision-makers as part of the National Transformation Program, one of the Saudi Vision 2030 initiatives developed to serve its people [14]. The most prominent current products of the National Labor Observatory are indicators of the Saudi labor market, its definition, its formulation, and its participants' characteristics. These characteristics include the private sector's social insurance, mobility and job stability, graduates' employment, and the establishments that are subject to NETAKAT.

NETAKAT and TAQAT are two initiatives of the Saudi Ministry of Labor. NETAKAT evaluates establishments operating in the Saudi market. Establishments are classified into four ranges: platinum, green, yellow, and red. The ranges are applied to two main categories: enterprises with fewer than ten employees and enterprises with more than ten employees. In the first category, employing only one citizen gives an enterprise the advantages of the initiative, while the second category requires different settlement rates. The TAQAT program, on the other hand, offers a range of specialized services provided by the Human Resources Development Fund (HRDF) to support job seekers by providing a searchable database of citizen job seekers from which suitable candidates can be chosen [14]. The Saudi Ministry of Labor also provides many other programs and initiatives that ensure a smooth structural transformation in the composition of the Saudi labor market, including developing market control mechanisms, combating concealment and deporting offenders, developing remittance systems, and protecting wages [10].

The government is pressing to overcome labor market problems by launching all these programs and integrating them. However, full data integration across all of these programs is an obvious problem, which could be solved by developing a national DL to serve as a national repository. Table 1 illustrates some of the overlapping characteristics of the existing programs and highlights the need for a national repository.
M. A. Aleisa et al. Table 1. Saudi national projects for the labor market
The Program DAF
Training √
Funding √
HAFEZ
Evaluate the workplace
√
TAQAT NETAGAT
Job search
√ √
√
Saudi Labor Market Growth. The statistics in this research were obtained from the General Authority for Statistics (GSTAT) and show a significant gap between Saudis and non-Saudis. Out of 9,093,773 employees registered with the General Organization for Social Insurance (GOSI) in Saudi Arabia, 7,157,265 are non-Saudis; only 21.29% are Saudis. These figures are not confined to employees with low qualifications or insufficient education levels, as they cover the professions, see Table 2. Out of 923,504 Saudi job seekers, 549,851 hold Bachelor’s degrees, see Table 3.
Table 2. Education levels Education levels Lawyers Directors and business managers Specialist professionals Technical and humanitarian staff Professional technicians Clerical occupations Retail staff Service occupations Skilled agricultural occupations Animal husbandry & fishing Industrial occupations Chemical operations Occupations that support basic engineering
The 2,371,390 non-Saudi employees in domestic household occupations are not considered in this paper because of their low pay and low education levels [15].
AIRM: A New AI Recruiting Model for the Saudi Arabia Labor Market
Table 3. Education status and Saudi job seekers
Education status          Saudi job seekers
Illiterate                20,592
Can Read & Write          7,637
Primary                   42,860
Intermediate              44,719
Secondary or equivalent   240,981
Diploma                   9,423
Bachelor’s degree         549,851
Higher/Diploma            5,082
Doctorate                 228
Not specified             2,131
Total                     923,504
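The two headline shares quoted in this subsection can be re-derived from the raw counts. The short sketch below only repeats that arithmetic with the figures given in the text and in Table 3; the variable and function names are illustrative and not part of the original study.

```python
# Counts quoted in the text (GOSI employees) and in Table 3 (job seekers).
total_gosi_employees = 9_093_773
non_saudi_employees = 7_157_265
saudi_job_seekers = 923_504
bachelor_job_seekers = 549_851


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Saudis are the employees who are not counted as non-Saudis.
saudi_share = share(total_gosi_employees - non_saudi_employees,
                    total_gosi_employees)
bachelor_share = share(bachelor_job_seekers, saudi_job_seekers)

print(f"Saudi share of GOSI employees: {saudi_share}%")
print(f"Saudi job seekers holding a Bachelor's degree: {bachelor_share}%")
```

The first value reproduces the 21.29% stated above; the second shows that roughly six in ten Saudi job seekers hold a Bachelor’s degree.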
From our analysis of the Saudi labor market, we conclude that the market is significantly unbalanced, with non-Saudis dominating it. In this paper, we do not try to find the reasons for the problem or its historical accumulation; our focus is to build a model that matches candidates with jobs and produces a more Saudi-oriented labor market trend, thus supporting the Saudi government’s Saudization policies.
The Current Recruiting System in Saudi. Recruiting data comes from the General Authority for Statistics (GSTAT), which collects its data from the General Organization for Social Insurance (GOSI), the Ministry of Labor and Social Development (MLSD), the Human Resources Development Fund (HRDF), TAQAT, the Ministry of Civil Service (MCS), the National Information Center (NIC), the Ministry of Education, King Saud University, and public organizations for technical and vocational training. The data does not include employees in the military sectors or those not registered in the records of GOSI and MCS. Further, the GOSI and MCS data is preliminary. It is worth mentioning that there is currently no fully comprehensive link between these entities, as the data is collected from the official entities in paper format. There is a mismatch in official data on employment in the private sector, which comes from two primary sources: the Department of Statistics and Information and the Ministry of Labor. The inconsistency can be observed in the ratio of Saudization in the private sector: for 2014 it was 22.1% according to the Department of Statistics, but no more than 15.5% for the same period according to the Ministry of Labor.
For ongoing correction of the employment situation in the Kingdom, a statement shared by the Department of Statistics and the Ministry of Labor was issued in February to confirm that Saudi General statistics are the primary source of employment statistics [16]. This mismatch is one of the main reasons for the lack of economic indicators of employment and unemployment in the market, which
helps to monitor the performance of the labor market and informs economic policies that aim to correct the current situation. Due to the Covid-19 pandemic, the unemployment rate among Saudis rose to 15.4% in the second quarter of 2020, an increase of 3.1 percentage points over the same period of the previous year. The overall unemployment rate (for Saudis and non-Saudis) increased to 9.0%, up 3.4 percentage points from the second quarter of 2019, and the results of the labor force survey were significantly affected by the effects of the Covid-19 pandemic on the Saudi economy. The total labor force participation rate (for Saudis and non-Saudis) was 59.4% during the second quarter of 2020 [17]. Currently, the methods of job search are as follows: prospective employees apply directly to an employer, filling in and sending an employment application form by post or electronically, or they ask friends and relatives about job opportunities. Answering published advertisements for official jobs requires registration with the Ministry of Civil Service. People can also register with private employment offices or start a private business; to do the latter, they must apply for a permit or license [15]. Promotion in the public sector depends on age and length of service [16]. This suggests that the Saudi labor market is an open market, in that there are highly diverse ways to apply for jobs in both the government and private sectors. The HADAF initiative program has had some side effects, including the ability of companies to manipulate the system, which may explain the lack of full commitment by the employees of the institution, in two ways: 1. Employing Saudis temporarily is one way companies circumvent the system and avoid recruitment restrictions on expatriate workers.
They improve their image by hiring many Saudis when they need to hire foreign employees or update their work visas. The program attempts to prevent this circumvention by requesting the workers’ records at an average of 12 working weeks for Saudi employees. However, there are still reports of companies employing many Saudis with low salaries over a short period. 2. Reducing size to avoid the required Saudization ratios. One way to avoid the penalty is to reduce the number of workers in the company to less than ten to be listed outside the program ranges [18]. Al Omari discusses how companies in Saudi Arabia need to find ways to continue to attract experienced and expatriate Saudi talent to maintain their success. He also suggests companies need to design a work environment desired by employees and offer a more extended period. He further reflects that the Saudi government needs to develop a broader conception of market characteristics and focus on them when analyzing and studying the labor market in Saudi Arabia. Finally, his work is focusing attention on analyzing and delving deeply into the various aspects of these critical characteristics of the Saudi labor market that will help those interested in the market overcome the obstacles of Saudization [13]. Labor market analysts see that labor market problems in Saudi Arabia are concentrated on the dependence on foreigners in the private sector, where the unemployment rate was 12% in 2017 [8].
The imbalance of the labor market in the private sector, which is characterized by its heavy reliance on low-paid expatriate workers (about 80% of the workforce is low-skilled workers with primary or lower education), has exacerbated unemployment among Saudi youth and has contributed to reducing the efficiency and productivity of the private sector. This kind of employment cannot significantly contribute to the transfer of knowledge and the competitiveness of the economy as it depends on its productivity on physical exertion [10]. Unemployment is one of the economic indicators of labor market performance, and it affects families by the loss of their purchasing power, and the nation generally loses their contribution to the economy. Unemployment is also a driver of migration patterns. The problem of poverty and unemployment have always been critical obstacles to economic development [19]. The main conclusion drawn about the impact of the labor market problem on the economy is that the high unemployment rate in countries that are weak in economic growth is not surprising, but it is not expected to occur in a rich country with profitable economic growth like the Kingdom. Therefore, solutions urgently need to be found; as such this paper proposes a model to solve some of these problems. Restructuring the Saudi economy is a long-term strategic development objective, but it cannot be done in isolation by reforming the labor market only, especially in the private sector, which relies heavily on low-wage and low-skilled expatriate labor. This excessive dependence on this type of employment reduces opportunities for the development of the Saudi economy structure. It also reduces the provision of job opportunities for Saudi citizens. It is understandable that unemployment rates are high in countries that are weak in economic growth, but it is not expected to occur in a rich country with profitable economic growth like the Kingdom. 
The Saudi economy is heavily dependent on the oil sector, which does not provide sufficient employment opportunities, nor do other sectors directly related to the petrochemical industry. Also, as the government sector cannot absorb this large number of job seekers, the private sector is the only sector whose growth can be reflected in the creation of more jobs. Private sector growth in the last ten years has been around 9.7%. Naturally, this growth has been accompanied by an increase in the number of jobs. Indeed, the number of job opportunities in this sector increased by approximately 83%, but Saudi citizens filled only 17% of the private sector posts, and expatriate workers acquired the rest. The radical solution is to obtain accurate data that can be integrated from all sources and analyzed. In this respect, the government of the Kingdom of Saudi Arabia has made tremendous efforts towards digitalization, as it has established an authority called the Saudi Data and Artificial Intelligence Authority (SDAIA).
Saudi Data and Artificial Intelligence Authority. The Saudi Data and Artificial Intelligence Authority (SDAIA) is a new establishment in Saudi Arabia, started in 2019. It supports the achievement of the Kingdom’s Vision 2030 and aims to unleash the Kingdom’s capabilities and build a data-based economy. SDAIA works to regulate the data sector and enable innovation and creativity through three arms: the National Data Management Office, the National Information Center, and the National Center for Artificial Intelligence. Its mission is to unlock the latent value of data as a national wealth to achieve the aspirations of Vision 2030 by defining the strategic direction for data and AI and supervising
its achievement through data governance, providing data-related and forward-looking capabilities, and enhancing them with continuous innovation in the field of AI [20]. The National Data Management Office in SDAIA is building a National Data Bank, which will regulate the ingestion of the data streams flowing from all government agencies. The aim is to harness the power of data, which opens many opportunities and gives a clear national agenda for solving many problems. Using data will pave the way for innovation and achievements, and well-managed data will become a valuable source of wealth not only for the Kingdom but also for the world [20]. DLs are emerging as an increasingly popular solution for Big Data at the enterprise level [24]. They have significant advantages over traditional data warehouses: data scientists, data analysts, and data engineers can access data much more easily and quickly than would be possible in a traditional data warehouse, which increases agility and gives them more opportunities for exploration and proof-of-concept activities.
3 Data Lake and AI Model
Getting all the data in one place supports integrating it in a way that allows data engineers to clean it and data scientists to analyze it and apply ML algorithms to it. This section first provides background on DLs and their utilization. Second, it highlights the fundamentals of the ML algorithms needed to build this AI model, such as clustering and NLP.
3.1 Data Lake
James Dixon first mentioned the concept of a Data Lake (DL) as a data repository in 2010. He stated that a DL manages raw data as it is ingested from multiple data sources and does not require cleansed or structured data [21]. A DL is a daring new approach that harnesses the power of big data technology. It is “A methodology enabled by a massive data repository based on low-cost technologies that improve the capture, refinement, archival, and exploration of raw data within an enterprise” [22]. Data are stored in the DL in their original format, whether structured, unstructured, or multi-structured. Once data are placed in the lake, they are available for analysis [23]. DLs are often described in the literature with the characteristics listed in Table 4.
Table 4. Key characteristics of a DL from a technical and business perspective
DL characteristics
Business requirements
• DLs are essential for companies and businesses, giving them a competitive advantage in the data storage domain. A distinct characteristic is that DLs attract more attention from business fields than from academic research fields [23]
• A capability whereby a business can get raw (i.e. unchanged) data from different source systems in an enterprise, readily available for analysis [25, 26]
• A relatively new concept whose definitions, characteristics, and usage are currently more prevalent in web articles than in academic papers
• DLs are built on the concept of early ingestion and late processing, and should be integrated with the rest of the enterprise’s IT infrastructure
Technology requirements [24]
• DLs are a collection of technologies that serve the need for a central data repository
• DLs serve as a cost-effective place to conduct a preliminary analysis of data [27]
• DLs are created to handle large and fast-arriving volumes of unstructured or semi-structured data for further dynamic analytical insights
• DL data is accessible as soon as it is created in the DL
• DLs require maintaining the order of data arrival
• DLs should be flexible and task-oriented: data structuring should be implemented only where and for what it is necessary, as the DL outflow is the analyzed data
• DLs should handle SQL, NoSQL, OLAP, and OLTP
• DLs have a flat architecture, where each data element has a unique identifier and a set of extended metadata tags
• A DL forms a vital component of an extended analytical ecosystem
• There should be different possibilities for splitting data in the DL, i.e. DLs can be partitioned by lifetime or by type of data
• For DLs partitioned by type: raw data, augmented daily data sets, and third-party data
• For DLs partitioned by lifetime: data that are less than six months old, older but still active data, and archived data [24]
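Two of the technology requirements in Table 4, the flat architecture (a unique identifier plus metadata tags per data element) and partitioning by lifetime, can be sketched in a few lines. This is a minimal illustration only: the class name `LakeRecord`, the tag keys, and both time windows are assumptions made for the example, since the table does not define what "still active" means.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, Optional
from uuid import uuid4

# Assumed cut-offs: Table 4 only names the three lifetime classes.
RECENT_WINDOW = timedelta(days=183)   # "less than six months old"
ACTIVE_WINDOW = timedelta(days=365)   # hypothetical meaning of "still active"


@dataclass
class LakeRecord:
    """One raw data element in a flat DL: payload, unique ID, metadata tags."""
    payload: bytes                                   # stored exactly as ingested
    ingested_on: date
    tags: Dict[str, str] = field(default_factory=dict)
    last_accessed: Optional[date] = None
    record_id: str = field(default_factory=lambda: uuid4().hex)


def lifetime_partition(record: LakeRecord, today: date) -> str:
    """Assign a record to one of the three lifetime partitions."""
    if today - record.ingested_on < RECENT_WINDOW:
        return "recent"
    if record.last_accessed and today - record.last_accessed < ACTIVE_WINDOW:
        return "active"
    return "archived"


today = date(2021, 6, 1)
cv = LakeRecord(b"raw CV bytes", date(2021, 3, 1), {"type": "cv"})
log = LakeRecord(b"raw log bytes", date(2019, 1, 15), {"type": "transaction_log"},
                 last_accessed=date(2021, 5, 20))
survey = LakeRecord(b"raw survey bytes", date(2018, 7, 1), {"type": "document"})

for rec in (cv, log, survey):
    print(rec.record_id[:8], rec.tags["type"], lifetime_partition(rec, today))
```

Because the payload is kept as raw bytes, nothing about its structure has to be decided at ingestion time, which matches the "early ingestion, late processing" principle in the business column of the table.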
To ensure the effectiveness of a DL architecture, you must keep the following in mind while building and storing data. See Table 5.
Table 5. DL architecture build characteristics and data types.
Architecture build characteristics
• Capability to expand to a very large size
• Flexible policies and governance should be developed according to the need for identification, retention, and data disposition
• DL governance should include an application framework for:
  • Contextualizing data
  • Advanced metadata management
  • Centralized indexing
  • Considering the relations between stored data
  • Keeping track of data usage
• Fully shareable and accessible data
• Shared access is simultaneous
• Access from any device to support the mobile workforce
• Agile analytics
Data types [24]
• Transaction logs
• Sensor data
• Social media
• Document collections
• Geo-location
• Images
• Video
• Audio
A typical DL architecture usually consists of three layers: a data source layer, a processing and storage layer, and a visualization (target) layer, see Fig. 1.
Fig. 1. High-Level DL architecture [28]
There are two well-known types of DLs: a logical DL and a public DL. The source layer can consist of homogeneous sources, which have similar data types or structures and are easy to join and consolidate, and/or heterogeneous sources, which have different data formats and structures. An extract, transform, and load (ETL) process is needed to aggregate the raw data from the sources. The data processing layer comprises the data store, the metadata store, and replication for high availability; it is designed to support the security, scalability, and resilience of the data, and proper business rules and configurations are maintained through administration. The DL’s target (visualization) layer receives data from the processing layer through an API layer or connectors [29]. Several reference DL architectures are now being proposed to support the design of big data systems. One possible architecture, based on Microsoft technology, is shown in Fig. 2.
Fig. 2. High-level Microsoft technology-based DL architecture [29].
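The ETL step that sits between heterogeneous sources and the processing layer can be illustrated with a small, self-contained sketch. The two sources, their field names, and the target schema are hypothetical; the point is only to show raw records with different formats and structures being consolidated into one shape before they are handed on.

```python
import csv
import io
import json

# Two hypothetical sources with different formats and field names.
CSV_SOURCE = "seeker_id,degree,city\n101,Bachelor,Riyadh\n102,Diploma,Jeddah\n"
JSON_SOURCE = '[{"id": 103, "education": "Doctorate", "location": "Dammam"}]'


def extract(csv_text: str, json_text: str) -> list:
    """Read the raw rows from both sources without changing them."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.extend(json.loads(json_text))
    return rows


def transform(row: dict) -> dict:
    """Map either source's field names onto one common schema."""
    return {
        "seeker_id": int(row.get("seeker_id", row.get("id"))),
        "degree": row.get("degree", row.get("education")),
        "city": row.get("city", row.get("location")),
    }


def load(rows: list, store: list) -> list:
    """Append the consolidated rows to the (in-memory) processed zone."""
    store.extend(rows)
    return store


processed_zone = load([transform(r) for r in extract(CSV_SOURCE, JSON_SOURCE)], [])
for row in processed_zone:
    print(row)
```

In a real DL the raw CSV and JSON would stay in the lake untouched, and only the consolidated view would be exposed to the target layer through the API or connectors.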
3.2 Algorithms
Data collection and preparation are fundamental for running ML models. With data in a DL, we can store data as it is, with no need to structure it first, and run different types of analytics by applying a schema at the time of analysis (“schema on read”). Query results are obtained faster, from more sources and using low-cost storage, and ML helps to automatically find complex patterns in the data. Because ML is able to act without being explicitly programmed, it helps decision-makers to take more accurate decisions.
Clustering Algorithms. We can see from Fig. 1 that we need to process raw data in the DL through the ETL/API process to present useful information to a visualization target, where a user (job seeker or recruitment agent) will use the result, e.g. use the processed data to match a job seeker to a job. To do this task, we need to define suitable processing algorithms that can match and classify data efficiently. A suitable class of algorithms is clustering algorithms [31], which are classified as unsupervised machine learning algorithms. These algorithms are good at discovering data patterns to build natural groups of similar data points, and are particularly useful if there is no explicit data class to be predicted. There are different types of clustering algorithms; the choice of type depends on the case and the data used, and no single type is best for all cases [30]. With our architecture, we will consider the BIRCH clustering algorithm for processing and classifying data. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an agglomerative hierarchical clustering algorithm. It is best used with very large databases. It works by partitioning objects hierarchically using a tree structure and then applies other clustering algorithms to refine the clusters. It processes incoming data points incrementally and tries to produce the best-quality clustering.
It typically finds a good clustering with a single scan of the data and then improves the result with a few additional scans [31].
Natural Language Processing − Word Embedding. Our data includes text that we need to process. Some of it is free text, e.g. job descriptions; other fields are selected from multiple-choice entries, e.g. job level. ML algorithms work only with numeric representations and do not accept text input directly. We therefore need a type of AI model that processes text and converts it to a numeric representation; for this task we exploit Natural Language Processing (NLP). NLP is a part of the Artificial Intelligence, Computer Science, and Linguistics fields of study whose focus is on making computers understand sentences or words written in human languages [32]. NLP has gained much attention recently because of the many applications and fields in which it can serve. Word embedding is a subfield of NLP concerned with converting words to vectors of real numbers so that they can be fed to algorithms. In other words, it is a numerical representation of text that captures its content [33]. The data contains job descriptions and candidate skills that need to be analyzed to get the best match. Prediction-based methods will enable us to find similar words in the job specification and the job seeker’s profile, which we need for matching. NLP prediction techniques generate word embeddings that capture the meanings and the semantic relationships of words. Specifically, such a technique is an ML algorithm trained on a large corpus of text data; the training process involves either using a word to predict the words that occur in its context, or using the words in some context to predict a specific word of interest. The resulting word embedding is low in dimensionality. Essentially, the algorithm encodes each word as a vector that reflects the other words it occurs with. Therefore, keywords from the job description will be highlighted and mapped to keywords from the candidate’s skills.
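The two training set-ups described above (predicting the context from a word, and predicting a word from its context) differ only in how training examples are cut out of the text. The sketch below builds both kinds of examples from a toy sentence; it does not train a model, and the sentence, the window size, and the function names are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """(center word, context word) pairs: a word predicts its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context words, target word) examples: neighbours predict the word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples


# A toy fragment of a job description.
tokens = "develop python models for data analysis".split()

pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)

print(pairs[:3])
print(examples[1])
```

Words that appear in similar contexts generate similar training examples, which is why a model trained on them ends up placing such words close together in the embedding space.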
There are well-known ML algorithms with pre-trained word embeddings that can be used in models. Examples are Word2Vec and GloVe, which are available through the Gensim library [33]. Currently, a job recruiter looks for keywords in the job description and then tries to find similar words in the candidate’s skills and qualifications. The key to our AI model’s success will be making the model replicate this behavior automatically and learn from its mistakes. To do that, we will use techniques that find keywords in job descriptions and in candidate skills and qualifications and map them. The first technique is TF-IDF, where TF stands for term frequency and IDF for inverse document frequency. It captures how often a word occurs in a document as well as how often that word occurs across the entire corpus, so that both the frequency and the relevance of a word represent its significance [34]. Capturing how often a word occurs in a document will reveal the keywords in job descriptions and candidate skills that are needed for mapping. The second technique is Recurrent Neural Networks (RNN). Deep learning can eventually give machines an ability to think, analogous to common sense, which can augment human intelligence. RNNs will be used to support the generation of keywords, as they have proved their power in this field [35].
Search Models Background. There has been some previous research in the recruiting field, mainly focused on linear models. This previous work tried to find the best candidate for the recruiter but did not consider the complex relationships between features. All previous solutions handled recruiting problems along independent dimensions. For example, Ha-Thuc and his team introduced a collaborative filtering approach based on matrix factorization to generate candidates’ skill expertise scores, then utilized the scores as features in supervised learning and ranked the results with respect to normalized discounted cumulative gain (NDCG) [36, 37].
Another example in this domain is the work of Sahin and his group, who proposed a system divided into an online system for serving the most relevant candidate results and an offline workflow for updating different machine-learned models [38]. This paper differs from previous work in that we have access to the data resident in the Saudi national data bank, where the history of the job seeker is available and can be integrated with the job seeker’s job profile. Moreover, our model works in three layers using a combination of unsupervised learning algorithms, and each layer reduces the computational load on the next. Finally, it captures the users’ choices and assigns them weights, which are then used to rank the results. To our knowledge, this final feature has not been addressed before; in day-to-day practice, this step is usually made implicitly and subjectively. In contrast, in our model, the users’ preferences are calculated and added to the final score, which is then ranked to provide the best recommendation.
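As a concrete illustration of the TF-IDF keyword weighting introduced in this section, the sketch below scores the words of three made-up job descriptions. It uses the plain variant, term frequency multiplied by log(N / document frequency); libraries usually apply smoothed variants, so the exact numbers will differ from theirs. The documents and function name are assumptions for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (plain tf * log(N/df))."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical job descriptions.
job_descriptions = [
    "python developer with sql experience",
    "accountant with excel experience",
    "python data analyst with sql and statistics",
]

weights = tf_idf(job_descriptions)
top_keyword = max(weights[0], key=weights[0].get)
print(top_keyword, round(weights[0][top_keyword], 4))
```

A word such as "with", which occurs in every description, receives a weight of zero, while a word that is specific to one description rises to the top; this is the behavior that makes TF-IDF usable for picking keywords to map between job descriptions and candidate skills.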
4 Proposed Solution
This section presents our proposed solution: a robust AI model that will be built with Python and associated libraries (see Sect. 3.2) as a proof of concept. Our solution extracts relevant information from the available data on both sides, recruiters and job seekers. Our AI model consists of three layers: the Initial Screening layer, the Mapping layer, and the Preferences layer. The three layers work in sequence to match the job seeker with the best job ID, see Fig. 3.
Fig. 3. Proposed AI Recruiting Model (AIRM)
Our proposed model has the following requirements:
• A national recruiting platform already exists;
• Data are clean and prepared for analysis;
• All recruiters’ and job seekers’ data are in the DL;
• Job seekers can be tagged to one job or more;
• A job ID can be tagged to one candidate or more;
• A comprehensive directory of jobs and the majors that suit them should be stored in the DL and updated whenever needed;
• Job seekers’ data should include personal histories, skills, capabilities, the responsibilities that job seekers are willing to take, and accomplishments.
4.1 Initial Screening Layer
This layer works as a preparation phase. It uses BIRCH to cluster job specializations into groups, which enables the second layer to treat each cluster specialty separately. This layer’s input comes from both sides, the recruiters’ data and the job seekers’ data. From the recruiters’ data, the industry name, job level, job title, job location, employment period (full-time or part-time), and gender are considered. From the job seekers’ side, the required data includes industry, location, employment period (full-time or part-time), and gender. The AI model will set these groups and their IDs. The result will be stored as a data frame in the DL. It will be an iterative stream process that takes immediate changes to user profiles into account. This layer reduces the computational requirements of the next layer by enabling the Mapping layer to work only with the needed group ID. Hierarchical clustering is a common form of unsupervised learning and is more suitable for the proposed model than other clustering algorithms. The similarity between clusters is derived from a dissimilarity measure such as the Euclidean distance between two clusters; the larger the distance between two clusters, the better separated they are [39]. There are two commonly known hierarchical clustering approaches, agglomerative and divisive. The data set must be prepared before the clustering process can begin.
Scaling or normalizing the data and imputing missing values are necessary steps. The feature values of each observation are represented as coordinates in n-dimensional space (n is the number of features), and the distances between these coordinates are then calculated. Hierarchical clustering is suitable for the proposed model for three reasons:
1. Hierarchical clustering does not try to make clusters of the same size, as k-means clustering tends to. In the Saudi jobs data set, the group size will vary, and we do not aim to have the same group size. Instead, we need to explore the real size of each job group.
2. Hierarchical clustering does not require the number of clusters as an input parameter. In the Saudi jobs data set, the data is large and changes all the time, so we cannot decide the number of clusters at the beginning of the algorithm. Hierarchical clustering removes the need to pre-define the number of clusters.
3. Hierarchical clustering can handle virtually any distance metric [40].
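The preparation and clustering steps of this layer can be sketched on a toy scale. Note that the example below deliberately swaps in plain single-linkage agglomerative clustering with a distance threshold instead of BIRCH, because it fits in a few lines of standard-library Python; the records, field names, and threshold are assumptions chosen for illustration. It does show the two properties argued for above: categorical profile fields become coordinates, and the number of groups is not fixed in advance.

```python
import math
from itertools import combinations


def one_hot(records, fields):
    """Encode categorical fields as 0/1 coordinates in n-dimensional space."""
    vocab = sorted({(f, r[f]) for r in records for f in fields})
    return [[1.0 if r[f] == v else 0.0 for f, v in vocab] for r in records]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Single linkage: distance between the closest members.
        return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[a], clusters[b]) > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]   # a < b, so index b stays valid
        del clusters[b]
    return clusters


# Hypothetical job postings described by three categorical fields.
records = [
    {"industry": "IT", "location": "Riyadh", "period": "full-time"},
    {"industry": "IT", "location": "Riyadh", "period": "part-time"},
    {"industry": "Health", "location": "Jeddah", "period": "full-time"},
    {"industry": "Health", "location": "Jeddah", "period": "full-time"},
]

points = one_hot(records, ["industry", "location", "period"])
groups = agglomerate(points, threshold=1.5)
print(groups)
```

Each resulting list of record indices would play the role of one group ID passed to the Mapping layer. This quadratic-time sketch would not scale to a national data set, which is the situation BIRCH's tree-based summary is designed for.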
4.2 Mapping Layer
This layer deals with the job groups that have been clustered and given a cluster ID on both sides. It parses the critical skills and responsibilities from the job description, and the qualifications, experience, and responsibilities that the job seeker is willing to take from the job seeker’s data, using the Python Natural Language Toolkit (NLTK) library to process the text data from both sides. A dictionary of keywords, each with a unique ID, will be created using an RNN. It will be stored in the DL and retrieved when needed. After that, Word2Vec will be applied to convert the words in the job description and the job seeker’s qualifications to numeric values (vectors), and the similarity of the words in the job seeker’s qualifications to the dictionary of words built for each job ID will be checked. This calculation is performed for both sides. It is not a redundant task. The preferences from both sides will be added in the next layer. Any outliers will be detected and removed. Then the score of each job ID and job seeker ID will be sorted in a data frame in the DL. Equations (1) and (2) illustrate the computation of ScoreSeek for a particular JobID and ScoreJob for a particular CandidateID, where:
• DicJob is the dictionary of job description words;
• DicSeek is the dictionary of qualification words;
• SimJob is the similarity value of one word from the qualifications against the job description dictionary;
• SimSeek is the similarity value of one word from the job description against the qualification dictionary;
• ScoreJob is the sum of SimJob;
• ScoreSeek is the sum of SimSeek;
• n is the number of words in the dictionary.

$$\mathrm{ScoreSeek}_{jobID} = \sum_{i=1}^{n} \mathrm{SimSeek}_{i} \tag{1}$$

$$\mathrm{ScoreJob}_{candidateID} = \sum_{i=1}^{n} \mathrm{SimJob}_{i} \tag{2}$$
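Equations (1) and (2) can be traced on a toy example. The sketch below makes two assumptions that the text does not fix: the similarity of a word "against a dictionary" is taken as the highest cosine similarity to any word in that dictionary, and the three-dimensional vectors are hand-made stand-ins for real Word2Vec embeddings. All names and values are illustrative.

```python
import math

# Hypothetical 3-d word vectors standing in for Word2Vec embeddings.
VECTORS = {
    "python": (1.0, 0.0, 0.0),
    "sql": (0.0, 1.0, 0.0),
    "coding": (0.8, 0.6, 0.0),
    "accounting": (0.0, 0.0, 1.0),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity(word, dictionary):
    """Assumed definition: best cosine match of `word` within `dictionary`."""
    return max(cosine(VECTORS[word], VECTORS[other]) for other in dictionary)


def score(words, dictionary):
    """Sum of per-word similarities, as in Eqs. (1) and (2)."""
    return sum(similarity(word, dictionary) for word in words)


dic_job = ["python", "sql"]            # keywords of one job description
dic_seek = ["coding", "accounting"]    # keywords of one job seeker

score_seek = score(dic_job, dic_seek)  # Eq. (1): job words vs. DicSeek
score_job = score(dic_seek, dic_job)   # Eq. (2): seeker words vs. DicJob

print(round(score_seek, 3), round(score_job, 3))
```

The two scores differ because each direction sums over a different set of words, which is consistent with the statement above that computing both sides is not redundant.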
4.3 Preferences Layer
This layer adds the preferences, that is, the weights of the words that are more important to each side. The keywords or features will be displayed to both sides on the platform, and each user will sort them by preference. Sorting the words enables the model to assign a weight to each keyword. For example, a recruiter in an academic field may give more weight to a candidate who has published a paper, while a recruiter who wants the candidate to be located where the job is will give more weight to the location. Then an in-depth text ranking will be applied, and the result will be stored back in the DL. Equations (3) and (4) illustrate the computation of ScoreSeek-pre for a particular JobID and ScoreJob-pre for a particular CandidateID, where:
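• SeekPre is the weight given by the job seeker to a word;
• RecPre is the weight given by the recruiter to a word.

$$\mathrm{ScoreSeek}_{seekpre} = \sum_{i=1}^{n} \mathrm{SimSeek}_{i} \cdot \mathrm{SeekPre} \tag{3}$$

$$\mathrm{ScoreJob}_{recpre} = \sum_{i=1}^{n} \mathrm{SimJob}_{i} \cdot \mathrm{RecPre} \tag{4}$$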
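A small sketch can show how the user's sort order might feed Eqs. (3) and (4). The text says that sorting the keywords lets the model assign weights but does not give the weighting rule, so the linear rank-to-weight mapping below is an assumption, as are the per-keyword weights, the candidate names, and the similarity values.

```python
def rank_weights(ordered_keywords):
    """Assumed rule: linear weights by rank, normalized to sum to 1."""
    n = len(ordered_keywords)
    total = n * (n + 1) / 2
    return {kw: (n - pos) / total for pos, kw in enumerate(ordered_keywords)}


def weighted_score(similarities, weights):
    """Sum of similarity * preference weight over the keywords."""
    return sum(sim * weights.get(word, 0.0) for word, sim in similarities.items())


# A recruiter in an academic field sorts the keywords by importance.
rec_pre = rank_weights(["publications", "location", "python"])

# Hypothetical per-keyword similarities (SimJob) for two candidates.
candidates = {
    "cand_A": {"publications": 0.9, "location": 0.2, "python": 0.6},
    "cand_B": {"publications": 0.1, "location": 1.0, "python": 0.9},
}

unweighted = {cid: sum(sims.values()) for cid, sims in candidates.items()}
weighted = {cid: weighted_score(sims, rec_pre) for cid, sims in candidates.items()}
ranking = sorted(weighted, key=weighted.get, reverse=True)

print(unweighted)
print(weighted)
print(ranking)
```

Without preferences, the second candidate has the higher total similarity; once the recruiter's emphasis on publications is applied, the first candidate ranks higher. This is the effect the Preferences layer is meant to make explicit rather than leaving it to an implicit, subjective judgment.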
5 Future Work
The next step in this research is to implement a robust AI model using Python and its related libraries as a proof of concept; the solution will use relevant information from the available data on both sides, recruiters and job seekers. The implementation will be divided into three phases. In the first, the Initial Screening phase, we will use BIRCH to cluster job specializations into groups. In the second phase, we will process text data from both sides using the Python Natural Language Toolkit (NLTK) library and build a mapping model that parses information from job descriptions and job seekers’ data to build dictionaries, which will be constructed using an RNN. In the third phase, the preferences from both sides will be added to the model. The final step is to evaluate the model’s performance.
6 Conclusion
This paper started with an overview of the Saudi labor market and the government’s tremendous efforts to improve it. We highlighted some of the current national projects, such as the establishment of the Saudi Data and Artificial Intelligence Authority (SDAIA). The analysis identified the need for a national central data repository and an AI model that can think like a human in the recruiting field. We then presented the concept of a DL as a suitable data repository with significant advantages over traditional data repositories. Furthermore, the paper proposed an AI recruiting model, AIRM, suitable for the Saudi labor market, which takes into consideration the preferences of the recruiter and the job seeker and works to imitate the human brain. The AI model consists of three layers that work in sequence to match the job seeker with the best job ID. The first layer, the Initial Screening layer, clusters jobs from the same industry together and gives each group an ID. The second layer, the Mapping layer, uses an RNN to find keywords, and the Python NLTK library with Word2Vec word embeddings to calculate the similarity of the keywords in the job seeker’s qualifications to the dictionary of words built for each job ID and vice versa. Then, the score of each job ID and job seeker ID will be sorted in a data frame in the DL. The third layer, the Preferences layer, adds the preferences as weights on the words that are more important to each side. The result will then be stored back in the DL. Further work in this research is to implement the proposed AIRM model and evaluate its performance.
Chat-XAI: A New Chatbot to Explain Artificial Intelligence
Mingkun Gao1(✉), Xiaotong Liu2, Anbang Xu2, and Rama Akkiraju2
1 University of Illinois at Urbana-Champaign, Champaign, IL 61820, USA, [email protected]
2 IBM Research - Almaden, San Jose, CA 95120, USA, [email protected], {anbangxu,akkiraju}@us.ibm.com
Abstract. Explaining artificial intelligence (AI) to people is crucial, since the large number of AI-generated results can greatly affect people’s decision-making in daily life. Chatbots have great potential to serve as an effective tool to explain AI: they can conduct proactive interactions and collect customer requests with high availability and scalability. We take a first step in exploring the use of chatbots to explain AI. We propose a chatbot explanation framework that includes proactive interactions on the explanation of the AI model and the explanation of the confidence level of AI-generated results. In addition, to understand what users would like to know about AI and so improve the chatbot design, our framework also collects users’ requests about AI-generated results. Our preliminary evaluation shows the effectiveness of our chatbot in explaining AI and gives us important design implications for further improvements.
Keywords: Chatbot · Explaining artificial intelligence · Proactive interactions · Collecting customer requests · High availability and scalability
1 Introduction
Artificial Intelligence (AI) and Machine Learning (ML) algorithms have been widely applied in various scenarios in our daily life. From AI-generated translation between languages to the AI autopilots used in commercial flights, AI has become a vital part of our life, and AI-generated results play an increasingly important role in people’s decision-making [8]. This suggests the importance of explaining AI to people [3]. In addition, the growing adoption of AI technologies has spurred great academic [11,12] and industrial [9,10] interest in studying and practicing how to explain AI to people effectively.
Chatbots, a widely used tool for interacting with people on various tasks [13] (e.g. Microsoft Xiaoice as a virtual companion and Amazon Alexa as an intelligent assistant), have great potential to serve as a successful and effective tool to explain AI to people. Chatbots are good at conducting proactive interactions with people and collecting users’ requests [16]. Before chatbots were applied in industry, most communications between users and service agencies were passive interactions, meaning that users received responses only when they contacted a service agency and made a specific request. When people encounter AI-generated results, many of them don’t know what to ask, and many even believe that AI is perfect [17], especially lay users without any background in AI. This makes passive interactions less effective in explaining AI to people. Chatbots can overcome this problem by conducting proactive interactions with users. More specifically, chatbot designers can prepare explanations for a set of pre-determined questions that they assume users would like to ask, and use chatbots to prompt users to check those explanations during the conversation. But how do chatbot designers know what requests users tend to make? Previous studies [1,3] addressed this problem by conducting large-scale interviews and asking interviewees what they would ask about AI-generated results. Since chatbots are well suited to collecting users’ requests at a very large scale, we can use them to collect what users intend to ask about the AI-generated results. We can then use the collected questions to expand the existing question set and hence improve the design of the chatbot in subsequent design iterations. In addition, chatbots have the advantages of high availability [15] and high scalability [14]; they are available 24/7 and can be scaled to serve an unlimited number of users simultaneously.
To build a better platform for AI explanation, we take a first step in exploring how to use chatbots to explain AI-generated results. Previous research indicates that explaining how the AI model works [1,3] and its confidence level [2] is important for users to understand the AI modeling process. To fully utilize the strengths of chatbots in collecting users’ requests and conducting proactive interactions, we propose a chatbot explanation framework with three main design principles: 1) proactively collect users’ requests about AI; 2) proactively explain the AI model; and 3) proactively explain the confidence level. Following these design principles, we implemented Chat-XAI, a chatbot for the IT Operations team of a corporation to explore the story of anomalies detected in a train ticket booking system. Figure 1 shows an example conversation between a user and Chat-XAI. For all the AI-generated information shown in the story, Chat-XAI follows our explanation framework to collect users’ requests (questions) and to explain how the AI works and its confidence level. Our preliminary evaluation of Chat-XAI indicates the effectiveness of our chatbot AI explanation framework and the strengths (e.g. the convenience of exploring explanations) of using chatbots to explain AI. In addition, the in-situ requests (questions) about AI that Chat-XAI collected from users show that users care about the important concepts included in the AI-generated results, and our interviews suggest that users prefer more flexibility when exploring explanations in the conversation. This is valuable feedback for improving the chatbot design in the future.
M. Gao: Work was done during an internship at IBM.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 125–134, 2022. https://doi.org/10.1007/978-3-030-82199-9_9
Fig. 1. A screenshot of an example conversation in which an IT supporter in the SRE team of the corporation explores information with Chat-XAI.
2 Chatbot Explanation Framework
Previous work on explaining AI [1] emphasized the importance of helping people, especially those with little AI background, understand how the AI model works overall through simple explanations. In addition, Kocielnik et al. [2] indicate that showing and explaining the confidence level of AI-generated results can adjust people’s expectations of AI models and help to promote people’s satisfaction with and acceptance of AI. Meanwhile, to understand what people want to ask about AI-generated results, previous studies [1,3] on explaining AI collected users’ requests through large-scale interviews. Chatbots have proven good at collecting users’ requests and conducting proactive interactions with high availability and high scalability. These advantages make chatbots a good choice for explaining AI.
2.1 Design Principles
To help people understand AI-generated results, and following previous work [1–3], we derived the design principles of our chatbot AI explanation framework below:
– Proactively collect users’ requests about AI: For AI-generated results, the chatbot tells users that the result is generated by AI and proactively asks users what questions they want to ask about AI. Next, the chatbot collects users’ requests (questions). This helps the chatbot designer collect users’ questions about AI, which can be used to improve the conversation design in the future.
– Proactively explain the AI model in a general way: For AI-generated results, the chatbot proactively asks users whether they want to check how the result was generated by AI, and shows a brief explanation of how the AI model works in general if they do. This helps users understand how the AI model works generally, especially users who know nothing about AI and don’t know what to ask about it.
– Proactively explain the confidence level of AI in a general way: For AI-generated results, the chatbot asks users whether they want to check the confidence level of AI and whether they want to check how AI evaluates the confidence level. Next, the chatbot shows the corresponding information (e.g. the confidence level, the explanation of how AI evaluates the confidence level) to users who indicate they want to see it. This helps users get a good sense of the accuracy of the result and reminds them that AI is not perfect.
To make the conversation flow flexible, after we show the AI-generated results, we first collect users’ questions about AI proactively. If the user asks about how the AI model works or about the confidence level of the AI result, our chatbot shows the corresponding explanation directly and proactively prompts the user to check the other explanations.
2.2 Chatbot Implementation
We built our chatbot Chat-XAI as a Slack app, with a backend built on the Watson Assistant Service [19]. Watson Assistant Service is a popular platform for building chatbots that can be integrated into other platforms, such as Slack. Following our design principles, we implemented our chatbot to help the IT Operations team of a corporation understand problems in a train ticket booking system. Anomaly detection from logs is a fundamental IT Operations management task, which aims to detect anomalous system behaviors and find signals that provide clues to the reasons for, and the anatomy of, a system’s failure. The system generates a story (report) consisting of information (e.g. update time, severity level, summary of the anomaly) about several anomalies detected during each given period of time, and the story is sent to the IT Operations team to help them understand the corresponding system failure. Among all the information shared, three pieces of information are generated by different AI algorithms:
– Insight of a story: The insight of a story is generated by the TextRank algorithm [4]. Every story consists of several anomalies, and each anomaly has its own log file. The TextRank algorithm is applied to the text information (e.g. title) of the log files of all anomalies in a story to generate the insight, which can be considered the summary of the story.
– Summary of each anomaly: An anomaly is detected by a PCA algorithm [7] based on count vectors of templates extracted from logs [5].
Table 1. Examples of explanations for different AI-generated results.
| | Insight of a story | Summary of an anomaly | Topology graph of related microservices |
|---|---|---|---|
| AI-generated result | Connection Error, Unknown Normal, Service Query, Opened Connection Error | NETWORK End Connection Error; NETWORK [listener] Connection Error; Unknown Normal | See Fig. 2 |
| Explanation of the AI model | The insight was extracted by an AI algorithm (TextRank) which can find words of frequent co-occurrence from the log files of the corresponding anomalies | Our anomaly detector parses logs into log templates that represent the key information from logs. A PCA model was built on the logs in the normal state, and an anomaly was detected if the distribution of log templates during runtime deviates from that in the normal state. (The templates are ranked by their frequency discrepancy from the normal state below.) Discrepancy and template: 0.99 NETWORK End Connection Error; 0.91 NETWORK [listener] Connection Error; 0.67 Unknown Normal | Application and network topology refers to a map or a diagram that lays out the connections between different applications or microservices. It is generated by the Granger causality model based on the correlations between microservices. The red part shows which service has the problem |
| Confidence level | Very High Confidence | 85% | 98% |
| Explanation of how the confidence level is evaluated | The level of confidence was evaluated by human experts | The confidence of a detected anomaly is higher if the distribution of log templates during runtime deviates farther from that in the normal state | The confidence was derived by aggregating the causality scores in the model |
– Topology graph of related microservices: The topology graph is generated by the Granger causality algorithm [6] from the invoking interactions among microservices.
For the insight of a story, the summary of an anomaly, and the topology graph of related microservices, Table 1 shows examples of the AI-generated result, the explanation of the AI model, the confidence level, and the explanation of how the confidence level is evaluated. Table 2 shows the features mentioned in the design principles, with dialog examples.
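To make the first of these three components more concrete, the following is a minimal Python sketch of TextRank-style keyword ranking: words become nodes of an undirected co-occurrence graph, and a PageRank-style iteration scores them. It is not the authors’ implementation. The tokenization, the adjacent-word window, the absence of part-of-speech filtering, and the example log titles are all assumptions made for illustration, and the real system produces phrase-level insights rather than single words.

```python
from collections import defaultdict


def textrank_keywords(texts, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by a PageRank-style score on a co-occurrence graph."""
    # Build an undirected graph: words are linked when they co-occur
    # within `window` consecutive tokens of the same text.
    neighbors = defaultdict(set)
    for text in texts:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            for other in tokens[i + 1:i + window]:
                if other != token:
                    neighbors[token].add(other)
                    neighbors[other].add(token)
    if not neighbors:
        return []

    # Iterate the unweighted TextRank update:
    # S(w) = (1 - d) + d * sum(S(n) / degree(n) for n adjacent to w)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1.0 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical log-file titles for the anomalies in one story.
titles = [
    "network connection error",
    "listener connection error",
    "opened connection error",
    "connection timeout",
]
print(textrank_keywords(titles, top_k=1))  # ['connection']
```

Here every other word is linked only to "connection", so that word ends up with the highest score; on real log titles the graph would be denser and a larger window or phrase merging would likely be needed.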
3 Field Deployment and Preliminary Evaluation
We deployed Chat-XAI on the Slack channel of the IT Operations team of the corporation to assist them in their system maintenance tasks. To collect users’ requests for further improvement of the chatbot design, we conducted semi-structured interviews with four Site Reliability Engineers (SREs) in the IT Operations team, whose working experience ranges from 8 months to 6 years. All of these SREs had used chatbots before, and three of them had studied AI at school. During the interview, each SRE used Chat-XAI to explore a story consisting of three anomalies and to complete the task of understanding a system breakdown (e.g. why and where the system broke down). All participants checked the AI-generated results, including the insight of a story, the summary of each anomaly, the topology graph of microservices, and the corresponding explanations.
Fig. 2. An example of a topology graph of related microservices.
Table 2. Dialog examples of design principle features.
Proactively collect users’ requests about AI
Chat-XAI: What question(s) do you have about the AI-generated insight of the story? Please write it down.
User: Should I create a ticket for this?
Proactively explain the AI model in a general way
Chat-XAI: Do you want to know how to automatically get the insight from the story information?
User: Yes.
Chat-XAI: The insight was extracted by an AI algorithm (TextRank) which can find words of frequent co-occurrence from the log files of the corresponding anomalies.
Proactively explain the confidence level of AI in a general way
Chat-XAI: Do you want to check our level of confidence in the AI-generated insight?
User: Yes, I want to check.
Chat-XAI: Very High Confidence
Chat-XAI: Do you want to know how we evaluate the level of confidence in the AI-generated insight?
User: Yes.
Chat-XAI: The level of confidence was evaluated by human experts
3.1 Findings
Explanation Framework. Our preliminary evaluation indicates that our chatbot AI explanation framework can effectively help users understand the AI modeling process behind the AI-generated result and improve their perception of the results. When asked whether this explanation framework helps in understanding the AI results, P2 said: “Yeah, right now, I see the information here and it’s quite helpful”, P3 said: “Yeah, it makes sense. It does make sense.” and P4 said: “This is really interesting kind of generation. .... It is extremely useful. I think it’s telling what’s happening.” Furthermore, our explanation framework also helps users improve their perception of the confidence level of the AI-generated result. For example, P2 also mentioned that “I mean the summary and the description (the insight) all here definitely helps me understand how it works and also showing the confidence score definitely boosts the confidence I’m using it.” In addition, proactively prompting users to check explanations of AI-generated results and the evaluation of the confidence level was also welcomed. For example, P2 said: “I like this kind of interaction of the chatbot system. We can ask (pre-)generated questions and get the answer out of it. I like it.”
User-Driven In-Situ Feedback. Chat-XAI can help collect users’ requests (questions) about AI effectively and efficiently. This makes it much easier for chatbot designers and researchers in AI explanation to gain insight into what people want to ask, so that they can further improve the design of the chatbot and the AI explanation framework. From the in-situ inputs collected by Chat-XAI, we found that users care about the details of important concepts mentioned in AI-generated results. Some participants asked Chat-XAI the pre-determined questions we assumed they would ask (i.e. how the AI model works and what the confidence level is) during their communication with Chat-XAI.
For example, P1 asked Chat-XAI “How was this summary generated?”, P2 asked “How confident AI is about these results?” and P3 asked “How did you generate these insights?” This indicates that explaining how the AI model works, what its confidence level is, and how that confidence level is evaluated is important for users to understand the AI results. Meanwhile, through Chat-XAI we also collected requests (questions) that approached the AI-generated results from perspectives we had not assumed. In addition to how the AI model works and what the confidence level is, we found that participants are curious about the details of important concepts mentioned in the AI-generated results. For example, P1 asked Chat-XAI “What is internal server error?” (“Internal Server Error” is part of the AI-generated insight of the story P1 went through). P4 asked Chat-XAI “Where are the connection error found?” and “What region is this (connection error) in?” These collected questions indicate that when AI-generated results include important concepts (e.g. connection error), explaining those concepts in detail would help users understand the results.
Advantage and Precaution of Explaining AI with Chatbots. To gain more general insights into using chatbots to explain AI, at the end of each interview we asked participants about the advantages and precautions of doing so. Our analysis shows that the advantage of using a chatbot to explain AI is that it is convenient for users to explore information: chatbots can present information in a way that requires minimal actions from people. For example, P2 said: “I definitely see advantages. (I) Like grouping all the events (anomalies); (I) like listing them all here. And (I) also like at the same time within the same specific window we could see any other anomalies listed up here.” and P4 said: “The advantage of that (Chat-XAI) is everything can be done through minimal operations.” On the other hand, the precaution is that the conversation flow should be designed in a more flexible way to better lead users to check the explanation. P3 said: “I think if I have to say one thing over all, this is to be really useful tool. It would be nice to have a bit more freedom on the question I was asking. ... If you think there’s one aspect of importance, but (the) story (conversation flow) decides to lead you to a different way.” and P4 said: “The disadvantage is getting stuck the in the flow sometimes.” These results indicate that, in the conversation flow for explaining AI, in addition to proactively prompting users to check pre-determined explanations that the chatbot designer considers important, the chatbot should also give users more freedom to ask questions from different perspectives.
However, to realize this, the chatbot designer should collect more users’ requests (questions) about the AI-generated results and prepare more explanations, which indicates the importance of our first design principle (i.e. using chatbot to collect users’ requests proactively).
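One way to picture a flow that answers free-form questions directly, keeps prompting for explanations the user has not yet seen, and records the questions it cannot answer is the Python sketch below. It is an illustration only: the keyword routing in classify is a hypothetical stand-in for the intent recognition that a platform such as Watson Assistant would provide, the class and variable names are invented, and the reply texts are borrowed from the dialog examples in Table 2.

```python
# Prepared explanations, keyed by topic (texts taken from Table 2).
EXPLANATIONS = {
    "model": (
        "The insight was extracted by an AI algorithm (TextRank) which can "
        "find words of frequent co-occurrence from the log files of the "
        "corresponding anomalies."
    ),
    "confidence": "Very High Confidence",
}

# Proactive prompts for explanations the user has not seen yet.
PROMPTS = {
    "model": "Do you want to know how the insight was generated?",
    "confidence": (
        "Do you want to check our level of confidence in the "
        "AI-generated insight?"
    ),
}


def classify(question):
    """Map a free-text question to a prepared topic (hypothetical rule)."""
    text = question.lower()
    if "confiden" in text:
        return "confidence"
    if "generate" in text:
        return "model"
    return None


class ExplanationBot:
    def __init__(self):
        self.collected = []    # every in-situ question, kept for designers
        self.unanswered = []   # questions with no prepared explanation
        self.shown = set()     # topics already explained to this user

    def ask(self, question):
        """Answer directly when possible, then prompt for what is left."""
        self.collected.append(question)
        replies = []
        topic = classify(question)
        if topic is None:
            self.unanswered.append(question)
            replies.append("Thanks, your question has been recorded.")
        else:
            self.shown.add(topic)
            replies.append(EXPLANATIONS[topic])
        replies.extend(PROMPTS[t] for t in PROMPTS if t not in self.shown)
        return replies


bot = ExplanationBot()
for reply in bot.ask("How was this summary generated?"):
    print("Chat-XAI:", reply)
for reply in bot.ask("What is internal server error?"):
    print("Chat-XAI:", reply)
print("Questions to prepare explanations for:", bot.unanswered)
```

The list in unanswered is what a designer would review between design iterations to decide which new explanations to prepare.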
4 Discussion and Future Work
Can chatbots play a role in explaining AI to users? The answer is YES! Given chatbots’ strengths in collecting users’ requests and conducting proactive interactions, we propose a chatbot AI explanation framework that proactively collects what users want to ask about AI and proactively prompts users to check the explanation of how the AI model works and the corresponding confidence level. Our preliminary evaluation indicates that chatbots can help to explain AI-generated results to users effectively and that users like the convenience the chatbot provides when exploring information. Our work takes a first step towards using chatbots to explain AI-generated results. Neither explaining AI nor using chatbots to conduct specific interactive tasks is a new topic. However, just like conversation-driven development with ChatOps [18], this surprisingly simple combination of explaining AI and chatbots works well for helping users understand AI. Our work can also help researchers in explaining AI to collect users’ questions about AI and, given chatbots’ high availability and high scalability, to gain insights at a very large scale into what confuses people about AI and why.
Based on the feedback we collected through Chat-XAI and the semi-structured interviews, we plan to add more detailed explanations of important concepts mentioned in the AI-generated results. In addition, we will make the conversation flow more flexible and give users more freedom to explore the explanation of AI-generated results. Furthermore, for the AI-generated graph (i.e. the topology graph of related microservices), we will replace the current static version with an interactive version to improve the user experience, so that users can click a component of the graph directly to check its explanation rather than entering text requests.
References
1. Dhanorkar, S., Wolf, C.T., Qian, K., Xu, A., Popa, L., Li, Y.: Who needs to know what, when?: broadening the Explainable AI (XAI) design space by looking at explanations across the AI lifecycle. In: Proceedings of the 2021 ACM Designing Interactive Systems Conference
2. Kocielnik, R., Amershi, S., Bennett, P.N.: Will you accept an imperfect AI? Exploring designs for adjusting end-user expectations of AI systems. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
3. Liao, Q.V., Gruen, D., Miller, S.: Questioning the AI: informing design practices for explainable AI user experiences. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems
4. Mihalcea, R., Tarau, P.: TextRank: bringing order into text. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing
5. Xu, W., Huang, L., Fox, A., Patterson, D., Jordan, M.I.: Detecting large-scale system problems by mining console logs. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles
6. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral methods. In: Econometrica: Journal of the Econometric Society (1969)
7. Wold, S., Esbensen, K., Geladi, P.: Principal component analysis. In: Chemometrics and Intelligent Laboratory Systems (1987)
8. Phillips-Wren, G.: AI tools in decision making support systems: a review. Int. J. Artif. Intell. Tools (2012)
9. H2O Driverless AI (2018). https://www.h2o.ai/products/h2o-driverless-ai/
10. Arya, V., et al.: One explanation does not fit all: a toolkit and taxonomy of AI explainability techniques. In: arXiv preprint arXiv:1909.03012 (2019)
11. Adadi, A., Berrada, M.: Peeking inside the black-box: a survey on Explainable Artificial Intelligence (XAI). In: IEEE Access (2018)
12. Gilpin, L.H., Bau, D., Yuan, B.Z., Bajwa, A., Specter, M., Kagal, L.: Explaining explanations: an overview of interpretability of machine learning. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA)
13. Grudin, J., Jacques, R.: Chatbots, humbots, and the quest for artificial general intelligence. In: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems
14. Reasons Why your Business needs a Chatbot? (2020). https://marutitech.com/7reasons-why-business-needs-chatbot/
15. Techlabs, M.: Top 5 Benefits Of Using Chatbots For Your Business (2017). https://chatbotsmagazine.com/top-5-benefits-with-using-chatbots-for-yourbusiness-159a0cee7d8a
16. Saunders, A.A.: Top 7 Benefits of Chatbots for Your Business (2017). https://www.digitaldoughnut.com/contributors/asena
17. Sahota, N.: Perfectly Imperfect: Coping With The ‘Flaws’ Of Artificial Intelligence (AI) (2020). https://www.forbes.com/sites/cognitiveworld/2020/06/15/perfectlyimperfect-coping-with-the-flaws-of-artificial-intelligence-ai/
18. Regan, S.: What is ChatOps? A guide to its evolution, adoption, and significance (2016). https://www.atlassian.com/blog/software-teams/what-is-chatopsadoption-guide
19. Watson Assistant (2020). https://www.ibm.com/cloud/watson-assistant/
Global Postal Automation
Aimee Vachon, Leslie Ordonez, and Jorge Ramón Fonseca Cacho
Department of Computer Science, University of Nevada, Las Vegas, Las Vegas, NV 89154, USA
{navarra,ordonl1}@unlv.nevada.edu, [email protected]
Abstract. Since the late 1950s there has been a shift in postal automation in the United States. In this paper, we examine the effects of postal automation in the U.S. and focus on the major innovations that have been implemented to increase productivity and decrease labor costs. From the beginning of the New Field Theory approach to image processing technologies, we focus on the role of Optical Character Recognition (OCR). Similarly, we also discuss how other countries, such as India, China, and Japan, have executed postal automation. With rising concerns about automation and the effect it could have on the labor market, we examine what lies ahead in the future for postal automation.
Keywords: Postal automation · Optical Character Recognition · Hand-written address interpretation technology · New Field Theory · Binarization · Letter sorting machine · Feature extraction · Bangla numeral recognition
1 Introduction
Over the last three decades, postal automation has become a prevalent subject of research and analysis. Among the papers published on this topic are those that outline postal mail operations for both analytical and informative purposes. There are also papers that review inefficiencies in the postal system and discuss existing or potential solutions and optimizations, as well as papers that forecast future advancements in the postal automation industry. Our paper, however, focuses on the following:
1. The history and background of postal automation, including the application of the New Field Theory method of problem solving [12], an introduction to Optical Character Recognition (OCR) [16–19,22,26,32], and an introduction to Hand-written Address Interpretation (HWAI) technology [25];
2. An overview of the mail process in the United States, which covers mail classification, the handling process, and the technologies used [8,9,11,34];
3. An overview of global postal automation, focusing on automation in India [33,35], China [1,21], and Japan [3,5,15];
4. Future projections or predictions for the postal automation industry [4];
5. How postal automation promotes innovation and redefines the labor market in a positive way.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 135–154, 2022. https://doi.org/10.1007/978-3-030-82199-9_10
2 Background
2.1 History of Postal Automation
1920s: Gehring Mail Distributing Machine: Very little documentation is available for this distributing machine. It is a single-keyboard letter sorting machine; see Fig. 1.
Fig. 1. Image of a clerk on a Gehring mail distributing machine [11].
1940s to 1960s: In the 1940s, Sestack Letter Sorter machines were used, but they soon became obsolete because they were prone to jams; see Fig. 2. From the 1940s to the 1960s, the amount of mail being processed by the United States Postal Office more than doubled [11]. Interest in automating the postal service led to an increase in research into new technologies.
Fig. 2. Image of a Sestack Letter Sorter [11].
1950s: Shift to Mechanization [11]. In 1956, research on sorting codes, multiposition letter sorting machines, and Mark II facer-cancelers helped automate the sorting of mail for delivery based on geographic areas represented by zone codes. The multiposition letter sorting machine (MPLSM) Transorma was a foreign-built machine, first used in the U.S. during this period, that sorted mail based on the zone codes. The Transorma could sort letters, cards, and circulars. At a rate of 15,000 pieces per hour, the Transorma could sort double the amount sorted by the same number of clerks [11]. During the late 1950s, American-built MPLSMs were being manufactured and distributed nationwide.
1965: First OCR Appears: Optical character recognition technology was used in conjunction with an MPLSM. The first-generation OCR could sort up to 43,000 addresses per hour [11]. This machine had two limitations: the letters fed into it had to be presorted, and it could read only about 80 of the 600 most common typefaces used at the time [11].
1980s: Updates: In 1983, the USPS awarded a contract for postal research to the Center of Excellence for Document Analysis and Recognition (CEDAR) at the State University of New York at Buffalo (SUNY Buffalo) [7]. The first single-line optical character reader (OCR) was introduced in Los Angeles. This OCR could read a five-digit ZIP Code and print it on the mail piece as a POSTNET barcode so that mail could be correctly sorted by delivery destination (it was later expanded to read the ZIP+4 code, an improvement on the old ZIP Code). OCR became more widely used and could process 6,200 pieces of mail per work hour, compared to 1,750 pieces processed by MPLSMs [11].
1980s-1990s: New Versions of Old Equipment Appear: The Postal Service's Corporate Automation Plan, which began in the early 1980s, led to an increase in the quality of service [24,25]. Multiline Optical Character Readers (MLOCRs) began to replace the single-line OCRs, since these new machines could read both new and old versions of ZIP Codes. Advanced Facer-Canceler Systems (AFCSs) replaced Mark II Facer-Cancelers, since they could process almost double the amount (30,000) [11]. Automating the sorting based on destination required an update to ZIP+4 to include a delivery point. Delivery Barcode Sorters (DBCSs) were implemented and later replaced MPLSMs. The FSM 775 is a flat sorting machine introduced in 1982. This machine could sort 6,200 flats per hour into 100 bins [11]. The FSM 881 followed in 1992 and raised the rate from 6,200 to 10,000 flats per hour. FSM 1000s were later introduced for flats that could not be processed by the FSM 881. In 1999, the AFSM 100 became the first fully automated flat sorting machine, capable of processing 300,000 flats a day [11]. Delivery offices implemented the Flats Sequencing System (FSS), which sorts 16,500 flats per hour into delivery order [11]. The Remote Bar Coding System (RBCS), the image management system for processing letter mail that cannot be handled by postal OCR, utilized HWAI technology. From 1996 to 1997, about 250 HWAI technology systems were distributed nationwide while research was still in its early stages [25].
Early 2000s to Present: Updates: Delivery Barcode Input-Output Subsystem Sorters (DIOSS) are upgraded delivery barcode sorters that have 302 bins to hold mail (about 5x more than previous machines) and improved the overall efficiency of the postal service [11]. Remote encoding centers were used for letters whose addresses could not be read. These centers were used in the late 1990s as a temporary solution, since most machines began to have the ability to read handwritten addresses. In 2010, automated parcel and bundle sorters appeared and increased throughput to 4,500 pieces an hour (compared to 2,770 pieces an hour) [11]. The Mailing Evaluation Readability Lookup Instrument (MERLIN) automated the determination of postage discounts, sorting, and barcode processing for First-Class Presort Mail. The Intelligent Mail barcode (IMb) is the upgraded version of the POSTNET and PLANET codes used to identify and process mail.
This new barcode offers a higher range of services and improved efficiency in sorting, processing, and delivery [11].
2.2 A New Field Theory Approach to Postal Automation
In 1964, Asher and Post applied the new field method to postal automation and addressed the issue of “informational decision-making by postal operators, who in the future, may encode mail for computer processing” [12]. Theoretical models for complex problem solving are constructed analytically and then translated into promising solutions. However, this translation may have no practical algorithmic form, in which case it cannot be achieved in a programmed manner. This is where the new field model applies to problem solving: it employs a more heuristic approach to the process. In 1963, James Asher outlined the problem solving process for a new field model in phases, as follows:
1. Concept Disruption: any potential blemishes or alternative notions would be exposed, thus disrupting the concept.
2. A State of Tension: the tension resulting from the disruption is exerted on the individual and becomes the motivation for problem solving.
3. Sustaining Tension: since tension is the motivator for the process, maintaining it is crucial to achieving a solution.
4. Selective Perception and Fantasy: this prompts the problem solver to envision approaches to the problem, which is meant to lead to the conception of a solution matrix.
5. A Solution Matrix: the resulting collection of different approaches to solving the problem.
This method was applied to postal automation, specifically to the question of how postal workers would encode mail so that mail sorting could be processed by computers. At the time, when a letter arrived at a mail distribution center, it had to be sorted manually by postal workers. The letter could potentially go through up to five different clerks before the postman handled it for delivery. This was clearly an inefficient system that would not hold up to the “population explosion” [12] happening at the time. The amount of mail to be handled was expected to double within a few years. A prominent hindrance was the computer's inability to decipher handwritten addresses on envelopes. The challenge, therefore, was to formulate a way to code an address into a language that a machine could understand, which would make a mail sorting machine a more viable solution. However, this was not an easy task in itself. The code would need to distinguish among all the people in the area covered. In order to solve this problem or, at the very least, pose a workable solution, two theoretical models were constructed, each with its own distinctive characteristics:
– The first model was aimed at simplifying the stimulus involved and the training needed for individuals to learn the code.
– The second model was geared toward minimizing the training required and avoiding any conflict with the individual's current position, such that it was “designed to permit the employee to transfer his present skill and knowledge to the new system” [12].
These models led to the conception of the Map Keyboard, which required only two postal workers to operate. For each letter received, the first operator selected the general area on the map for the recipient's address, and the second operator selected the specific building to which the letter would be delivered. As a theoretical concept, this generated considerable interest, as it met most of the requirements of the models: the training was minimized and the worker's current orientation was undisturbed. Though not implemented in practice, it provided an excellent demonstration of how the new field theory method could be applied to more complex problem solving.
2.3 Introduction to Optical Character Recognition
Optical Character Recognition, or OCR, goes all the way back to Emanuel Goldberg in 1914, when he invented a machine that converted the characters it read into telegraph code. With OCR technology, character recognition is done automatically through an optical mechanism, such as a camera. The character recognition applies to both handwritten and printed text. OCR systems function similarly to how humans read; however, they are not as accurate as humans, and their performance relies on the quality of the input documents. One example of an application of OCR is in mobile applications, which read an input through the camera and process the information using OCR.
The OCR process starts by scanning the written text one wishes to digitize. The resulting image is then pre-processed, a step that is aimed at increasing accuracy. Pre-processing can range from simply increasing the contrast of a black and white image to using smoothing and normalization [22] to remove noise and other data that may confuse the next step of the process. That next step is detecting and extracting features, which then become part of the pattern recognition of each letter. To do so, patterns for each letter may be pre-programmed (or learned from a sample data set in the case of machine learning). Segmentation may also be used to separate the letters and the areas of the image that contain text. This can be done before or after pre-processing. At this point a digital version of the text is available. However, it may contain errors from incorrect pattern recognition. Therefore, a post-processing step follows in order to correct errors generated in the OCR process. This post-processing may include using a confusion matrix, or corrections based on the context surrounding the error [19]. Machine learning can also be used to improve most parts of the OCR process, including how to select between candidate corrections, given many of these aforementioned features [16–18].
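To make the post-processing step more concrete, the following is a minimal sketch of confusion-based correction against a word list. It is not the method of [16–19]; the table `CONFUSION_COST`, its cost values, the `max_cost` threshold, and the helper names are all hypothetical choices made for illustration, and only same-length substitutions are considered.

```python
# Hypothetical confusion costs (illustrative values, not measured data):
# a low cost means the two symbols are assumed to be easily mixed up by OCR.
CONFUSION_COST = {
    ("0", "O"): 0.1,
    ("1", "I"): 0.1,
    ("5", "S"): 0.2,
    ("8", "B"): 0.2,
}
DEFAULT_COST = 1.0  # cost of a substitution that is not a known confusion


def substitution_cost(seen, intended):
    """Cost of reading `seen` when the true character was `intended`."""
    if seen == intended:
        return 0.0
    # Confusions are treated as symmetric to keep the sketch short.
    return CONFUSION_COST.get(
        (seen, intended), CONFUSION_COST.get((intended, seen), DEFAULT_COST)
    )


def correct_token(token, lexicon, max_cost=0.5):
    """Return the cheapest same-length lexicon word, or the token unchanged."""
    best_word, best_cost = token, max_cost
    for word in lexicon:
        if len(word) != len(token):
            continue  # insertions and deletions are out of scope here
        cost = sum(substitution_cost(s, w) for s, w in zip(token, word))
        if cost <= best_cost:
            best_word, best_cost = word, cost
    return best_word


lexicon = ["LAS", "VEGAS", "NEVADA", "BUFFALO"]
print(correct_token("VEGA5", lexicon))    # a likely 5/S confusion
print(correct_token("8UFFAL0", lexicon))  # likely 8/B and 0/O confusions
```

A token is replaced only when its total substitution cost stays below the threshold, so a word that merely resembles a lexicon entry is left untouched. A production system would typically also weigh context, as the text notes.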
Even after all of this, some errors may still remain; however, OCR errors have been shown not to affect the average accuracy of text retrieval or text categorization [26]. Research at UNLV's Information Science Research Institute (ISRI) [32] has shown that the effect of OCR errors on information access is negligible as long as high quality OCR devices are used [27–31]. There are several different applications for OCR including, but not limited to, bank check processing, passport recognition at airports, business card information extraction, and automatic number plate recognition. Overall, OCR technology allows data to be converted from a physical document into digital text with accuracy, a decreased margin of error, and minimal time and effort for the user.
2.4 Hand-Written Address Interpretation Technology
In the 1980s, letter mail processing systems that used OCR technology processed mail at the rate of 30,000 pieces per hour, compared to 500 to 1,000 pieces per hour for systems without OCR [25]. RBCS, an image management system, was an attempt at narrowing this difference in processing rates by assigning bar codes. RBCS had a Remote Computer Reader (RCR) that used HWAI technology. The HWAI system described here, developed by CEDAR at SUNY Buffalo, was first used in 1996. The input of the system was a “212 dpi binary image of handwritten address block provided by the RCR system” [25]. The output returned “data structures consisting of ZIP Codes, add-ons, confidences associated with recognition, etc.” [25]. The following is an outline of the HWAI control structure [25]:
1. Line Separation
2. Word Separation
3. Parsing
4. ZIP Code Segmentation and Recognition
5. Postal Directory Access
6. Word Recognition
7. Encode Decision
HWAI technology saved the USPS $100 million in labor costs in 1997 when it was first implemented and went on to save over $1 billion from 1997 to 2005 [23]. HWAI technology played a pivotal role in further automating the USPS.
3 Overview of the Postal Mail Process in the United States
3.1 Mail Classification
Labor costs account for about 85% of total postal costs [34]. A turning point in postal reorganization came with the shift from zone codes to ZIP Codes (and then ZIP+4). Zone codes were limited and were not efficient when it came to sorting by destination. The Zoning Improvement Plan (ZIP) Code solved the limitations of the zone codes. More digits were later added to the original ZIP Codes to hold more information, which made processing more efficient. ZIP+4 uses a nine-digit code. Compared to the old five-digit ZIP Code, this nine-digit code allows for sorting at a finer level than the post office within a large zone. Barcode sorters sort the processed mail based on carrier routes [11]. To efficiently process large amounts of mail, the Postal Service has three categories of mail (letters, flats, and packages), each of which has to be processed differently. The USPS shifted towards a program of postal automation that funded research in OCR (Optical Character Recognition) and BCSs (Barcode Sorters) [11]. The change from zone codes to ZIP Codes helped pave the way for automation by providing an efficient way to identify and sort mail across a much broader geographic area. Businesses were encouraged to adopt the ZIP+4 standard by being offered a discounted rate for large-volume first-class mail. OCR technology could read the ZIP+4 code and then translate it into a bar code in order to print a unique tag onto the mail. Single-line OCRs are capable of reading the last line of an address (city, state, and ZIP). Multi-line OCRs are capable of reading four lines of an address, automatically finding the nine-digit ZIP Code in a database, and printing it onto a letter [11].
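As an illustration of how a ZIP or ZIP+4 code can be translated into a bar code, the sketch below encodes a code in the POSTNET symbology mentioned in Sect. 2.1. It relies on the commonly documented POSTNET layout (five bars per digit, two of them tall, a check digit that brings the digit sum to a multiple of ten, and a tall frame bar at each end). The function names and the `1`/`0` string representation of tall and short bars are our own assumptions, not part of any USPS software.

```python
# Each digit maps to five bars; "1" is a tall bar and "0" is a short bar.
# Bar positions carry the weights 7, 4, 2, 1, 0, and zero is encoded as 7 + 4.
POSTNET_DIGITS = {
    "0": "11000", "1": "00011", "2": "00101", "3": "00110", "4": "01001",
    "5": "01010", "6": "01100", "7": "10001", "8": "10010", "9": "10100",
}


def postnet_check_digit(digits):
    """Digit that makes the sum of all digits a multiple of 10."""
    return (10 - sum(int(d) for d in digits) % 10) % 10


def encode_postnet(zip_code):
    """Encode a 5-, 9-, or 11-digit code as a string of tall/short bars."""
    digits = zip_code.replace("-", "")
    if len(digits) not in (5, 9, 11) or any(d not in POSTNET_DIGITS for d in digits):
        raise ValueError("expected a 5-, 9-, or 11-digit numeric code")
    payload = digits + str(postnet_check_digit(digits))
    # A tall frame bar opens and closes the barcode.
    return "1" + "".join(POSTNET_DIGITS[d] for d in payload) + "1"


print(encode_postnet("89154"))       # five-digit ZIP Code
print(encode_postnet("12345-6789"))  # hypothetical ZIP+4 code
```

The 11-digit case corresponds to a ZIP+4 code extended with a two-digit delivery point. A sorter that reads the bars back can recompute the digit sum to detect a misread digit.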
3.2 Handling Process
Letters are smaller and flatter pieces of mail such as postcards, bills, and standard letters. The current system in use is the Advanced Facer Canceller System 200 (AFCS 200). The process of receiving letters starts with culling (filtering out the mail that cannot be handled by the machines). The machine then photographs each letter with a high-speed camera and uses OCR technology to read the handwritten or printed text on the letter, matching the delivery address against a database of known addresses. Once the information is verified, along with the postage, the machine sprays a unique letter ID tag onto the letter and sends it off to be sorted by delivery route. Letters are hand-fed into the Delivery Barcode Sorter, which sorts them into delivery point sequence based on delivery routes. The Delivery Barcode Sorter can sort 36,000 letters per hour [11].
Flats are large envelopes, catalogs, magazines, and newspapers. Since flats arrive in bundles, they must first be separated and are then sent to the Flats Sequencing System. This heavy-duty system is the size of a football field [11]. It takes care of feeding the bins of flats into the scanning system. The scanning system uses the same OCR technology as for letters: information is read off the labels and compared to known addresses in a database. This sorting information is passed on to the robotic sorting subsystem, and the flats are then sorted into crates by delivery route [9].
Packages are irregularly shaped and much larger pieces of mail. They are processed through an Automated Package Processing System that consists of conveyor belts and various machines. This system has a scanning and imaging tunnel. At this stage, the tracking information is also updated. After being scanned and processed, the packages travel along another conveyor belt, where they are pushed into bins based on their destination [9]. Figure 3 shows the mail receiving process.
Fig. 3. Mail receiving process [9].
4 Global Postal Automation
Automation plays an important role in redefining how a society progresses. Technology has had major breakthroughs since the 16th century but, depending on what one considers “technology”, its history goes back further. As time moves forward and technology along with it, society must keep up, and adopting automation is considerably beneficial to this cause. For the purpose of examining postal automation on a global scale, this paper covers three Asian countries: India, China, and Japan. For India and China, the main focus is on character recognition for envelopes, as it concerns their letter sorting machines. For Japan, a new advancement in postal delivery is discussed.
4.1 Automation in India
Approximately 20 languages are used in India, and Bangla is the second most popular. However, despite its prevalent usage, character recognition for Bangla is not as accurate as desired. The bigger issue lies with handwritten Bangla characters, since there are fewer variations in the typed characters. In order to optimize the efficiency of letter sorting machines, the character recognition of handwritten Bangla numerals must be addressed and refined [35]. A very important aspect of character recognition is feature extraction. The main goal is to obtain the postcode numerals from an image of a letter envelope. Before the information is processed, the numerals must first be extracted, and in order to do this, the area that contains the postcode must be located. In this case, there are two different areas in which to find the postcode: (1) in the postcode frames included on the envelopes, or (2) in the destination address block. Figure 4 shows an image of a Bangladesh letter, and it can be seen that the postcode is located in the lower right-hand corner of the envelope.
Fig. 4. An image of a Bangladesh letter with handwritten addressing [35]
The authors, Ying Wen, Yue Lu, and Pengfei Shi, consider two approaches proposed for feature extraction of the Bangla characters. The first approach is manual extraction, where data pertinent to the specific problem are identified and analyzed in order to determine the features to focus on. The second approach is “automated feature learning” [35], which is achievable with a large data set. This approach includes techniques such as Principal Component Analysis (PCA) and Support Vector Machine (SVM). Principal component analysis, or PCA, is concerned with dimensionality reduction. Since there is a large sample size involved, the goal is to simplify the set and make it more manageable. This technique comprises the following steps [33]:
1. Standardization of data: This means that the data set will have a mean of 0 and variance of 1.
2. Computation of the covariance matrix: Note that a covariance matrix gives a measure of variability between each pair of variables.
3. Computation of eigenvectors and eigenvalues from the previously computed covariance matrix.
4. Sorting of the eigenvectors by their corresponding eigenvalues in descending order.
5. Construction of a projection matrix.
6. Transformation of the original set to the reduced set using the projection matrix.
A support vector machine is a model that is used for the classification of data and the detection of outliers in complex data sets. It distinguishes data considered as ‘noise’ from data that is useful for the specific problem. In Handwritten Bangla numeral recognition system and its application to postal automation, the authors state that “the appeal of SVM lies in their strong connection to the underlying statistical learning theory. According to the structural risk minimization principle, a function that can classify training data accurately and which belongs to a set of functions with the lowest capacity . . . will generalize best, regardless of the dimensionality of the input space” [35].
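The six PCA steps listed above can be sketched in a few lines of NumPy. This is a generic illustration of the technique rather than the feature extractor used in [35]; the function name `pca_reduce`, its parameters, and the synthetic data in the usage example are assumptions made for this sketch.

```python
import numpy as np


def pca_reduce(samples, n_components):
    """Project `samples` (rows = observations) onto the top principal components."""
    X = np.asarray(samples, dtype=float)

    # 1. Standardize each feature to mean 0 and variance 1.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    Z = (X - X.mean(axis=0)) / std

    # 2. Covariance matrix between every pair of features.
    cov = np.cov(Z, rowvar=False)

    # 3. Eigen-decomposition (eigh is suited to symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(cov)

    # 4. Sort eigenvectors by eigenvalue, largest first.
    order = np.argsort(eigenvalues)[::-1]

    # 5. Projection matrix built from the leading eigenvectors.
    projection = eigenvectors[:, order[:n_components]]

    # 6. Transform the standardized data to the reduced space.
    return Z @ projection, eigenvalues[order]


# Usage with synthetic data standing in for numeral feature vectors.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
reduced, variances = pca_reduce(features, n_components=2)
print(reduced.shape)  # (200, 2)
```

In a recognition pipeline, the reduced vectors would then be passed to a classifier such as an SVM, which is the kind of integration the authors of [35] evaluate.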
The Bangla numerals in Fig. 5 are sampled for extraction and normalized for further processing. Once extracted, PCA and SVM are applied for feature recognition of the numerals. In the previously mentioned article [35], the authors considered 16,000 handwritten numerals, which were acquired by a letter sorting machine. From the total set, 6,000 elements were used as the training set, while the other 10,000 were used as the testing set. To determine the most favorable performance for the letter sorting machines, two factors were measured: (1) recognition reliability, and (2) response time. In summary, the most reliable results were attained by integrating PCA and SVM for feature extraction, as shown in Figs. 6 and 7.
4.2 Automation in China
In China, the amount of mail being processed is rising, and with it the demand for a more effective postal service system.
Fig. 5. An image of various versions of handwritten Bangla numerals from 0 to 9 [35]
Fig. 6. Recognition results using five different approaches [35]
The manual sorting of mail is not a favorable process, as it is directly accompanied by high labor costs. Therefore, letter sorting machines are widely used for efficient processing of mail. According to Application of Pattern Recognition Technology to Postal Automation in China, “over 100 letter sorting machines have been deployed around China since the 1990s” [21]. There are several manufacturers that produce the sorting machines, but the Shanghai Research Institute of Postal Science (SRI) supplies the most, specifically about 70% of all letter sorting machines in China. A diagram of an automatic letter sorting machine is shown in Fig. 8, and it consists of four main modules:
1. Mail Feeder: letters are fed to the sorting machine one by one and taken to the image scanner by the transport belt.
2. Image Scanner: the image of the letter is captured as it passes the camera. The image scanner is connected to the Postcode and Address Recognition Unit, to which the image of the letter will be sent.
3. Mail Stacker: the sorted letters are stacked according to the real-time control system.
Fig. 7. Recognition results using the integrated system [35]
Fig. 8. Diagram of the letter sorting machine [21]
4. Real-Time Control System: this receives information from the Postcode and Address Recognition Unit, then directs each piece of mail to its corresponding mail stack.
There are three important pieces of information on the envelopes: (1) postcode frames containing the postcode; (2) the postcode located in the address block for the destination address; and (3) the destination address block. Once the image of the letter envelope has been captured, it is sent to the computer, which analyzes the information on the envelope. In order to obtain the information, the image has to be segmented, processed for character recognition, and interpreted based on the two prior steps [21]. In image segmentation, the items of interest are separated from the rest of the image by taking out the specific areas that contain them. Postcode character extraction can be relatively easy so long as the characters are written within the boundaries of the postcode frames. An issue arises when the characters overlap the borders of the postcode frames. The destination address is the trickier part of the equation: this block has to be localized in order to interpret the information that is pertinent to the mail sorting process. The most commonly used technique for extraction of specific objects in an image is binarization. In essence, binarization reduces each pixel of an image to one of two values, producing a binary image such as the one in Fig. 9 [21]. Once a binary image has been obtained, the next step in the process is to extract the connected components of the image. That is, based on a chosen characteristic, pixels that share that trait are grouped together as subsets. These connected components are then bounded by rectangles for further analysis. The connected components of the binary image are shown in Fig. 10 [21].
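The binarization and connected-component steps described above can be illustrated with a short pure-Python sketch. It assumes a fixed global threshold and 4-connectivity (pixels touching horizontally or vertically), which are simplifications chosen for this example; the actual thresholding and grouping criteria used in [21] may differ, and the function names are hypothetical.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Mark dark (ink) pixels as 1 and light (paper) pixels as 0."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink regions."""
    rows = len(binary)
    cols = len(binary[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects every ink pixel reachable from (r, c).
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Tiny grayscale "envelope": 255 is paper, small values are ink.
gray = [
    [255, 255, 255, 255, 255, 255],
    [255,  10,  20, 255, 255, 255],
    [255,  15, 255, 255,  30, 255],
    [255, 255, 255, 255,  25, 255],
    [255, 255, 255, 255, 255, 255],
]
print(connected_components(binarize(gray)))
```

Each returned rectangle plays the role of the bounding boxes mentioned above. A later stage could discard boxes that are too small or too large to be characters before clustering the rest into an address block.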
Fig. 9. A binary image of a letter envelope [21]
Fig. 10. An image of the connected components in the previous binary image [21]
Each component must be categorized as noise, graph, or character, and any component recognized as noise or graph is removed. The remaining components are the characters of interest. Lastly, a clustering algorithm is used to localize the address block for the destination address, as shown in Fig. 11. This completes the extraction process [21].
Fig. 11. The localized address block after the connected components have been processed [21]
With regard to the classification of the numeral characters obtained, there are several methods outlined in the article by Lu et al. [21]. It is important to note that these methods may be combined to maximize performance improvement. Some methods include:
– Havnet: a neural network for pattern recognition of two-dimensional binary images.
– Threshold-Modified Bayesian Classifier (TMBC): numeral recognition by finding the mean vectors from the covariance matrix.
– Support Vector Machines (SVM): as previously described, SVMs classify the data and detect outliers in complex data sets.
– Multiple-layer Perceptron (MLP): “one feed forward neural network is trained as a numeral classifier, and the back propagation algorithm is applied during the training procedure” [21].
– Tree Classifier Based on Topological Features (TCTF): this classifier is based on the figuration of the numerals and uses that information to classify them.
For Chinese characters, the character set is very large. Because of this, character recognition is divided into a rough classification stage and a fine classification stage. Rough classification is a rapid selection of a small number of candidates from the initially large set. Typically, the selection is based on common characteristics between the characters. Fine classification then employs a function to classify the samples within that smaller set. Address interpretation follows character recognition, and the mail is sorted once the address has been interpreted. Though this seems efficient enough, research continues in order to keep optimizing the postal service system.
In a more recent development in Chinese postal automation, around 2,000 delivery vans were employed by China to deliver parcels. They are being tested in China and in Detroit, U.S. These all-electric vans, called Quadrobots, are “designed for urban environments and neighborhoods to provide ‘last-mile’ delivery service” [1]. Last-mile delivery pertains to the delivery of items from the hub to the customer while keeping delivery time and company costs at a minimum [1]. A key detail to point out is that these vans are not operating alone. Though they are autonomous, they are still operated by a driver for data gathering purposes. The vans are designed to “autonomously trail its operator down a street or around a parking lot” while the operator delivers packages [1]. Figure 12 shows the cargo space inside of the Quadrobot.
Fig. 12. An image of the cargo space of the Quadrobot [1]
4.3 Automation in Japan
Japan has one of the best-rated postal services in the world due to its efficiency and organization [15]. A notable machine Japan started using in the late 1990s was the Toshiba TT-2000, a flat and letter sorting machine. Unlike the U.S. postal system, which has separate sets of machines for letters and flats, this machine takes in flats, letters, and other mixed mail in many sizes. The TT-2000 uses OCR capability at the carrier sequence level to scan and sort mail based on destination. It is equipped with over 900 sorting programs and offers many functional operations, including forward address printing, canceling, weighing, and assisted tray handling. The Toshiba TT-2000 has a throughput of about 45,000 letters per hour and 25,000 flats per hour [5]. The Culler Facer Canceller (TSC-1000) filters out letters and postcards depending on their dimensions. The TSC-1000 can process 30,000 pieces per hour [5]. The detection system on this machine uses image capture via OCR and phosphorescence to detect stamps [5]. The Parcel Sorter is a machine with a conveyor belt system that sorts small packages by destination. The Barcode Reader (TU-G22) is a barcode recognition system that can process more than 52,000 pieces per hour [5]. Japan Post Co. had its delivery robot inspected in October 2020. The Government of Japan is hoping to implement automated delivery services to decrease the risk of infection due to COVID-19 and to alleviate labor shortages. The delivery robot uses AI and sensors to operate autonomously. About the size of a standard wheelchair [3], the delivery robot can maneuver around obstacles and stop at red lights. For an image of the robot, see Fig. 13.
Fig. 13. The Japan Post Co. delivery robot, Japan’s solution for reducing physical contact and addressing labor shortages [3].
5 Future of Postal Automation
In 2001, it was estimated that about 668 million pieces of mail were being handled daily [20]. Since then, the volume has increased, and it is projected to continue increasing. AP News, citing the Parcel and Postal Automation Systems Market report, stated that the global parcel and postal automation systems market is projected to grow from $2.830 billion in 2018 to $4.4971 billion in 2025. “The growth of this market will be driven by factors such as the growth ecommerce industry, increasing labor costs, and rising need for automated sorting and delivery processes in the postal industry” [10]. Over the last four decades, developments such as the ZIP+4 code, OCR, Bar Code Scanners (BCSs), and letter sorting machines have been very successful in increasing efficiency and reducing labor costs. However, the question of the future of postal automation is of equal or greater importance and must be addressed.
The long-term goal is the “continuous flow of the mail stream” [20]; that is, automating the end-to-end process, from sorting to delivery. Eliminating material handling would shift the system from batch processing to a continuous process, which would improve efficiency. Between the current state and that goal, there is plenty of room for further advances in automation. For example, sorting machines could be further optimized to process larger amounts of mail by redesigning or updating equipment and machinery to handle the volume more efficiently. New technologies and software are being released more frequently and could have promising applications in the postal industry.
In addition to flats and letters, parcel traffic is starting to increase. One of the main contributing factors to this rise is e-commerce growth. Figure 14 shows that by 2025, parcels are projected to reach 1:1 parity with regular mail [14]. Projections like these are major motivating factors for further innovation in postal automation.
That being said, the growth of the e-commerce industry is going to directly affect the continuing efforts to augment and improve postal automation. Though there will be challenges, setbacks, or resistance to change, there will be stronger demand for better functioning mail delivery systems that have increased throughput and quality, while having decreased human errors and labor costs. Figure 15 shows employment projections for postal workers, clearly predicting a decrease in employment of postal service clerks and other postal service occupations.
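As a quick check on the market projection quoted in this section, the following minimal sketch computes the compound annual growth rate implied by the two figures. It assumes that both values are in billions of US dollars and that growth compounds annually over the seven years from 2018 to 2025; the function and variable names are illustrative only.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values over `years` years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Market sizes in billions of USD, as quoted in the text.
market_2018 = 2.830
market_2025 = 4.4971

# Assumption: seven compounding periods, 2018 to 2025.
rate = cagr(market_2018, market_2025, years=2025 - 2018)
print(f"Implied CAGR 2018-2025: {rate:.1%}")
```

Under these assumptions the implied rate is roughly 6.8% per year, which agrees with the growth rate stated in the title of [10].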
6 The COVID-19 Pandemic
The SARS-CoV-2/COVID-19 pandemic severely disrupted the postal services market in 2020 and caused a downturn in the economy [13]. Government-imposed lockdowns resulted in major restrictions on local and international transportation, which affected trade and the delivery of medical supplies (medicines, equipment), important correspondence (financial institutions, welfare benefits, unemployment insurance), and human transportation (business, tourism), among other areas.
Fig. 14. Projected parcel growth by 2025 [14].
Fig. 15. Projections on employment for various occupations in the USPS for 2019–2029 [4].
In general, the postal service market experienced a decline in 2020, but recovery is expected in 2021 [2]. Figure 16 outlines 2019 and 2020 data from the United States Postal Service. In 2020, the overall number of career employees decreased, as did mail volume. However, the annual operating revenue of the USPS increased by about $2 billion in 2020. This may be due to the increase in shipping and package volume, as well as in the total number of delivery points. E-commerce has played a major role in the shipment of packages and parcels due to lockdowns and social distancing requirements. Amid the restrictions, consumers turned to the postal service’s website and mobile application. Address changes also increased in 2020; according to the USPS, there were upwards of 20 million online change-of-address requests in 2020 [6]. In addition, new vehicles for the postal service market are intended to shift towards automated and electric vehicles [2].
Fig. 16. United States postal service facts and figures [6].
7 Conclusion
Before the automation period, there was a heavy dependence on tedious manual labor when handling mail, including processing, sorting, and delivery; however, over the course of seven decades, the postal service industry has made great strides in automating services. Recall in the discussion of the new-field theory application to postal automation, there could be as many as five postal clerks who handle a piece of mail before it gets to the postman for delivery. That does not include the additional sorting that the postman has to do in the order of his mail delivery route. There was clearly a demand for the development of a more efficient system with decreased labor costs. The development of letter sorting machines, OCR, HWAI technology, and robotic machinery, as well as other forms of automation, has allowed the postal system to keep pace with growing demand in the industry. Without such automation, many amenities would not be as accessible. The future of automation is bright; the effects of the COVID-19 pandemic have pushed the already existing trend of making product delivery contactless even further and towards becoming more automated. Automation nurtures creativity, but there is no question that postal automation will substantially shift the labor markets. The fear of technology eliminating the working class is not a new phenomenon. Some jobs will be fully automated, but new jobs tend to replace them. Though the areas of focus in the labor market change periodically, there will always be jobs available. However, these new jobs may have a learning curve, as some workers will need additional training to transition into these new positions. Historically, refusal to adapt to new occupations stagnates an industry and creates inefficiencies that ultimately result in loss of competitiveness in a supply and demand market.
References
1. China post to roll out 2,000 robotic mail delivery trucks. Post & Parcel News. https://postandparcel.info/102949/news/. Accessed 24 Mar 2021
2. Global postal services market report opportunities and strategies. The Business Research Company. https://www.thebusinessresearchcompany.com/report/postal-services-market. Accessed 24 Mar 2021
3. Mail delivery robot makes test run on Tokyo road amid pandemic. Kyodo News. https://english.kyodonews.net/news/2020/10/fd361f7eae4b-mail-deliveryrobot-makes-test-run-on-tokyo-road-amid-pandemic.html. Accessed 24 Mar 2021
4. Occupational outlook handbook, postal service workers. Bureau of Labor Statistics, U.S. Department of Labor. https://www.bls.gov/ooh/office-and-administrativesupport/postal-service-workers.htm. Accessed 24 Mar 2021
5. Ocr letter sorting machine (lsm). Toshiba Infrastructure Systems & Solutions Corporation. https://www.toshiba.co.jp/infrastructure/en/security-automation/solution-product/postal-logistics/ocr-letter-sorting-machine.htm. Accessed 24 Mar 2021
6. Postal facts. United States Postal Service. https://facts.usps.com/. Accessed 24 Mar 2021
7. Postal research at cedar. https://cedar.buffalo.edu/~srihari/Postal-Research.html. Accessed 24 Mar 2021
8. Postal technology. Encyclopedia Britannica. https://www.britannica.com/topic/postal-system/Postal-technology. Accessed 24 Mar 2021
9. Systems at work. USPS TV, Youtube URL. https://www.youtube.com/watch?v=WX16-52bHvg. Accessed 24 Mar 2021
10. Parcel and postal automation systems market is projected to reach usd 4,497.1 million, growing at a cagr of 6.8% by 2025. Meticulous Research for AP News (2020). https://apnews.com/press-release/pr-wiredrelease/17acd911118a009fe8f29d559428d82f. Accessed 24 Mar 2021
11. The United States Postal Service: An American History. Government Relations, United States Postal Service, Washington D.C. (2020). Accessed 24 Mar 2021
12. Asher, J.J., Post, R.I.: The new field theory: an application to postal automation. Human Fact. 6(5), 517–522 (1964)
13. Barua, S.: Understanding coronanomics: The economic implications of the coronavirus (covid-19) pandemic (2020)
14. Briest, P., Dragendorf, J., Ecker, T., Neuhaus, F.: The endgame for postal networks: How to win in the age of e-commerce. McKinsey & Company. https://www.mckinsey.com/industries/travel-logistics-and-transport-infrastructure/ourinsights/the-endgame-for-postal-networks-how-to-win-in-the-age-of-e-commerce. Accessed 24 Mar 2021
15. Brix, A.C.: Postal system. Encyclopedia Britannica. https://www.britannica.com/topic/postal-system. Accessed 24 Mar 2021
16. Cacho, J.R.F.: Improving ocr post processing with machine learning tools. University of Nevada, Las Vegas, Phd diss. (2019)
17. Cacho, J.R.F., Cisneros, B., Taghva, K.: Building a wikipedia n-gram corpus. In: Proceedings of SAI Intelligent Systems Conference, pp. 277–294. Springer (2020)
18. Cacho, J.R.F., Taghva, K.: Ocr post processing using support vector machines. In: Science and Information Conference, pp. 694–713. Springer (2020)
19. Cacho, J.R.F., Taghva, K., Alvarez, D.: Using the google web 1t 5-gram corpus for OCR error correction. In: 16th International Conference on Information Technology-New Generations (ITNG 2019), pp. 505–511. Springer (2019)
20. Knill, B.: Postal automation delivers. Material Handling Management 56(12), 57–59 (2001)
21. Lu, Y., Tu, X., Lu, S., Wang, P.S.P.: Application of pattern recognition technology to postal automation in China. Pattern Recognition and Machine Vision (2010)
22. Mithe, R., Indalkar, S., Divekar, N.: Optical character recognition. Int. J. Recent Technol. Eng. (IJRTE) 2(1), 72–75 (2013)
23. Srihari, S.N.: Landmarks in postal research at cedar. https://cedar.buffalo.edu/~srihari/PostalResearch.pdf. Accessed 24 Mar 2021
24. Srihari, S.N., Cohen, E., Hull, J.J., Kuan, L.: A system to locate and recognize zip codes in handwritten addresses. IJRE 1, 37–45 (1989)
25. Srihari, S.N., Kuebert, E.J.: Integration of hand-written address interpretation technology into the united states postal service remote computer reader system. In: Proceedings of the Fourth International Conference on Document Analysis and Recognition, vol. 2, pp. 892–896. IEEE (1997)
26. Taghva, K., Beckley, R., Coombs, J.: The effects of OCR error on the extraction of private information. In: International Workshop on Document Analysis Systems, pp. 348–357. Springer (2006)
27. Taghva, K., Borsack, J., Condit, A.: Results of applying probabilistic IR to OCR text. In: SIGIR 1994, pp. 202–211. Springer (1994)
28. Taghva, K., Borsack, J., Condit, A.: Effects of OCR errors on ranking and feedback using the vector space model. Inf. Process. Manag. 32(3), 317–327 (1996)
29. Taghva, K., Borsack, J., Condit, A.: Evaluation of model-based retrieval effectiveness with OCR text. ACM Trans. Inf. Syst. (TOIS) 14(1), 64–93 (1996)
30. Taghva, K., Borsack, J., Condit, A., Erva, S.: The effects of noisy data on text retrieval. J. Am. Soc. Inf. Sci. 45(1), 50–58 (1994)
31. Taghva, K., Nartker, T., Borsack, J.: Information access in the presence of OCR errors. In: Proceedings of the 1st ACM Workshop on Hardcopy Document Processing, pp. 1–8. ACM (2004)
32. Taghva, K., Nartker, T.A., Borsack, J., Condit, A.: Unlv-isri document collection for research in OCR and information retrieval. In: Document recognition and retrieval VII, vol. 3967, pp. 157–164. International Society for Optics and Photonics (1999)
33. Tripathi, A.: A complete guide to principal component analysis – pca in machine learning. towards data science. https://towardsdatascience.com/a-complete-guideto-principal-component-analysis-pca-in-machine-learning-664f34fc3e5a. Accessed 24 Mar 2021
34. Ulvila, J.W.: Postal automation (zip+ 4) technology: a decision analysis. Interfaces 17(2), 1–12 (1987)
35. Wen, Y., Yue, Lu., Shi, P.: Handwritten bangla numeral recognition system and its application to postal automation. Pattern Recogn. 40(1), 99–107 (2007)
Automated Corpus Annotation for Cybersecurity Named Entity Recognition with Small Keyword Dictionary
Kazuaki Kashihara1(B), Harshdeep Singh Sandhu2, and Jana Shakarian2
1 Arizona State University, Tempe, AZ 85281, USA [email protected]
2 Cyber Reconnaissance, Inc., Tempe, AZ 85281, USA [email protected], [email protected]
Abstract. To assist security analysts in obtaining information pertaining to cybersecurity, tools tailored to the security domain are needed. Since labeled text data is scarce and expensive, Named Entity Recognition (NER) is used to detect the relevant domain entities in raw text. To train a new NER model for cybersecurity, traditional NER requires a training corpus annotated with cybersecurity entities. Our previous work proposed a Human-Machine Interaction method for semi-automatic labeling and corpus generation for cybersecurity entities. This method requires only text data and a small dictionary of pairs of keywords and their categories. However, the semantic similarity measurement that the method uses to resolve ambiguous keywords requires specific category names even for categories unrelated to cybersecurity. In this work, we introduce another semantic similarity measurement, based on a text category classifier, which does not require specific names for the non-cybersecurity categories. We compare the performance of the two semantic similarity measurements and find that the new measurement performs better. The experimental evaluation shows that our method, trained on data annotated with a small dictionary, achieves almost the same performance as models trained on fully annotated data.
Keywords: Cybersecurity · Named Entity Recognition · NER · Semantic similarity
1 Introduction
In many cybersecurity applications, Named Entity Recognition (NER) has been used to identify entities of interest, such as the names and versions of vulnerable software, of vulnerable components, and of the underlying software systems that the vulnerable software depends upon [6,21]. A NER model pinpoints entities based on the structure and semantics of the input text, and can track down entities that were never observed in the training data. In those works, the training data for the NER model is generally created by manual annotation.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 155–174, 2022. https://doi.org/10.1007/978-3-030-82199-9_11
To minimize manual annotation effort, an automated labeling method [3], feature engineering methods [4,17], deep learning (DL) methods [3,19,25], and a transfer learning method [6] have been introduced. The automated labeling method uses database matching, heuristic rules, and a relevant-terms gazetteer. The methods based on feature engineering, DL, or transfer learning all require an annotated training dataset to train the model. In addition, some terms in this domain have a different meaning in general English and may not be an entity. For instance, “Wine” can be the name of a piece of software or a drink. The automated labeling method cannot label ambiguous keywords such as “Wine” correctly.
In this paper, we address this problem of the automated labeling method by taking a different approach. We introduce a new semantic similarity measurement that helps to determine the suitable category of an ambiguous keyword. Our method requires raw text data and a small dictionary of pairs of keywords and their categories, and then automatically generates the training data for an NER model.
The current NER tools that show state-of-the-art performance in the cybersecurity field are based on feature engineering or deep learning. They require ample training data, which is generally unavailable for specialized applications such as detecting cybersecurity-related entities. The major issues with feature engineering are that it relies heavily on the experience of the person doing it, that it involves a lengthy trial-and-error process, and that it relies on look-ups or dictionaries to identify known entities [4,17]. These dictionaries are hard to build and harder to maintain, especially in highly dynamic fields such as cybersecurity. For instance, a Common Vulnerabilities and Exposures (CVE) ID is easily extracted by the regular expression “CVE-\d{4}-\d{4,7}”.
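The CVE ID pattern above can be tried directly. The following is a minimal sketch using Python's standard `re` module; the surrounding `\b` word boundaries and the helper name `find_cve_ids` are our own additions for illustration and are not part of the pattern given in the text.

```python
import re
from typing import List

# Pattern quoted in the text, wrapped in word boundaries (an added assumption)
# so that it does not match inside longer alphanumeric strings.
CVE_PATTERN = re.compile(r"\bCVE-\d{4}-\d{4,7}\b")


def find_cve_ids(text: str) -> List[str]:
    """Return every CVE ID found in `text`, in order of appearance."""
    return CVE_PATTERN.findall(text)


sample = (
    "Microsoft has released a security update to address an elevation of "
    "privilege vulnerability (CVE-2019-1162) in windows"
)
cve_ids = find_cve_ids(sample)
print(cve_ids)
```

A pattern of this kind works because CVE IDs have a fixed surface form; the same approach does not carry over to software names or versions, which have no such regular structure.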
However, software names, filenames, version information, and OS names are unique names that are hard to identify through pattern matching, so they require annotation by human experts. These activities take up the majority of the time needed to construct such NER tools. In addition, the tools are domain specific and do not achieve good accuracy when applied to other domains. Moreover, requiring the features to be available for both the training and the test data not only slows down the annotation process but also diminishes the quality of the results.
Our previous work [12] introduced a semantic similarity measurement and generated a new NER corpus for cybersecurity entities through human-machine interaction. An NER model trained on this corpus performs better than existing methods at finding undiscovered keywords of the given categories. However, for an ambiguous keyword, that semantic similarity measurement needs not only the cybersecurity-related category but also a specific category name for each of its other meanings, for example “software” and “drink” for “wine”. In this paper, we introduce a new semantic similarity measurement that determines which category a word belongs to based on the semantics of the entire sentence. This measurement does not require a specific category name for the non-cybersecurity categories, which simplifies the preprocessing for the semantic similarity measurement. We apply this measurement to our previous method. The learning part of the method requires only the list of pairs of cybersecurity entities and their categories. The method generates a high-quality training dataset from a small number of keywords of the target categories in the cybersecurity field. The evaluation on two cybersecurity NER corpora shows that our approach, with the new semantic similarity measurement and a small dictionary, achieves almost the same performance as training on the manually annotated datasets.
The main contributions of this paper include:
– We present a bootstrapping method to train an NER system for cybersecurity domain entities with a small initial dictionary.
– We introduce a new semantic similarity measurement for resolving ambiguous entities. The measurement helps to determine which category an ambiguous entity should belong to.
– We perform an empirical evaluation. The results show that our approach, with a small keyword coverage in each category, achieves performance close to that of other DL methods trained on fully annotated data.
The rest of the paper is organized as follows: Sect. 2 introduces several terms and applications related to NER, Sect. 3 proposes the new semantic similarity measurement, Sect. 4 presents the experimental evaluation, and Sect. 5 gives the analysis and discussion of the experimental evaluation.
2 Background
Various methods have been applied to extract entities and their relations in cybersecurity-related domains. For example, Jones et al. [11] implemented a bootstrapping algorithm that requires little input data to extract security entities and the relationships between them from text. An SVM classifier has been used by Mulwad et al. [16] to separate cybersecurity vulnerability descriptions from non-relevant ones. These approaches require preprocessing or an annotated corpus. The automatic labeling method for cybersecurity [3] uses Database Matching (string pattern matching), Heuristic Rules (rule-based matching), and a Relevant Terms Gazetteer (extended string matching: if a phrase contains a keyword in the database, the phrase is annotated with that keyword’s label). However, some terms in the cybersecurity domain have different meanings in general English and may not be an entity. For instance, “Windows” and “Wine” are an OS name and an application name in the cybersecurity field, but they have different meanings in general English. The above methods cannot label such ambiguous keywords correctly. Sirotina and Loukachevich [23] provide a corpus covering 10 cybersecurity-related categories in Russian, manually annotated by human experts.
Recently, Deep Learning (DL) methods have been used for NER. DL builds on classical neural network models and naturally learns non-linear combinations of features, whereas, for instance, Conditional Random Fields (CRFs) can learn only linear combinations of the defined features. This reduces the tedious human work of feature engineering [3,19,25]. The recent work by Gasmi et al. [7] relies on Long Short-Term Memory (LSTM) and CRFs for cybersecurity NER, applying the LSTM-CRF architecture suggested by Lample et al. [13]. The architecture combines LSTM, word2vec [15] models, and CRFs. The input for this method is an annotated corpus in the same format as the CoNLL-2000 dataset [20]. Recently, many applications of DL have been leveraged in the field of cybersecurity [18,26,27]. However, all of the above methods based on feature engineering or DL require an annotated training dataset to train the model. This poses two challenges. First, a certain number of annotated sentences is required to obtain a model with decent performance. Second, the sentences are annotated by experts (humans in most cases), who can make incorrect annotations or fail to annotate some words or phrases.
In our previous work [12], we introduced a human-machine interaction framework for semi-automatic labeling and corpus generation for cybersecurity entities. The framework has a Training module and an Evaluation module. The Training module collects sentences, annotates the keywords from the given dictionary, and passes the generated corpus to the NER system to train the model. We introduced a semantic similarity measurement named SentCat to choose the suitable category for ambiguous keywords that can be annotated with multiple categories. This measurement requires all of the candidate categories of an ambiguous keyword, since it calculates the similarity of the sentence to each category and selects the category with the highest similarity score. Thus, there is room to improve this semantic similarity measurement.
2.1 BERT
BERT (Bidirectional Encoder Representations from Transformers) [5] involves two steps: pre-training on a large raw corpus, and fine-tuning the model for each task. BERT is based on the Transformer [24], which can capture long-distance dependencies because it relies on self-attention rather than on an RNN or CNN. The input to BERT is a sentence, a pair of sentences, or a document, represented in each case as a sequence of tokens. Each token is represented by the sum of its token embedding, segment embedding, and position embedding. Each word is divided into subwords, and every subword other than the first is prefixed with “##”. For instance, “playing” is divided into the subwords “play” and “##ing”. If the input consists of two sentences, the tokens of the first sentence receive the sentence A segment embedding and the tokens of the second sentence receive the sentence B segment embedding, with a “[SEP]” token placed between the two sentences. In addition, the location of each token is learned as its position embedding. The head of each input sequence is marked with the “[CLS]” token. In document classification or sentence-pair classification tasks, the final-layer embedding of this token is the representation of the document or sentence pair.
Fig. 1. Architecture of the proposed method
For text classification tasks, BERT takes the final hidden state h of the first token [CLS] as the representation of the whole sequence. A simple softmax classifier is added on top of BERT to predict the probability of label c: p(c|h) = softmax(Wh), where W is the task-specific parameter matrix. We fine-tune all the parameters of BERT as well as W jointly by maximizing the log-probability of the correct label.
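To make the classification head concrete, the following minimal sketch evaluates p(c|h) = softmax(Wh) and the negative log-probability of the correct label in plain Python. The vector h, the matrix W, and the two label names are toy values chosen for illustration; in BERT, h would be the final hidden state of [CLS] and W would be learned during fine-tuning.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h: List[float], W: List[List[float]]) -> List[float]:
    """Return p(c|h) = softmax(W h); each row of W corresponds to one label."""
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(logits)


def negative_log_likelihood(probs: List[float], gold: int) -> float:
    """Loss minimized in fine-tuning: minus the log-probability of the gold label."""
    return -math.log(probs[gold])


# Toy values (assumptions): a 3-dimensional [CLS] state and two labels.
labels = ["software", "non-software"]
h = [0.5, -1.0, 2.0]
W = [
    [1.0, 0.0, 0.5],   # row for "software"
    [-0.5, 1.0, 0.0],  # row for "non-software"
]

probs = classify(h, W)
loss = negative_log_likelihood(probs, gold=labels.index("software"))
print(dict(zip(labels, probs)), loss)
```

Maximizing the log-probability of the correct label is the same as minimizing this loss; during fine-tuning the gradient flows through W and through all BERT parameters that produce h.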
3 Proposed Method
In this section, we propose a new semantic similarity measurement for resolving the categories of ambiguous keywords and apply it to the Human-Machine Interaction method of our previous work. The architecture of the Human-Machine Interaction method is shown in Fig. 1. The new semantic similarity measurement is applied in the Generate Training Dataset process to improve the annotation of ambiguous keywords.
3.1 Category Classification for Ambiguous Meaning Keywords
The meaning of many keywords changes with the context. For instance, consider “Microsoft has released a security update to address an elevation of privilege vulnerability (CVE-2019-1162) in windows” and “an inventory of the network analysis classes for which you can set time windows”. In the first sentence “windows” means the operating system, whereas in the second it means windows of time. To avoid mislabeling, we introduce a new approach: text category classification using BERT fine-tuning (the CategoryClassifier method).
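The mislabeling problem can be reproduced with a few lines of plain dictionary matching. The sketch below is a simplified stand-in for dictionary-based annotation, not the implementation used in this work; the function name, the one-entry dictionary, and the case-insensitive whole-word matching rule are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple


def annotate_with_dictionary(
    sentence: str, dictionary: Dict[str, str]
) -> List[Tuple[int, int, str]]:
    """Return (start, end, category) for every whole-word dictionary match."""
    spans = []
    for keyword, category in dictionary.items():
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(sentence):
            spans.append((match.start(), match.end(), category))
    return sorted(spans)


# Hypothetical one-entry dictionary: keyword -> category.
keyword_dictionary = {"windows": "OS"}

os_sentence = (
    "Microsoft has released a security update to address an elevation of "
    "privilege vulnerability (CVE-2019-1162) in windows"
)
time_sentence = (
    "an inventory of the network analysis classes for which you can set "
    "time windows"
)

os_spans = annotate_with_dictionary(os_sentence, keyword_dictionary)
time_spans = annotate_with_dictionary(time_sentence, keyword_dictionary)
print(os_spans)
print(time_spans)  # also labeled "OS", although it refers to time windows
```

Both sentences receive an "OS" annotation, because string matching sees only the keyword and not its context. A sentence-level decision is needed before the annotation is kept.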
Algorithm 1. CategoryClassifier(sentence, categoryList)
Load the fine-tuned BERT text category classifier model classifier
category = classifier(sentence)
if category ∈ categoryList then
    finalCategory = category
else
    finalCategory = NONE
end if
return finalCategory
In the CategoryClassifier method, we build the text category classifier by fine-tuning BERT. The training data for this classifier consists of sentences that contain known ambiguous keywords, each paired with its category (which must be one of the categories of the ambiguous keyword). For instance, assume the ambiguous keyword “wine” has two categories, “software” and “non-software”. We label two example sentences as follows: “Only if you drink French wine, if it’s radiated Californian wine that makes you an alcoholic mutt.” is labeled “non-software”, whereas “Wine is not a virtual machine, just an api converter, it can also directly call Linux programs.” is labeled “software”. These sentences and their labels are given to BERT fine-tuning to build the text category classifier for the ambiguous keyword categories. The steps of CategoryClassifier are described in Algorithm 1.
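The control flow of Algorithm 1 can be sketched as follows. The classifier is passed in as a callable so that the sketch runs without a trained model; `toy_classifier` is a hypothetical keyword-cue stand-in for the fine-tuned BERT classifier and is not the model used in this work.

```python
from typing import Callable, Iterable, Optional


def category_classifier(
    sentence: str,
    category_list: Iterable[str],
    classifier: Callable[[str], str],
) -> Optional[str]:
    """Algorithm 1: keep the predicted category only if it is a target category."""
    category = classifier(sentence)
    if category in set(category_list):
        return category
    return None  # corresponds to NONE in Algorithm 1


def toy_classifier(sentence: str) -> str:
    """Hypothetical stand-in for the fine-tuned BERT text category classifier."""
    drink_cues = {"drink", "alcoholic", "french", "californian"}
    tokens = {token.strip(".,").lower() for token in sentence.split()}
    return "non-software" if tokens & drink_cues else "software"


drink_sentence = (
    "Only if you drink French wine, if it's radiated Californian wine "
    "that makes you an alcoholic mutt."
)
software_sentence = (
    "Wine is not a virtual machine, just an api converter, "
    "it can also directly call Linux programs."
)

# Only cybersecurity categories are targets, so "non-software" maps to None.
target_categories = ["software"]
drink_result = category_classifier(drink_sentence, target_categories, toy_classifier)
software_result = category_classifier(software_sentence, target_categories, toy_classifier)
print(drink_result, software_result)
```

Because any prediction outside the category list collapses to NONE, the caller only has to name the cybersecurity categories; the other meanings of a keyword never need their own category names at annotation time.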
4 Experimental Evaluation
4.1 Data
We evaluate our method on two corpora: the Auto-labeled Cyber Security domain text corpus (which we call Auto-labeled data) provided by Bridges et al. [3], comprising around 15 categories, and the Russian Sec_col collection (which we call Sec_col data) by Sirotina and Loukachevich [23], comprising 10 categories. We use spaCy for NER model training.
In Auto-labeled data, each word in the corpus is auto-annotated with an entity type. The corpus lists each word of a sentence on a separate line, so we join the words into sentences in order to feed the data into our method. The total number of sentences is 15,781. To evaluate our method, we convert the per-word entity tags into categories; for instance, we merge “buffer: B-Relevant Term” and “overflow: I-Relevant Term” into “buffer overflow: Relevant Term”. Table 1 shows the number of unique keywords per category in the dataset. We call the dictionary that contains these unique keywords of each category the unique full dictionary. The number of ambiguous keywords that have multiple categories is 153. The dataset is divided into training, validation, and testing subsets consisting of 70%, 10%, and 20% of the sentences, respectively.
Table 1. The statistics of unique keywords in the 15 categories of Auto-labeled data
Category                # of unique keywords
Application             4335
Relevant Term           193
Vendor                  605
Version                 7733
Update                  222
OS                      74
Function                1283
File                    2426
Hardware                275
Method                  107
CVE ID                  447
Parameter               270
Edition                 58
Programming Language    3
Language                2
Sec_col data consists of 855 texts (posts and forum publications, each containing multiple sentences) from the SecurityLab.ru website. Table 2 shows the number of unique keywords per category in the dataset. We again call the dictionary that contains these unique keywords of each category the unique full dictionary. The number of ambiguous keywords that have multiple categories is 224. We follow the evaluation procedure of [23] and perform 4-fold cross-validation.
Table 2. The statistics of unique keywords in the 10 categories of Sec_col data
Category        # of unique keywords
Org             1328
Loc Term        420
Person          781
Tech            1029
Program         1884
Device          318
Virus           328
Event           187
Hacker Group    35
Hacker          11
As preprocessing for CategoryClassifier, we fine-tuned the BERT model. For Auto-labeled data, we fine-tuned the model with the top 10% most frequent ambiguous keywords (15 of the 153 ambiguous keywords in the original corpus) and 1,819 sentences from the training dataset (70% of the original corpus) that contain at least one ambiguous keyword, with the ambiguous keyword’s category as the sentence label. These 1,819 sentences are divided into training, validation, and testing subsets consisting of 70%, 10%, and 20% of the sentences, respectively. Figure 2 shows the loss and performance curves for training and validation on the ambiguous Auto-labeled sentences. After 10 epochs, the accuracy of training and validation is as follows:
– Training: 98.2%
– Validation: 95.1%
We then compared the accuracy of CategoryClassifier and SentCat [12] on the testing data:
– SentCat: 82.8%
– CategoryClassifier: 88.4%
CategoryClassifier performs better than SentCat.
Fig. 2. The loss and performance graphs of CategoryClassifier’s training and validation with the ambiguous keyword sentences from Auto-labeled data.
Fig. 3. The loss and performance graphs of CategoryClassifier’s training and validation with the ambiguous keyword sentences from Sec_col data.
For Sec_col data, we fine-tuned the BERT model with all ambiguous keywords (224 keywords from the original corpus) and 2,425 sentences from the training dataset (70% of the original corpus) that contain at least one ambiguous keyword, with the ambiguous keyword’s category as the sentence label. These 2,425 sentences are divided into training, validation, and testing subsets consisting of 70%, 10%, and 20% of the sentences, respectively. Figure 3 shows the loss and performance curves for training and validation on the ambiguous Sec_col sentences. After 30 epochs, the accuracy of training and validation is as follows.
Automated Corpus Annotation
163
Fig. 4. The graph of the performance of our method with SentCat in the original test data.
– Training: 97.1%
– Validation: 63.2%
We then compared the accuracy of CategoryClassifier and SentCat on the testing data. The results are as follows:
– SentCat: 59.3%
– CategoryClassifier: 61.1%
CategoryClassifier again performs better than SentCat. To evaluate our method, we pick the most frequent X% of the original unique keywords of each category, where X is 10, 20, 30, 40, 50, 60, 70, 80, or 90, and we fix the number of ambiguous keywords at 10% of the original ambiguous keywords. In the evaluation step of each iteration, we evaluate the learned model on the validation dataset and add to the next iteration's dictionary the newly detected keywords that are not in the current dictionary but are listed in the full dictionary with the right category. We use pre-trained spaCy models: the English model “en_core_web_lg” for the Auto-labeled data, since all posts are written in English, and the multi-language model “xx_ent_wiki_sm” for the Sec col data, since the Russian posts contain not only Russian but also other languages, including English, and this is the only model that supports Russian.
4.2 Results
In Auto-labeled data, we iterated three times in the experimental evaluation of our method. First, we evaluate the learned models with SentCat and
Fig. 5. The graph of the performance of our method with CategoryClassifier (BERT) in the original test data.
CategoryClassifier (BERT) approaches for resolving the ambiguous keywords, using the original annotation. We compare with eight different approaches: LSTM-CRF, CRF [8], CNN-CRF, RNN-CRF, GRU-CRF, Bidirectional GRU-CRF, Bidirectional GRU+CNN-CRF [22], and spaCy. The results are shown in Fig. 4, Fig. 5, and Table 3. In both the SentCat and CategoryClassifier (BERT) cases, the 10% dictionary achieves the highest precision among dictionary sizes between 10% and 90%, and the 70% dictionary achieves the highest recall. The recall is higher than that of the CRF method with full annotation; however, the precision is not as high as we expected. When we checked the original annotation, we found several annotation issues. For instance, some categories such as “Version” include “(” and “)” as part of the keyword (phrase), as in “4.0 before 4.0(16)”; however, some cases are missing the “)”, such as “4.1 before 4.1(7”. spaCy does not accept these unbalanced cases as annotations, so our model learned only the balanced cases. In addition, the original annotation includes many unnecessary characters, such as commas, single quotes, and double quotes. Since the performance is calculated by exact matches, our trained models can detect part of an originally annotated entity, but such detections are not counted as correct. Moreover, many unique keywords are not annotated in the original annotation, and our models detect them. Since the accuracy of the original annotation is in doubt, we add annotations from the full unique dictionary wherever a sentence is missing an annotation in the original. We call this newly annotated test data the fully annotated test data of the Auto-labeled dataset, and we evaluate our learned models on it. The results are shown in Fig. 6, Fig. 7, and Table 4.
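The effect of exact-match scoring described above can be illustrated with a minimal sketch. The sentence, the character offsets, and the function name `exact_match_scores` are hypothetical and are not taken from the paper's evaluation code; the sketch only assumes that an entity counts as correct when its start, end, and category all agree with the gold annotation.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, category)


def exact_match_scores(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); only identical spans count as hits."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical sentence: the gold annotation drops the closing ")",
# while the model predicts the balanced phrase.
text = "Cisco ASA 4.1 before 4.1(7) allows remote attackers"
start = text.find("4.1 before")
gold = {
    (0, 9, "Application"),
    (start, start + len("4.1 before 4.1(7"), "Version"),
}
predicted = {
    (0, 9, "Application"),
    (start, start + len("4.1 before 4.1(7)"), "Version"),
}

scores = exact_match_scores(gold, predicted)
print(scores)
```

In this toy case the predicted “Version” span differs from the gold span by a single character, so it is counted both as a false positive and as a missed entity, which lowers precision and recall at the same time.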
Fig. 6. The graph of the performance of our method with SentCat in the fully annotated test data.
In both the SentCat and CategoryClassifier (BERT) cases, the 10% dictionary achieves the highest precision among dictionary sizes between 10% and 90%, and the 70% dictionary achieves the highest recall. The recall is higher than that of the CRF method with full annotation; however, the precision is not as high as we expected. Thus, our method can create a high-quality training corpus with a dictionary smaller than the full unique dictionary. For the Sec col data, we compare the following ten approaches:
(A) CRF
(B) BiDirectional LSTM
(C) BiDirectional LSTM with a CRF-classifier as an output layer
(D) BiDirectional LSTM with BiDirectional LSTM embeddings
(E) BiDirectional LSTM with BiDirectional LSTM embeddings and a CRF-classifier as an output layer
(F) BiDirectional LSTM with CNN embeddings
(G) BiDirectional LSTM with CNN embeddings and a CRF-classifier as an output layer
(H) spaCy (CNN)
(I) SentCat (number is % of the dictionary size)
(J) Our method with BERT (number is % of the dictionary size)
The core layer in (B)-(G) is a Bidirectional Long Short-Term Memory (BiLSTM) neural network (NN) [9,10]. Models (C), (E) and (G) use a CRF-classifier as an output layer [10,13,14]. Models (D)-(G) also have special layers that build character embeddings [13,14]. While models (D) and (E) use a BiLSTM layer to
Fig. 7. The graph of the performance of our method with CategoryClassifier (BERT) in the fully annotated test data.
build character embeddings, models (F) and (G) use a CNN layer for the same purpose. The results for (A)-(G) are taken from [23]. Table 5 shows the results. In both the SentCat and CategoryClassifier (BERT) cases, the 30% and 70% dictionaries perform better in many categories. Since the accuracy of the original annotation is also in doubt for the Sec col data, we add annotations from the full unique dictionary wherever a sentence is missing an annotation in the original. We call this newly annotated test data the fully annotated test data of the Sec col dataset, and we evaluate our learned models on it. Table 6 shows the results. The performance in each case is better than with the original annotation, since the original annotation missed some keywords and annotated others incorrectly. We obtained the best precision or recall in four out of ten categories: “Person”, “Location”, “Tech”, and “Virus”. Thus, our method can perform better in some cases even if the dictionary is smaller than the original.
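The construction of the fully annotated test data can be sketched as follows. This is only an illustration under stated assumptions: the function `add_missing_annotations`, the plain substring matching, the longest-keyword-first rule, and the example sentence are all hypothetical, since the text does not specify how the dictionary matches are merged with the original annotation.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, category)


def overlaps(a: Span, b: Span) -> bool:
    """True when two character spans share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def add_missing_annotations(text: str, existing: List[Span],
                            dictionary: Dict[str, str]) -> List[Span]:
    """Keep the existing spans and add dictionary matches that do not overlap them."""
    spans = list(existing)
    # Assumption: longer keywords are tried first, so that
    # "VLC Media Player" is preferred over the shorter "VLC".
    for keyword, category in sorted(dictionary.items(), key=lambda kv: -len(kv[0])):
        for match in re.finditer(re.escape(keyword), text):
            candidate = (match.start(), match.end(), category)
            if not any(overlaps(candidate, span) for span in spans):
                spans.append(candidate)
    return sorted(spans)


# Hypothetical sentence whose original annotation covers only "OpenSSL".
text = "OpenSSL and VLC Media Player are affected"
original = [(0, 7, "Application")]
full_dictionary = {
    "OpenSSL": "Application",
    "VLC": "Application",
    "VLC Media Player": "Application",
}

fully_annotated = add_missing_annotations(text, original, full_dictionary)
print(fully_annotated)
```

The sketch ignores word boundaries and case, which a real implementation would probably need to handle.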
5 Analysis and Discussion
5.1 Analysis: Auto-labeled Data
We checked the new keywords from the validation data for each model and noticed that many keywords that are easily identified as belonging to a specific category are not annotated in the original corpus [3]. For instance, some CVE IDs have an unusual form, such as CVE-2008-2565.1 and CVE-2009-4083.1, but the models learned from our annotated training data can detect
Table 3. Comparison of recent NER methods using the average weighted performance metrics. P, R and F1 represent precision, recall and F1 score, respectively.
Method                          P     R     F1
LSTM-CRF                        85.3  94.1  89.5
CRF                             82.4  83.3  82.8
CNN-CRF                         83.1  93.9  88.2
RNN-CRF                         83.5  85.6  84.5
GRU-CRF                         86.5  95.7  90.9
Bidirectional GRU-CRF           88.7  95.4  91.9
Bidirectional GRU+CNN-CRF       90.8  96.2  93.4
spaCy                           92.3  90.7  91.5
SentCat (10%, 1st)              62.0  75.9  68.2
SentCat (10%, 2nd)              62.1  77.1  68.8
SentCat (10%, 3rd)              62.1  77.3  68.9
SentCat (70%, 1st)              51.5  83.7  63.8
SentCat (70%, 2nd)              51.9  83.4  63.9
SentCat (70%, 3rd)              52.0  83.7  64.2
Our method: BERT (10%, 1st)     62.4  77.3  69.1
Our method: BERT (10%, 2nd)     62.4  77.5  69.2
Our method: BERT (10%, 3rd)     62.6  78.2  69.5
Our method: BERT (70%, 1st)     52.4  85.0  64.8
Our method: BERT (70%, 2nd)     52.3  84.2  64.5
Our method: BERT (70%, 3rd)     52.1  84.6  64.5
them with the correct “CVE ID” category. Since the original annotation uses a simple regular expression to detect CVE IDs, it missed these unusual cases. In addition, the “Programming Language” category originally has a single keyword, “JavaScript”, which is an ambiguous keyword that also belongs to the “Function” and “Method” categories. The learned models found “C++” and “C#” as “Programming Language”, although they are not annotated in the original dataset. Furthermore, some cyber-attack-related phrases (“Relevant term” category phrases), such as “Cross-application scripting” and “Cross-zone scripting”, are detected by our trained models but are likewise not annotated in the original dataset. Moreover, our models detect many spelling variants and extended versions of application names. For instance, “OpenSSL”, “VLC Media Player”, and “Enterprise Manager Grid Control” are in the “Application” category of the original dictionary, and our models trained with both the SentCat and CategoryClassifier training datasets can detect “openSSL”, “VLC”, “VLC 1.1.8”, and “Enterprise Manager Grid
Table 4. The average weighted performance metrics of our method with SentCat and CategoryClassifier (BERT) for all entity types on the test data fully annotated with the full dictionary. P, R and F1 represent precision, recall and F1 score, respectively.
Method                          P     R     F1
SentCat (10%, 1st)              90.3  70.3  79.1
SentCat (10%, 2nd)              90.5  70.7  79.4
SentCat (10%, 3rd)              90.6  70.8  79.5
SentCat (70%, 1st)              88.1  89.1  88.6
SentCat (70%, 2nd)              88.0  88.9  88.4
SentCat (70%, 3rd)              88.2  89.1  88.6
Our method: BERT (10%, 1st)     83.5  58.4  68.7
Our method: BERT (10%, 2nd)     90.9  70.9  79.6
Our method: BERT (10%, 3rd)     90.9  71.4  79.9
Our method: BERT (70%, 1st)     88.3  90.0  89.1
Our method: BERT (70%, 2nd)     88.5  89.4  88.9
Our method: BERT (70%, 3rd)     88.1  89.7  88.9
Control EM Base Platform”. These detected phrases are not listed in the original dictionary, and they are not annotated in the original work. On the other hand, the learned models detect file paths such as “apps/admin/handlers/” and “admin/action/” as the “File” category, as well as file names with unnecessary characters, such as “Admin/frmSite.aspx, (” and “admin/OptionsPostsList.php in”. The file path issue stems from frequent patterns in the file names: frequent substrings of file paths, such as “/Admin/” and “apps/”, act as features in the trained NER models, so phrases that contain these patterns are extracted. The extra characters are a problem of the annotation or of the original text. For instance, where the original text lacks proper spacing between a file name and the neighboring words or characters, the chunking of the sentence during NER training produces inaccurate chunks. The original annotation also has issues caused by its use of regular expressions, which annotate wrong words and phrases with wrong categories, such as “7.50/7.53” as the “File” category. In contrast, our models trained with both the SentCat and CategoryClassifier training datasets annotate them as the “Version” category. In the evaluation step, on average, about 2,794 entities are newly detected by a model with SentCat, of which about 438 are in the original keyword list, and about 2,960 entities are newly detected by a model with CategoryClassifier, of which about 422 are in the original dictionary. CategoryClassifier detects more entities, but SentCat detects more entities that are in the original keyword list. For both SentCat and CategoryClassifier, the best ratio of detected entities in the original dictionary occurs with the 10% dictionary: 33.4% (778 out of 2,327 entities) with SentCat and 29.1% (615 out of 2,113 entities) with CategoryClassifier.
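The CVE ID observation in this subsection can be illustrated with a small sketch. The regular expression used by the original automatic labeling is not given here, so both patterns below (`STRICT` and `RELAXED`) are assumptions chosen only to show how a simple pattern can miss IDs with a trailing suffix such as “.1”.

```python
import re

# Assumed "simple" pattern: CVE-<4-digit year>-<4 or more digits>, nothing else.
STRICT = re.compile(r"CVE-\d{4}-\d{4,}")
# Assumed relaxed pattern that also accepts a trailing ".<digits>" suffix.
RELAXED = re.compile(r"CVE-\d{4}-\d{4,}(?:\.\d+)?")


def is_cve_id(pattern: re.Pattern, token: str) -> bool:
    """True when the whole token matches the pattern."""
    return pattern.fullmatch(token) is not None


tokens = ["CVE-2008-2565", "CVE-2008-2565.1", "CVE-2009-4083.1", "4.1(7"]
strict_hits = [t for t in tokens if is_cve_id(STRICT, t)]
relaxed_hits = [t for t in tokens if is_cve_id(RELAXED, t)]

print(strict_hits)
print(relaxed_hits)
```

Under these assumptions the strict pattern accepts only the plain ID, while the relaxed one also accepts the two suffixed IDs mentioned in the text.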
Table 5. The results on the test data of the Sec col data. P, R and F1 represent precision, recall and F1 score, respectively.
Category     Metric  (A)   (B)   (C)   (D)   (E)   (F)   (G)   (H)   (I)-30  (I)-70  (J)-30  (J)-70
Person       P       85.4  28.9  61.2  79.1  85.7  72.8  79.2  67.4  56.4    50.1    55.0    47.1
Person       R       57.8  8.9   30    46.9  54.7  35    49.1  66.3  52.1    59.1    52.9    54.5
Person       F1      68.9  13.5  40.3  58.9  66.8  47.2  60.6  66.7  54.1    54.1    53.8    50.4
Loc          P       96.7  90.2  88.1  92.7  92.9  95.5  94.6  79.8  78.2    79.4    80.2    80.1
Loc          R       81.9  39.4  53.5  70    82.3  52.5  73.5  83.7  84.7    83.8    83.2    81.4
Loc          F1      88.6  54.8  66.6  79.8  87.3  67.6  82.7  81.7  81.3    81.5    81.6    80.7
Org          P       85.9  68.7  73    75.3  78.1  78.3  76.4  66.6  47.2    43.8    66.0    56.4
Org          R       65.5  30.3  38.3  62.1  69.1  48.6  67.5  60.4  45.6    53.0    47.4    52.0
Org          F1      74.3  42    50.2  68.1  73.3  59.9  71.6  63.2  46.3    47.9    55.2    54.1
Hacker       P       0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     0.0     0.0     0.0
Hacker       R       0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     0.0     0.0     0.0
Hacker       F1      0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     0.0     0.0     0.0
HackerGroup  P       87.5  0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     0.0     0.0     0.0
HackerGroup  R       14    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     0.0     0.0     0.0
HackerGroup  F1      24    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     0.0     0.0     0.0
Program      P       82.1  56.6  65.1  77.6  85.8  71.4  78.5  49.8  40.9    42.3    38.2    39.2
Program      R       61.2  29    40.4  51.3  60    57.1  58.2  55.6  32.6    41.2    43.9    49.3
Program      F1      70    38.4  49.9  61.8  70.6  63.4  66.6  52.5  36.2    41.6    40.8    43.4
Device       P       65.4  0.0   0.0   0.0   11.1  18.8  11.9  20.4  6.9     7.4     0.0     0.0
Device       R       21.9  0.0   0.0   0.0   0.8   2.5   0.8   3.5   6.9     6.8     0.0     0.0
Device       F1      32.5  0.0   0.0   0.0   1.5   4.3   1.3   5.8   6.7     6.8     0.0     0.0
Tech         P       71.3  63    67.2  71.8  77.4  70.2  76.6  58.9  46.4    42.4    40.4    38.1
Tech         R       53.6  4.1   16.8  55.5  41.9  48    53.7  66.9  44.7    49.3    55.3    59.3
Tech         F1      61.1  13.3  26.9  62.6  54.4  57    63.1  62.4  45.5    45.5    46.6    46.4
Virus        P       68.5  0.0   0.0   0.0   37.5  3     23.8  50.0  59.0    70.5    0.0     0.0
Virus        R       28.3  0.0   0.0   0.0   5.1   0.4   3.8   17.9  12.3    8.0     0.0     0.0
Virus        F1      39.6  0.0   0.0   0.0   9     0.7   6.6   24.5  19.8    14.2    0.0     0.0
Event        P       67.8  0.0   0.0   0.0   71.4  0.0   37.6  56.2  50.0    25.0    25.0    25.0
Event        R       27.2  0.0   0.0   0.0   5.9   0.0   7.2   9.6   2.9     0.7     0.9     0.4
Event        F1      38.5  0.0   0.0   0.0   10.9  0.0   12    16.0  5.5     1.4     1.7     0.7

5.2 Analysis: Sec col Data
As the results for the Sec col data in Table 5 and Table 6 show, our models with both SentCat and CategoryClassifier cannot learn or detect entities of the “Hacker” and “Hacker Group” categories. We checked the original dictionary and found the following: the “Hacker” and “Hacker Group” categories have very few entities (12 for “Hacker” and 37 for “Hacker Group”), and half of the hacker names in the original dictionary start and end with double quotes. spaCy could not handle many entities that start and end with double quotes or parentheses. Thus, these original annotation issues and this spaCy limitation may cause the low performance on the “Hacker” and “Hacker Group” categories. In addition, spaCy has only one pre-trained model that supports Russian, and that model covers multiple languages broadly but not deeply. Since the Sec col data consists of posts and publications from a Russian cybersecurity forum, it contains many technical keywords in English and Russian. We suspect that the pre-trained model does not have vector representations for many of these technical words and therefore could not learn the semantic relations of the entities. However, our models with both SentCat and CategoryClassifier can detect some useful entities that the original annotation missed. For instance, “Taiwan”, “Korea” and “Province of China” are all geographical terms (locations) and are categorized as “Loc” by our models, but the original annotation did not include them. In addition, the models detected some personal names and usernames as the “Person” category, such as “Carlos Almedia” and “Xaker45reg ***kov”, which are also absent from the original annotated entities. The Sec col data is manually annotated. Thus, we suspect that some human mistakes occurred during the annotation process and that our models can detect the missed entities.

Table 6. The results on the fully annotated test data of the Sec col data. P, R and F1 represent precision, recall and F1 score, respectively.
Category     Metric  (I)-40  (I)-70  (J)-30  (J)-70
Person       P       70.9    71.2    69.6    69.5
Person       R       63.6    67.5    53.9    64.0
Person       F1      67.0    69.2    60.6    66.4
Loc          P       75.3    79.1    80.5    80.5
Loc          R       85.3    83.2    82.7    80.8
Loc          F1      79.9    81.0    81.5    80.5
Org          P       47.8    47.8    66.0    60.1
Org          R       49.9    53.4    44.3    52.1
Org          F1      48.9    50.5    53.0    55.8
Hacker       P       0.0     0.0     0.0     0.0
Hacker       R       0.0     0.0     0.0     0.0
Hacker       F1      0.0     0.0     0.0     0.0
HackerGroup  P       0.0     0.0     0.0     0.0
HackerGroup  R       0.0     0.0     0.0     0.0
HackerGroup  F1      0.0     0.0     0.0     0.0
Program      P       45.8    44.3    39.8    41.0
Program      R       35.1    40.5    43.3    48.5
Program      F1      39.6    42.1    41.4    44.2
Device       P       13.3    14.1    0.0     0.0
Device       R       12.6    11.6    0.0     0.0
Device       F1      12.8    12.6    0.0     0.0
Tech         P       61.5    60.1    53.2    50.9
Tech         R       54.5    56.8    58.9    64.0
Tech         F1      57.8    58.3    55.9    56.7
Virus        P       84.2    72.6    0.0     0.0
Virus        R       7.7     6.5     0.0     0.0
Virus        F1      13.6    11.9    0.0     0.0
Event        P       8.3     25.0    25.0    25.0
Event        R       0.5     0.8     0.9     0.4
Event        F1      0.9     1.5     1.7     0.8
After carefully checking the original annotations, we found some potentially serious mistakes. For instance, “APT” (Advanced Persistent Threat) is annotated only as the “Virus” category in the original annotation. An APT is a stealthy threat actor, so it should be annotated as the “Hacker” or “Hacker Group” category. In addition, many non-virus entities, such as “DDoS”, “0-day”, and some CVE IDs, are annotated under the “Virus” category. These inaccurate annotations may lower the models’ performance. In the evaluation step, on average, about 1,702 entities are newly detected by a model with SentCat, of which about 188 are in the original keyword list, and about 1,559 entities are newly detected by a model with CategoryClassifier, of which about 200 are in the original dictionary. SentCat detects more entities, but CategoryClassifier detects more entities that are in the original keyword list. The best ratio of detected entities in the original dictionary occurs with the 10% dictionary for SentCat and with the 20% dictionary for CategoryClassifier: 31.0% (196 out of 632 entities) with SentCat and 31.6% (392 out of 1,242 entities) with CategoryClassifier.
5.3 Discussion
The experimental results on both the Auto-labeled data and the Sec col data show that our method can generate high-quality training data for an NER system with a smaller dictionary than the original annotated datasets. On the Auto-labeled data, both the SentCat and CategoryClassifier methods obtain low precision against the original annotation. However, when we use our annotation for the evaluation data as well, our method performs almost as well as most of the other NER methods. Since the original Auto-labeled data is annotated by an automated labeling method, this result shows that our method can generate a higher-quality training dataset for an NER system and can annotate more accurately than the original automated method. On the Sec col data, both the SentCat and CategoryClassifier methods achieve the highest precision or recall in some categories against the original annotation, such as “Person”, “Location”, and “Tech”. In addition, when we use our annotation for the evaluation data as well, our performance increases in most of the categories. However, the performance in the “Organization”, “Hacker”, “Hacker Group”, “Program”, “Device” and “Event” categories is lower than that of the other methods. The “Virus” category obtains the highest precision with the SentCat method but a score of 0 with the CategoryClassifier method. This suggests that CategoryClassifier may annotate the “Virus” category keywords less accurately than SentCat. This preliminary experiment shows the advantage of our method, but there is room for improvement. Nevertheless, our method can annotate more accurately than the automated labeling method on the Auto-labeled data, and it is able to support multiple languages on the Sec col data. In addition, we found many issues in the original datasets and annotations. We suspect that the low performance in some categories may be caused by these inaccurate keywords and categories. A comparison of the original annotation and our method’s
annotation, using a carefully selected dictionary whose keywords have been evaluated and assigned to the right categories by experts, will be needed to confirm the benefit of our method. We also need to combine our method with other state-of-the-art NER systems to see whether some of the issues above can be solved with different NER systems.
6 Conclusion
We introduced CategoryClassifier, which calculates the semantic similarity between a given keyword’s category and the sentence that includes the keyword, to minimize wrong annotations of ambiguous keywords. The initial experiment shows that CategoryClassifier performs slightly better than our previous measure, SentCat. The experimental evaluation shows that our method performs well after iterating the process and reaches almost the same performance as the state-of-the-art methods that use the fully annotated corpus, while using only about 30–70% of the keywords for annotation. In addition, the NER models trained with our method can detect many phrases that were not annotated originally. Thus, our method can generate high-quality training data with a small number of keywords compared to the original fully annotated data. Our method can help create high-quality training data for new cybersecurity domains when users need a new model to detect phrases of new categories. For future work, we will extend the current keyword matching algorithm to find the noun phrase that includes the keyword in the given sentence, since some keywords appear as part of a noun phrase but the current method annotates only the keyword itself. This change should increase the quality of the annotation. Then, we will extend the NER model from spaCy to other state-of-the-art NER models [1,2,5] and evaluate the difference in performance for each NER method. Finally, we will apply the trained NER model to other cybersecurity-related tasks, such as detecting new malware names and analyzing malware families and APT groups, supported by more detailed information on such actors.
References
1. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, 20–26 August, 2018, pp. 1638–1649 (2018)
2. Baevski, A., Edunov, S., Liu, Y., Zettlemoyer, L., Auli, M.: Cloze-driven pretraining of self-attention networks. CoRR, abs/1903.07785 (2019)
3. Bridges, R.A., Jones, C.L., Iannacone, M.D., Goodall, J.R.: Automatic labeling for entity extraction in cyber security. CoRR, abs/1308.4941 (2013)
4. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
5. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)
6. Dong, Y., Guo, W., Chen, Y., Xing, X., Zhang, Y., Wang, G.: Towards the detection of inconsistencies in public security vulnerability reports. In: Heninger, N., Traynor, P. (eds.) 28th USENIX Security Symposium, USENIX Security 2019, Santa Clara, CA, USA, 14–16 August, 2019, pp. 869–885. USENIX Association (2019)
7. Gasmi, H., Bouras, A., Laval, J.: LSTM recurrent neural networks for cybersecurity named entity recognition. In: ICSEA 2018, p. 11 (2018)
8. Gasmi, H., Bouras, A., Laval, J.: LSTM recurrent neural networks for cybersecurity named entity recognition. ICSEA 11, 2018 (2018)
9. Graves, A., Mohamed, A., Hinton, G.E.: Speech recognition with deep recurrent neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, 26–31 May, 2013, pp. 6645–6649. IEEE (2013)
10. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. CoRR, abs/1508.01991 (2015)
11. Jones, C.L., Bridges, R.A., Huffer, K.M.T., Goodall, J.R.: Towards a relation extraction framework for cyber-security concepts. In: Proceedings of the 10th Annual Cyber and Information Security Research Conference, CISR 2015, Oak Ridge, TN, USA, 7–9 April, 2015, pp. 11:1–11:4 (2015)
12. Kashihara, K., Shakarian, J., Baral, C.: Human-machine interaction for improved cybersecurity named entity recognition considering semantic similarity. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1251, pp. 347–361. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55187-2_28
13. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12–17, 2016, pp. 260–270 (2016)
14. Ma, X., Hovy, E.H.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, 7–12 August, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics (2016)
15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pp. 3111–3119 (2013)
16. Mulwad, V., Li, W., Joshi, A., Finin, T., Viswanathan, K.: Extracting information about security vulnerabilities from web text. In: Proceedings of the 2011 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Workshops, WI-IAT 2011, Campus Scientifique de la Doua, Lyon, France, 22–27 August, 2011, pp. 257–260 (2011)
17. Nguyen, T.H., Grishman, R.: Event detection and domain adaptation with convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26–31, 2015, Beijing, China, Volume 2: Short Papers, pp. 365–371 (2015)
18. Vinayakumar, R., Alazab, M., Jolfaei, A., Soman, K.P., Poornachandran, P.: Ransomware triage using deep learning: Twitter as a case study. In: Cybersecurity and Cyberforensics Conference, CCC 2019, Melbourne, Australia, 8–9 May, 2019, pp. 67–73. IEEE (2019)
19. Vinayakumar, R., Alazab, M., Srinivasan, S., Pham, Q.-V., Padannayil, S., Ketha, S.: A visualized botnet detection system based deep learning for the internet of things networks of smart cities. IEEE Trans. Ind. Appl. 1–1, January 2020
20. Tjong Kim Sang, E.F., Buchholz, S.: Introduction to the CoNLL-2000 shared task chunking. In: Fourth Conference on Computational Natural Language Learning, CoNLL 2000, and the Second Learning Language in Logic Workshop, LLL 2000, Held in cooperation with ICGI-2000, Lisbon, Portugal, September 13–14, 2000, pp. 127–132 (2000)
21. Satyapanich, T., Ferraro, F., Finin, T.: CASIE: extracting cybersecurity event information from text. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, pp. 8749–8757. AAAI Press (2020)
22. Simran, K., Sriram, S., Vinayakumar, R., Soman, K.P.: Deep learning approach for intelligent named entity recognition of cyber security. In: International Symposium on Signal Processing and Intelligent Recognition Systems, pp. 163–172. Springer (2019)
23. Sirotina, A., Loukachevitch, N.: Named entity recognition in information security domain for Russian. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pp. 1114–1120, Varna, Bulgaria, September 2019. INCOMA Ltd (2019)
24. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pp. 5998–6008 (2017)
25. Vinayakumar, R., Alazab, M., Soman, K.P., Poornachandran, P., Venkatraman, S.: Robust intelligent malware detection using deep learning. IEEE Access 7, 46717–46738 (2019)
26. Vinayakumar, R., Soman, K.P., Poornachandran, P.: Detecting malicious domain names using deep learning approaches at scale. J. Intell. Fuzzy Syst. 34(3), 1355–1367 (2018)
27. Vinayakumar, R., Soman, K.P., Poornachandran, P.: Evaluating deep learning approaches to characterize and classify malicious URL’s. J. Intell. Fuzzy Syst. 34(3), 1333–1343 (2018)
Text Classification Using Neural Network Language Model (NNLM) and BERT: An Empirical Comparison
Armin Esmaeilzadeh and Kazem Taghva
University of Nevada Las Vegas, Las Vegas, NV 89119, USA
[email protected], [email protected]
Abstract. Text classification is one of the most cited applications of Natural Language Processing. Classification can save the cost of manual effort and at the same time increase the accuracy of a task. With multiple advancements in language modeling techniques over the last two decades, a number of word embedding models have been proposed. In this study, we discuss two of the most recent models for the task of text classification and present a technical comparison.
Keywords: Natural Language Processing · NLP · Text classification · Word embedding · Language model · Transformers · BERT
1 Introduction
Natural Language Processing (NLP) encompasses multiple fields, including Linguistics, Computer Science, and Artificial Intelligence. The field has been expanding and has absorbed a variety of techniques from these disciplines, in addition to statistics and probability. One of the most well-researched areas in NLP is text classification [1]. The classification problem takes input data in the form of text or audio to train models that predict, or assign a probability to, the class to which a given input sequence belongs [18]. These models have wide-ranging applications such as sentiment analysis, topic labeling, question answering, and information retrieval. Over the last few decades, numerous models and approaches have been developed to tackle the text classification problem. The majority of early techniques took advantage of statistical properties of text documents, such as n-gram frequencies, to model the syntactic and semantic relationships among words, sentences, or documents [2]. However, these models were limited in their capacity to understand the underlying structure and dependencies of natural language [17]. With the rise of neural networks in other fields of machine learning such as image processing, a number of advancements have been made in developing deep neural network based models in the field of NLP. A major contribution of these models has been the idea of mapping words into a vector space representation known as word embedding [2]. These models have been shown to outperform early models by taking advantage of similarities and dependencies among words in a corpus [3]. These word embedding models can be used to encode sequences of words to create sentence or document embeddings, which in turn are used in text classification problems [17]. In this study, we first discuss some of the most well-known word embedding models in Sect. 2. In Sect. 3, we present a comparative analysis of two of the most influential models, the Neural Network Language Model (NNLM) and Bidirectional Encoder Representations from Transformers (BERT). The comparison summarizes our results in terms of accuracy, precision, recall, and F1-score. Finally, we present the conclusions of the experiments.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 175–189, 2022. https://doi.org/10.1007/978-3-030-82199-9_12
2 Related Works
Text embedding models transform text data into a computationally efficient representation in a vector space [2]. Vector space representations have been shown to capture the syntactic and semantic relationships among words [3,5,6]. These word embedding models make extensive use of neural language models to represent each word as a vector of fixed dimension [17]. In the following sections, we discuss some of the most important text embedding models.
2.1 Neural Network Language Model (NNLM)
One of the first models that used neural networks to produce word embeddings was proposed by [2]. It was an effort to avoid the curse of dimensionality of traditional statistical language models. By learning a distributed representation of words, the model can generalize to an exponential number of semantically similar words and sentences. Given a text document with |V| words, there is a mapping function C which maps a word i to a vector $C(i) \in \mathbb{R}^m$, the distributed feature vector of the word in m dimensions. For a target word $w_t$, a function g maps the feature vectors of the words in its context, $C(w_{t-1}), \ldots, C(w_{t-n+1})$, to a conditional probability distribution over the words in the vocabulary V. The i-th element of the output vector of g is the probability $P(w_t = i \mid w_1^{t-1})$ [2]:

$$f(i, w_{t-1}, \ldots, w_{t-n+1}) = g(i, C(w_{t-1}), \ldots, C(w_{t-n+1})) \tag{1}$$
The function g in Eq. (1) is implemented by a feed-forward neural network with parameters ω, so the overall parameter set is θ = (C, ω). The model is trained by finding the θ that maximizes the following log-likelihood, where T is the number of words in the training sequence and R(θ) is a regularization term:

$$L = \frac{1}{T} \sum_{t} \log f(w_t, w_{t-1}, \ldots, w_{t-n+1}; \theta) + R(\theta) \tag{2}$$
Text Classification
The NNLM model outperformed previous n-gram based language models in modeling the probability distribution of the words in the corpus and the semantic and syntactic relationships between words. However, long training time has been cited as a major obstacle in the development of this specific type of neural network for language modeling [17].
2.2
Word2Vec: Continuous Bag of Words (CBOW)
Continuous bag of words (CBOW) was proposed by [3] to train word embeddings. With a window size of c, the model predicts a target word $w_t$ given its surrounding words, known as context words, $(w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$, by maximizing the log probability in the following equation, where |V| is the number of words in the document [3]:

$$L = \frac{1}{|V|} \sum_{t=1}^{|V|} \log P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}) \tag{3}$$
The model was shown to outperform the state-of-the-art methods of the time with higher accuracy and lower computational cost [3]. However, one limitation of this model is that it relies on a small window around each word and therefore loses information about words farther away from the target word during training [16].
2.3
Word2Vec: Skip Gram
The Skip Gram model was also introduced by [3]. This model predicts the context words of a given target word, as opposed to continuous bag of words, which predicts the target word from its context words. Given a sequence of words $(w_{t-c}, \ldots, w_{t+c})$ as the training input, the model maximizes the log probability in the following equation to predict the context words for the target word $w_t$, where c is the size of the window and |V| is the number of words in the document [3]:

$$L = \frac{1}{|V|} \sum_{t=1}^{|V|} \sum_{\substack{j=t-c \\ j \neq t}}^{t+c} \log P(w_j \mid w_t) \tag{4}$$
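The difference between the two Word2Vec objectives is easiest to see in the training examples they generate. The following is a minimal sketch, not the implementation of [3]; the function names `skipgram_pairs` and `cbow_examples` and the toy sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, c):
    """Return (target, context) pairs for every position t and every
    index j with t - c <= j <= t + c and j != t (the inner sum of Eq. 4)."""
    pairs = []
    for t, target in enumerate(tokens):
        lo = max(0, t - c)
        hi = min(len(tokens) - 1, t + c)
        for j in range(lo, hi + 1):
            if j != t:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, c):
    """Return (context_words, target) examples: CBOW predicts the target
    word from the words in its window, the reverse of Skip Gram."""
    examples = []
    for t, target in enumerate(tokens):
        lo = max(0, t - c)
        context = tokens[lo:t] + tokens[t + 1:t + c + 1]
        examples.append((context, target))
    return examples


# Toy sentence (hypothetical data) with a window size of c = 1.
tokens = ["the", "movie", "was", "great"]
pairs = skipgram_pairs(tokens, c=1)
examples = cbow_examples(tokens, c=1)
```

With a window of c = 1, the word "movie" yields the two Skip Gram pairs ("movie", "the") and ("movie", "was"), and the single CBOW example (["the", "was"], "movie").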
The Skip Gram model shares the shortcoming of CBOW of being limited to a small number of words in the context of the target word, although it outperformed the prior state-of-the-art methods [16].
2.4
Global Vectors for Word Representation (GloVe)
GloVe was proposed by [4] as a model that takes advantage of the local context window of words and the global matrix factorization. It is a log-bilinear regression model which builds a word-word co-occurrence matrix of tokens in the document
A. Esmaeilzadeh and K. Taghva
and performs computation on the nonzero elements of the matrix instead of on the entire matrix or on the context windows of target words. The element $X_{ij}$ of the co-occurrence matrix X is the number of times word j occurs in the context of word i. Given the vectors $w_i$ and $w_j$ for the main word and the context word, the model defines the following cost function with biases $b_i$ for the main word and $b_j$ for the context word, where |V| is the size of the vocabulary [4]:

$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T w_j + b_i + b_j - \log X_{ij} \right)^2 \tag{5}$$
In the above equation, f is a weighting function that keeps the model from weighting all co-occurrences in the global matrix equally. It is defined as follows [4]:

$$f(X_{ij}) = \begin{cases} \left( \dfrac{X_{ij}}{x_{max}} \right)^{\alpha}, & \text{if } X_{ij} < x_{max} \\ 1, & \text{otherwise} \end{cases} \tag{6}$$

Although GloVe has outperformed similar models that leverage local context windows or co-occurrence matrices, it requires a lot of memory to store the global matrix. Also, any modification to the hyper-parameters related to the co-occurrence matrix requires a new matrix to be constructed, which is time consuming. Another limitation, shared with CBOW and Skip Gram, is that it does not learn representations for out-of-vocabulary words [16].
2.5
fastText
fastText [5] is an open source model developed at Facebook which is based on the Skip Gram model. The core difference between fastText and Skip Gram is that fastText splits each word into multiple character n-grams to extract subword-level information. Each n-gram is represented by a vector, and each word is represented by the sum of its n-gram vectors. This method of training has the advantage of also learning representations for out-of-vocabulary words [5]. In order to predict the context words around a given target word, the model treats the problem as a set of independent binary classification tasks. Given a target word $w_t$ at position t, the model takes all context words of $w_t$ as positive examples and randomly samples words from the dictionary as negatives. The negative log-likelihood for a context word at position c is as follows [5]:

$$NLL = \log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in N_{t,c}} \log\left(1 + e^{s(w_t, n)}\right) \tag{7}$$

where $N_{t,c}$ is the set of negative samples from the vocabulary. For any given word w we define two vector representations $u_w$ and $v_w$ in $\mathbb{R}^d$. To define a scoring function s between a given word $w_t$ and its context word $w_c$, we use the vectors $u_{w_t}$ and $v_{w_c}$ for the words $w_t$ and $w_c$. The score is then computed as the scalar product of these vectors [5]:

$$s(w_t, w_c) = u_{w_t}^T v_{w_c} \tag{8}$$
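The subword idea and the scoring function can be sketched in a few lines. This is a toy illustration under stated assumptions rather than the fastText library's API: the helper names (`char_ngrams`, `word_vector`, `score`, `nll`), the boundary markers `<` and `>`, and the two-dimensional n-gram vectors are all hypothetical choices made for the example.

```python
import math


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers,
    plus the whole marked word itself."""
    marked = "<" + word + ">"
    grams = {marked}
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def word_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Represent a word as the sum of the vectors of its known n-grams."""
    vec = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        for k, value in enumerate(ngram_vectors.get(gram, [0.0] * dim)):
            vec[k] += value
    return vec


def score(u, v):
    """Scalar product s(w_t, w_c) = u^T v of Eq. (8)."""
    return sum(a * b for a, b in zip(u, v))


def nll(s_pos, s_negs):
    """Negative log-likelihood for one positive score and a list of
    scores of negative samples (logistic loss on each)."""
    return math.log1p(math.exp(-s_pos)) + sum(
        math.log1p(math.exp(s)) for s in s_negs
    )


# Hypothetical 2-dimensional vectors for two trigrams of "where".
ngram_vectors = {"<wh": [1.0, 0.0], "re>": [0.0, 2.0]}
trigrams = char_ngrams("where", 3, 3)
u = word_vector("where", ngram_vectors, dim=2, n_min=3, n_max=3)
s = score(u, [0.5, 0.5])
loss = nll(0.0, [0.0])
```

Because a word vector is assembled from n-gram vectors, an unseen word still receives a representation as long as some of its n-grams were observed during training, which is the out-of-vocabulary advantage mentioned above.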
The fastText algorithm is also inherently limited to the local context of each word and therefore does not capture the overall semantic relationship between a given word and the document [16].
2.6
Embeddings from Language Models (ELMo)
ELMo is a language model proposed by [6] which generates deep contextualized representations for words. The core difference in this method is that the entire input sentence is considered when building word vectors, rather than only a fixed-size window around the target word. The model is trained with two LSTM neural networks, one for each direction of the text input. Given a sequence of V words $(w_1, w_2, \ldots, w_V)$, the model defines the following language models for forward and backward training [6]:

$$p(w_1, w_2, \ldots, w_V) = \prod_{k=1}^{V} p(w_k \mid w_1, w_2, \ldots, w_{k-1}) \tag{9}$$

$$p(w_1, w_2, \ldots, w_V) = \prod_{k=1}^{V} p(w_k \mid w_{k+1}, w_{k+2}, \ldots, w_V) \tag{10}$$

The model maximizes the joint log likelihood of the forward and backward language models [6]:

$$L = \sum_{k=1}^{V} \left( \log p(w_k \mid w_1, w_2, \ldots, w_{k-1}) + \log p(w_k \mid w_{k+1}, w_{k+2}, \ldots, w_V) \right) \tag{11}$$
It was shown that ELMo outperformed most other RNN and LSTM models trained in a single direction. The model also has the advantage of disambiguating the meaning and the part of speech of words in sentences [6]. However, the use of RNN and LSTM architectures causes the model to lose important information that is far from a given target word, and therefore long-distance semantic relationships are not captured [16].
2.7
Transformer
The Transformer is an architecture proposed by [7] that replaces recurrent neural networks with attention mechanisms and feed-forward networks. The architecture uses encoder and decoder blocks with similar components. The model is auto-regressive: it consumes previously generated output as input when predicting the next word.
The attention layer defines query, key and value vectors for each input word and computes a scaled dot product to measure the importance of the connection between a target word and the other words in the sequence. Stacking the vectors of all words in a batch gives three matrices, Q for queries, K for keys and V for values, and the attention function is defined below, where $d_k$ is the dimension of the key vectors [7]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V \tag{12}$$
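Scaled dot-product attention is small enough to write out without any library. The sketch below works on plain lists of lists and is meant only to make the formula concrete; the function names and the 2×2 toy matrices are assumptions for illustration, and a practical implementation would use a tensor library instead.

```python
import math


def softmax(row):
    """Numerically stable softmax of one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q, K and V are lists of row vectors; returns the output rows and
    the attention weights (one row of weights per query)."""
    d_k = len(K[0])
    scores = [[dot(q, k) / math.sqrt(d_k) for k in K] for q in Q]
    weights = [softmax(row) for row in scores]
    n_cols = len(V[0])
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(n_cols)]
        for row in weights
    ]
    return output, weights


# Toy example: two tokens with 2-dimensional queries, keys and values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, and each query puts most of its weight on the key it is most similar to, so each output row is a weighted average of the value rows.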
Moreover, it was shown that running 8 attention layers in parallel, known as heads, improves the accuracy of the model, and multi-head attention is defined by concatenating the result of each head [7]:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O \tag{13}$$
where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$ and $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are parameter matrices to be trained. The model outperformed state-of-the-art machine translation models and has the advantage of being parallelizable, which significantly reduces training time. On the other hand, the model is only trained in one direction and does not consider the forward context of a given target word [9].
2.8
Generative Pre-Training (GPT)
The GPT model, as proposed by [8], is a pre-trained model which uses the transformer's decoder block to perform feature extraction and define word embeddings. Unlike the ELMo model, which uses a bidirectional LSTM, the GPT model is trained in the forward direction only. The objective function for the language model is the same general equation, which predicts the next word given a sequence of context words [8]:

$$L_1(X) = \sum_{i} \log P(w_i \mid w_{i-k}, \ldots, w_{i-1}) \tag{14}$$
The output of the self-attention layers in the transformer decoders is used as the vector representation of each token. The GPT model takes advantage of the transformer architecture and has outperformed ELMo and other LSTM-based networks. However, one of the limitations of this model is that it is only trained in a single direction of the sequence, and therefore it is not able to capture the forward context of target words [9].
2.9
Bidirectional Encoder Representations from Transformers
In order to leverage both the transformer architecture and the bidirectional context of ELMo, [9] proposed the BERT model. By using the encoder block of the transformer with its attention mechanism, coupled with a Masked Language Model (MLM) objective which randomly masks input words in the sequence during training, the model is able to extract deep contextualized semantic information from sentences. A common approach to extracting word embeddings has been to aggregate the encoder outputs of a number of layers and use the result as the vector representation of the word. The model has outperformed both GPT and ELMo on a variety of NLP tasks including text classification, and it is currently the top state-of-the-art model [9,17].
2.10
BERT Improvements
There have been many efforts to fine-tune the original BERT model and apply it to specific domains, such as BioBERT [11] for biomedical text mining, VideoBERT [13], which is a joint model for video and language representation, and SciBERT [12], which adapts the BERT model by training on scientific documents. Due to the large number of training parameters in the BERT model, there have also been improvements such as ALBERT [10], which defines a number of parameter-reduction techniques to reduce computation time and memory consumption.
2.11
Others
Research has also been done on languages such as Chinese that are based on characters instead of words. One of the first models, proposed by [14] and known as the character-enhanced word embedding model (CWE), includes the characters of the Chinese language in the model. There have been other proposals and improvements built on top of CWE, such as the similarity-based character-enhanced word embedding model (SCWE) proposed by [15].
3
Experiments
In this experiment, we have implemented the NNLM model proposed by [2] and the BERT architecture proposed by [9]. These two models map the input sequence of words into word embeddings and then apply a feed-forward softmax layer as the final classifier to predict the class or topic of a given sequence of words. In the following sections, we discuss the process from data preparation to final evaluation in more detail.
3.1
Data Set
The data set used for text classification in this experiment is the IMDB review data set [19] which is a database of movie reviews provided by the IMDB website. The data set includes 50,000 samples of review contents and their corresponding category as positive or negative.
3.2
Tools and Frameworks
We use the Sklearn and TensorFlow packages and frameworks to implement the models, evaluate performance and perform text analysis and feature extraction. Other packages used include Pandas and Numpy for numerical computations, Spacy for language detection, Matplotlib for creating visualizations and NLTK for feature extraction.
3.3
Data Analysis and Processing
In this section, we perform data analysis on the data set to understand the composition of categories and the number of samples. The data set has been divided into 25000 samples for training, 15000 for evaluation and 25000 for testing. The schema of the data set includes a “text” column, which holds the content of the reviews, and a “category” column with two possible values, positive or negative, which is used as the target variable for classification. The analysis below has been performed on the training data set; the results for the evaluation and testing data sets were similar, so we only illustrate the results for the training data set. The number of samples for each category is 7500 rows, as shown in Fig. 1.
Fig. 1. Topic frequency.
We want to focus the experiment on the English language; therefore, we use the Spacy library to detect the language of each document and remove any samples that are not in English. The result is shown in Fig. 2, which shows that all samples are detected as English. We perform character, word, and sentence length analysis on the documents to detect any differences in frequencies between the two categories that could be used as features to build a basic classification model or that might impact the performance of the neural network models. The character count is illustrated in Fig. 3 with a histogram and density distribution for positive and negative reviews.
Fig. 2. Language frequency.
Fig. 3. Character frequency.
Fig. 4. Word frequency.
Fig. 5. Sentence frequency.
The word and sentence frequencies are shown in Fig. 4 and Fig. 5, in which the distributions for both categories overlap. Figure 6 and Fig. 7 show the average word and sentence lengths, for which there are no major differences between the distributions of the two categories.
Fig. 6. Average word length.
Another helpful analysis is N-Gram frequency. We compute 1-, 2- and 3-Gram frequencies on the documents, and the results are shown in Fig. 8 and Fig. 9. As the results show, the overall distribution of N-Grams is similar in both categories.
Fig. 7. Average sentence length.
Fig. 8. N-Gram frequency - positive.
The final analysis uses the TextBlob sentiment analysis package, which has a trained model and can be applied to the data set without the cost of training. As shown in Fig. 10, there is a difference in the distribution of sentiments between the two categories, and we want to model that difference using the NNLM and BERT models in our experiment.
Fig. 9. N-Gram frequency - negative.
Fig. 10. Sentiment distribution.
3.4
Experiment
In our experiments, we compare the performance of the NNLM and BERT language models on the IMDB review data set. For the NNLM model we use the nnlm-en-dim50 implementation from TensorFlow Hub, which is based on feed-forward neural net language models; the BERT model implementation is provided by Transformers, which also runs on the TensorFlow platform. Both models transform the input text to vector representations that are used for classification. A fully connected dense layer is appended at the end of each model during training, with the categories as the target variable, to perform the text classification task. We use binary cross-entropy as the loss function and Adam as the optimizer in our final TensorFlow model and train the model for 10 epochs with a batch size of 64.
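To make the loss function concrete, the following library-free sketch computes the mean binary cross-entropy for a batch of predicted probabilities. It is an illustration of the formula, not the TensorFlow implementation used in the experiment; the function name, the clipping constant `eps` and the toy labels are assumptions.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y log p + (1 - y) log(1 - p)] over the batch.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical labels (1 = positive review, 0 = negative review).
uninformed = binary_cross_entropy([1, 0], [0.5, 0.5])
confident = binary_cross_entropy([1, 0], [0.9, 0.1])
```

A classifier that always outputs 0.5 has a loss of log 2 (about 0.693), while confident and correct predictions drive the loss towards zero, which is what the optimizer exploits during training.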
3.5
Evaluation
The final results for the NNLM model are shown in Fig. 11. The overall accuracy of the model is 85%. The model has an F1-score of 86% and a recall of 87% in predicting positive reviews.
Fig. 11. NNLM classification report.
The overall performance of the BERT model is shown in Fig. 12. The accuracy of the model is 92%, which is much higher than that of the NNLM model. The model also achieves higher precision and F1-score on both the positive and negative categories. The AUC and Precision-Recall curves in Fig. 13 also demonstrate that the BERT model outperforms the NNLM model. A more detailed confusion matrix for both models is shown in Fig. 14. The overall comparison between the NNLM and BERT models shows that the BERT model outperforms the NNLM model in nearly all metrics discussed: accuracy improved by nearly 6% and AUC by approximately 5%.
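For readers who want to reproduce the reported metrics from raw predictions, the sketch below derives accuracy, precision, recall and F1-score from the four cells of a binary confusion matrix. The labels are made-up toy data, not the paper's results, and the function name is an assumption; Sklearn offers equivalent functionality in its `metrics` module.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1,
            "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn}}


# Toy labels (hypothetical): 1 = positive review, 0 = negative review.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
report = binary_metrics(y_true, y_pred)
```

In this toy case there are three true positives, one false positive, one false negative and three true negatives, so all four metrics equal 0.75.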
Fig. 12. BERT classification report.
Fig. 13. ROC and precision-recall curve.
Fig. 14. Confusion matrix.
4
Conclusion
In Sect. 2 of this research, we described some of the well-known language models and word embedding techniques that have been developed over the last two decades. While a number of recent models still take advantage of statistical properties of the documents, such as the word-word co-occurrence matrix in GloVe, most new models use some form of neural network to find parameters and learn the dependencies and structures within text documents. We presented an analysis of implementing and training two of these word embedding models, the Neural Network Language Model (NNLM) and the BERT architecture, to extract vector representations for words and then train a classification model on the word embedding vectors to assign a class or label to a given input sequence. The result of the experiment in Sect. 3 shows a great improvement in accuracy by the BERT architecture in comparison to the NNLM model. The BERT architecture showed a 6% improvement in precision compared to the other model. However, due to the computational cost of running these models, the experiment has been limited to 10 epochs and minor variations in hyper-parameters. In order to have a broader view of the performance and also the explainability of these models, more research and experiments are needed. One of the interesting aspects of word embedding is to test the word similarities in vector space and the attention layers of neural network models. These inspections are expected to be continued in future studies.
Acknowledgment. This research paper is based on works that are supported by National Science Foundation Grant No. 1625677.
References
1. Otter, D.W., Medina, J.R., Kalita, J.K.: A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems (2020)
2. Bengio, Y., et al.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
3. Mikolov, T., et al.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
4. Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
5. Bojanowski, P., et al.: Enriching word vectors with subword information. Trans. Assoc. Comput. Ling. 5, 135–146 (2017)
6. Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
7. Vaswani, A., et al.: Attention is all you need. Adv. Inform. Process. Syst. 30, 5998–6008 (2017)
8. Radford, A., et al.: Improving language understanding by generative pre-training (2018): 12
9. Devlin, J., et al.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
10. Lan, Z., et al.: Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
11. Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
12. Beltagy, I., Lo, K., Cohan, A.: SciBERT: a pretrained language model for scientific text. arXiv preprint arXiv:1903.10676 (2019)
13. Sun, C., et al.: Videobert: a joint model for video and language representation learning. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
14. Chen, X., et al.: Joint learning of character and word embeddings. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
15. Xu, J., et al.: Improve Chinese word embeddings by exploiting internal structure. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016)
16. Wang, S., Zhou, W., Jiang, C.: A survey of word embeddings based on deep learning. Computing 102(3), 717–740 (2020)
17. Li, Q., et al.: A Survey on Text Classification: From Shallow to Deep Learning. arXiv e-prints (2020): arXiv-2008
18. Kowsari, K., et al.: Text classification algorithms: a survey. Information 10(4), 150 (2019)
19. https://www.tensorflow.org/datasets/catalog/imdb_reviews
Past, Present, and Future of Swarm Robotics
Ahmad Reza Cheraghi1(B), Sahdia Shahzad1, and Kalman Graffi2
1 Technology of Social Networks, Heinrich Heine University, Düsseldorf, Germany
{ahmad.cheraghi,sahdia.shahzad}@hhu.de
2 Honda Research Institute Europe GmbH Offenbach am Main, Offenbach, Germany
[email protected]
https://tsn.hhu.de, https://www.honda-ri.de
Abstract. Swarm Robotics is an emerging field that adapts the phenomenon of natural swarms to robotics: it studies how robots can mimic natural swarms, like ants and birds, to form a scalable, flexible, and robust system. These robots show self-organization, autonomy, cooperation, and coordination amongst themselves. Additionally, their cost and design complexity must be as low as possible to reach systems similar to natural swarms. Further, the communication amongst the robots can be either direct (robot-to-robot) or indirect (robot-to-environment), without any central entity to control them. Swarm robotics has a wide range of application fields, from simple household tasks to military missions. This paper reviews swarm robotics approaches from the field's history to its future, based on 217 references. It highlights the prominent pioneers of swarm robotics and sheds light on the initial swarm robotics methods. The present of swarm robotics is shown based on simulators, projects, and real-life applications. For the future, this paper presents visions and ideas of swarm robotics.
Keywords: Robot · Swarms · Robot swarms · Swarm robotics · Survey
1
Introduction
The collective behavior shown by natural swarms such as honey bees, ants, fish and others has inspired humans to build robotic systems that act as similarly as possible to natural swarms. These natural swarms can coordinate their simple behaviors to form complex behaviors, with the help of which they can accomplish tasks that are impossible for single individuals to perform. Swarms of ants can build bridges to cross large gaps, termites can build mounds up to 30 feet high, fish form shoals to protect themselves from predators, and so on. Figure 1 shows a swarm of ants building a bridge to overcome a gap and Fig. 2 shows a mound built by termites. To realize natural-swarm-like systems in the field of robotics, it is first important to understand what a swarm actually means. There are several definitions of a swarm in the literature. One simple and straightforward definition is given by [58]: a large group of locally interacting individuals with common goals. This means that the aim is to build systems with swarms of robots that interact with each other and have common goals to accomplish, just as natural swarms work together to accomplish common tasks. This paper deals with the idea of realizing natural swarming in real-life systems with robotic swarms.

This paper summarizes the research in the field of swarm robotics, from its beginnings to the perspectives for its future. The aim is to give a glimpse of the history of swarm robotics, the recent work in this field and the future plans. Section 2 discusses the history of swarm robotics: what the inspiration for this field was, what the very first approaches and ideas were, and other historical aspects. Section 3 gives an overview of the swarm robotics field. It discusses the features, advantages, issues, tasks and application fields of swarm robotics. The present work, which includes different experimental and simulation platforms for swarm robotics, is described in Sect. 4. This section discusses the different types of simulators as well as real-life applications in the field of swarm robotics. In Sect. 5, future perspectives, ideas and plans are discussed. Related surveys are mentioned in Sect. 6. The last section concludes the work.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 190–233, 2022. https://doi.org/10.1007/978-3-030-82199-9_13
Fig. 1. Ants building a bridge [191].
Fig. 2. A mound built by termites [116].
2
The History of Swarm Robotics
The term ‘swarm’ was first applied in the context of robotics by G. Beni [62] and Fukuda [105] in 1988. According to G. Beni, cellular robotics is a system composed of autonomous robots that operate in an n-dimensional cellular space without any central entity. Additionally, they have limited communication among themselves, and they coordinate and cooperate to accomplish common goals. Fukuda, on the other hand, uses swarm to mean a group of robots that can work together like the cells of a human body and, as a result, accomplish complex goals. One year later, G. Beni and J. Wang [63] introduced the term swarm intelligence in relation to cellular robotic systems. They claimed that
A. R. Cheraghi et al.
Fig. 3. Swarming honey bees [44]
Fig. 4. Birds flocking [5]
Fig. 5. A fish shoal [14]
Fig. 6. Locusts swarming [18]
cellular robotic systems are able to show ‘intelligent’ behavior by coordinating their actions. In 1993, C. Ronald Kube and Hong Zhang [137] constructed a multi-robot system that was inspired by the collective behaviors of natural swarms. In the same year, Gregory Dudek et al. [97] defined swarm robotics with respect to different features, including the size of a swarm, the communication range amongst the robots in a swarm, communication topology, communication bandwidth, the reorganization rate of a swarm, the abilities of swarm members and swarm homo- or heterogeneity. According to the authors, ‘swarm’ is a synonym for multi-robot systems, which is why it was still not clear which properties distinguish ‘swarm robotics’ from other robotic systems. In the early research on swarm robotic systems, the focus remained on the exploration of swarming behaviors in different species, like ants, birds, fish, and others. The researchers examined these behaviors and explored ways to realize them in robotic systems [88,106,136,149,150]. Additionally, research was driven by different inspirations, like the flocking of birds or colonies of ants. Natural swarms have always been the main motivation behind the idea of swarm robotics. Many studies emulate different swarming behaviors like foraging, flocking, sorting, stigmergy, or cooperation. The works in [61] and [119] are two early studies (1994 and 1999) that
deal with the topic of stigmergy. Stigmergy refers to indirect communication amongst species and was introduced by [110] with reference to the behavior of termites. The first paper describes several experiments in which mobile robots are responsible for collecting randomly distributed objects in an environment via stigmergy. The author of [119] explores stigmergy and self-organization amongst robots that have the same capabilities. In 2004, G. Beni [64] made another attempt to describe a swarm more precisely. According to him, the robots in a swarm are simple, identical and self-organizing, the system must be scalable, and only local communication is available amongst swarm members. These are the properties that are still considered the basis for defining swarm robotic systems and distinguishing them from other robotic systems. The robots used for the experimentation had a lot in common with social insects, for example the simplicity and the decentralization of the system. As a result, the word ‘swarm’ was used instead. In the same time period, another research work [180] also dealt with the topic of swarm robotics. The author defined swarm robotics as follows: “Swarm robotics is the study of how large number of relatively simple physically embodied agents can be designed such that a desired collective behavior emerges from the local interactions among agents and between the agents and the environment”. He made some additions to the basic properties of a swarm robotic system. According to him, the robots must be autonomous, which means they should be able to interact with their environment and make decisions accordingly. Secondly, the swarm should consist of a small number of homogeneous groups, and each group should have a large number of robots in it. Further, it is still not clear what size a swarm can or should be. G.
Beni gives a brief definition of the size of a swarm: “It was not as large as to be dealt with statistical averages, not as small as to be dealt with as a few-body problem”. According to the author the size of a swarm should be in the order of $10^2$ − 10 …

… $(1 + \epsilon)\pi_{\theta_{old}}(a_t|s_t)$, at which point the objective will update no further than $(1 + \epsilon)A^{\pi_\theta}(s_t, a_t)$. Similarly, if $A^{\pi_\theta}(s_t, a_t) < 0$ (if the action performed worse than the critic expected it to), the objective will increase as $\pi_\theta(a_t|s_t)$ decreases only until $\pi_\theta(a_t|s_t) < (1 - \epsilon)\pi_{\theta_{old}}(a_t|s_t)$, at which point the objective will update no further than $(1 - \epsilon)A^{\pi_\theta}(s_t, a_t)$. In both of these cases, this ensures that the new policy remains reasonably close to the old policy, instead of performing dangerously worse. We used $\epsilon = 0.2$ based on the suggested value in [15].

input: a parametrized policy $\pi_\theta(a_t|s_t)$ and critic $V_\phi(s_t, a_t)$
Initialize parameters θ, φ, and state $s_0$;
repeat
    sample a trajectory τ from $\pi_\theta(a_t|s_t)$;
    fit $\hat{V}_\phi$ to sampled reward sums;
    $\hat{Q}^{\pi_\theta}(s_t, a_t) = \sum_k \gamma^k r_{t+k}$;
    $\hat{A}^{\pi_\theta}(s_t, a_t) = \hat{Q}^{\pi_\theta}(s_t, a_t) - \hat{V}_\phi(s_t, a_t)$;
    Optimize L(θ) using $\theta := \arg\max_\theta L(\theta)$ with Adam;
until training is complete;
Algorithm 3: Training with Proximal Policy Optimization
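Two steps of Algorithm 3 can be written down directly: the discounted reward sum and the clipping of the probability ratio. The sketch below uses the standard clipped surrogate form, min(r·A, clip(r, 1 − ε, 1 + ε)·A), which matches the behaviour described in the text; it is a per-sample illustration with hypothetical function names and toy numbers, not the authors' training code.

```python
def discounted_returns(rewards, gamma):
    """Q_hat_t = sum_k gamma^k * r_{t+k}, accumulated from the last
    time step backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def clipped_surrogate(ratio, advantage, eps=0.2):
    """Clipped objective for one sample, where
    ratio = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy numbers (hypothetical) to show where the objective stops growing.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
good_action = clipped_surrogate(ratio=1.5, advantage=2.0)   # capped at (1 + eps) * A
bad_action = clipped_surrogate(ratio=0.5, advantage=-1.0)   # capped at (1 - eps) * A
```

With ε = 0.2, raising the probability of a good action beyond 1.2 times its old value brings no further gain, and lowering the probability of a bad action below 0.8 times its old value brings none either, which keeps each policy update small.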
4
Results
We trained each model across several different hyperparameters including learning rate η and discount factor γ. Due to the large variance present in several of these configurations, we trained each combination ten times for a maximum of 1000 time steps each, or until it learned to balance for 10 s. Smaller values of η took longer to train, so we only report training curves for $\eta = 10^{-2}$, with $\gamma \in \{0.9, 0.99, 0.999\}$. Again, success was defined as keeping the virtual pendulum upright ($|\alpha| < 12^\circ$) without running off the track ($|x| < 0.4$ m) for 500 time steps (10 s).
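The success criterion can be expressed as a simple check over a recorded episode. This is a sketch under stated assumptions: the function name, the argument names and the idea of passing the angle and position histories as lists are hypothetical, while the thresholds are the ones given in the text. Note that 500 steps in 10 s corresponds to 0.02 s per step.

```python
def is_success(alpha_deg, x_m, max_angle_deg=12.0, max_x_m=0.4,
               required_steps=500):
    """True if the first `required_steps` samples all keep the pendulum
    upright (|alpha| < 12 degrees) and the cart on the track (|x| < 0.4 m)."""
    if len(alpha_deg) < required_steps or len(x_m) < required_steps:
        return False
    return all(
        abs(a) < max_angle_deg and abs(x) < max_x_m
        for a, x in zip(alpha_deg[:required_steps], x_m[:required_steps])
    )


# Hypothetical episodes: one steady, one that tips over on the last step.
steady = is_success([1.0] * 500, [0.05] * 500)
fell = is_success([1.0] * 499 + [15.0], [0.05] * 500)
too_short = is_success([1.0] * 100, [0.0] * 100)
```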
Benchmarking Virtual Reinforcement Learning Algorithms
Table 1. Training results for all three algorithms across different γ. Bold numbers are the least number of trials for each column, italics are the best for each algorithm.

Algorithm   γ       Success %   Average # Trials   Best # Trials
PG          0.9     60%         339                202
PG          0.99    100%        245                80
PG          0.999   80%         340                118
AC          0.9     50%         300                91
AC          0.99    100%        250                19
AC          0.999   90%         255                46
PPO         0.9     50%         72                 32
PPO         0.99    50%         258                23
PPO         0.999   50%         135                38
As seen in Fig. 5 and Table 1, all three algorithms were able to balance the pendulum, although they took different amounts of time to get there. On average, Actor-Critic performed slightly better than Policy Gradient across most values of γ, with γ = 0.99 giving both the fastest and the most consistent learning. PPO achieved very fast learning times, but only infrequently. While half of the PPO simulations that ran to completion learned to balance the pole, more than half of all attempts crashed due to numerical instability. Specifically, because of the denominator in the surrogate objective, the probability ratio occasionally exceeds floating-point range and returns inf or nan. These unrecoverable attempts are not recorded in Table 1 or Fig. 5. However, when PPO does not crash, it trains among the quickest of all the algorithms tested, which is supported by results in the literature [15].
Once the model is trained, we can recreate the neural network in MATLAB by combining the weight matrices and biases with the appropriate activation function in Simulink, which is then built by Quanser QUARC and loaded onto the controller. At this stage of testing, we only care about the mean action predicted by the network; we ignore the rest of the probability distribution. This significantly reduces the noise that was added to the model during training, which makes the real model more robust. We also ignore the predicted value function during testing, as we are no longer updating the policy parameters.
In Fig. 6, we can see the state and action spaces of the agent controlled by each of our three algorithms. The first 10 s of each graph show the pendulum balancing on its own before being tapped just hard enough that it is still able to remain upright. Because rewards were only given for remaining upright (|α| < 12°) and on the track (|x| < 0.4 m), the pole was never encouraged to remain exactly in the centre, so the models oscillate left and right through acceptable states of the environment.
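The PPO failure mode described above can be illustrated with a small sketch. It is not the authors' implementation: the Gaussian policy, the function names, and the clamp value max_log_ratio are assumptions chosen for illustration. It shows how a probability ratio formed by direct division becomes inf when the old policy's probability underflows to zero, and how forming the ratio from log-probabilities (a common mitigation, not something the paper reports using) keeps it finite before the clipped surrogate of [15] is applied.

```python
import math


def gaussian_log_prob(action, mean, std):
    """Log-density of a 1-D Gaussian policy at the given action."""
    return (-0.5 * ((action - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def naive_ratio(action, new_mean, old_mean, std):
    """Ratio new/old by direct division; the denominator can underflow to 0."""
    new_p = math.exp(gaussian_log_prob(action, new_mean, std))
    old_p = math.exp(gaussian_log_prob(action, old_mean, std))
    if old_p == 0.0:
        # Mirrors IEEE behaviour: x/0 is inf for x > 0 and nan for 0/0.
        return math.inf if new_p > 0.0 else math.nan
    return new_p / old_p


def log_space_ratio(action, new_mean, old_mean, std, max_log_ratio=20.0):
    """Ratio computed as exp(log p_new - log p_old) with a clamped exponent.

    max_log_ratio is a hypothetical safety bound, not a value from the paper.
    """
    log_ratio = (gaussian_log_prob(action, new_mean, std)
                 - gaussian_log_prob(action, old_mean, std))
    log_ratio = max(-max_log_ratio, min(max_log_ratio, log_ratio))
    return math.exp(log_ratio)


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With an action far from the old policy's mean (for example action 10, old mean -10, standard deviation 0.1), the naive ratio is inf, while the log-space version stays finite and the clipped surrogate remains bounded.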
Fig. 5. Policy gradient, actor-critic, and proximal policy optimization for a variety of discount factors γ. The mean trial is graphed, along with ±1 standard deviation, up to a maximum of 500 time steps.
Once again, Actor-Critic performed the best, absorbing a 2.5° disturbance without even saturating the motor. By comparison, Policy Gradient saturated the motor to recover from a 1.8° disturbance, and PPO was unable to recover from any disturbances of 1° or greater. Interestingly, PPO was able to absorb the initial disturbance but overcompensated, moving so quickly in the other direction that it deflected to 8.7°, as seen in Table 2. Actor-Critic also had the least control noise, which helps explain why it was able to handle the disturbance so well.
Table 2. How each pole reacted to a disturbance.
Algorithm | Max angle | Max voltage | Cart movement | Overcompensation
PG | 1.8° | 10 V | 282 mm | 3.2°
AC | 2.5° | 7.3 V | 175 mm | 5.5°
PPO | 0.9° | 9.9 V | 247 mm | 8.7°
5 Discussion
As we have seen, Policy Gradient, Actor-Critic, and Proximal Policy Optimization are all able to train a virtual model well enough that it can balance a real inverted pendulum while being robust to model errors, sensor noise, and disturbances of the system. However, not every hyperparameter combination resulted in a model that learned to balance every time. Finding values that would train consistently took some trial and error.
Fig. 6. State response and control effort of each agent over a 20 s timeframe, with a forced disturbance (a light tap on the pole) 10 s in. From left to right the algorithms are: policy gradient, actor-critic, and proximal policy optimization, and the graphs from top to bottom are: position (mm), angle (deg), horizontal cart velocity (mm/s), pendulum angular velocity (deg/s), and voltage (V).
Smaller learning rates balanced the virtual pendulum about as well as the models with $\eta = 10^{-2}$ reported in Table 1, but regularly took over 2500 trials to get there, which we determined to be too long. Additionally, the 32 neurons in the hidden layer were initially chosen arbitrarily, as this was enough to consistently balance the pole. Neural networks with fewer neurons are still able to balance the pendulum about half of the time, while more neurons in the hidden layer (or more layers) tend to increase training speed, even though they are not necessary for learning and can lead to overfitting.
The main hyperparameter we varied for Sect. 4 was the reward discount factor γ. Intuitively, this controls how much the agent values rewards now over rewards in the future, with γ = 0.9 giving the rewards a half-life of 0.13 s, γ = 0.99 a half-life of 1.38 s, and γ = 0.999 a half-life of 13.8 s. We found that γ = 0.99 consistently trained the fastest among these choices, which is consistent with the literature [5,13,15].
In addition to these constant discount factors, we also experimented with increasing or decreasing γ as training progresses. A function like $\gamma = 1 - 0.1^{0.03\ell}$, where $\ell$ is the length of the trial in time steps, smoothly increases from γ ≈ 0.822 to γ ≈ 1 as the length of each trial increases from 25 time steps (0.5 s) to 500 time steps (10 s). Intuitively, this should work well, since attempts that fail early on need to encourage the agent to take more greedy actions to survive immediately, while longer-lasting trials should consider a longer time horizon, ensuring the cart does not run off the track while trying to balance. However, this function gets very small for $\ell < 10$, corresponding to trials that fail almost immediately. On the other hand, $\gamma = \min\left(1, (\ell/50)^{-1/12}\right)$ equals 1 for 50 or fewer time steps (≤1 s) and decreases to γ ≈ 0.825 for 500 time steps (10 s). This function was designed so that only the last 25 time steps (0.5 s) of actions are discouraged.
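The half-lives and the two trial-length-dependent discount factors quoted above can be checked numerically. The sketch below is a reading of the text rather than the authors' code: it assumes a time step of 0.02 s (since 25 time steps are stated to be 0.5 s), and it reads the two annealing functions as 1 − 0.1^(0.03ℓ) and min(1, (ℓ/50)^(−1/12)), which reproduce the endpoint values given in the text. The function names are hypothetical.

```python
import math

TIME_STEP_S = 0.02  # assumed: the text equates 25 time steps with 0.5 s


def half_life_seconds(gamma, dt=TIME_STEP_S):
    """Time after which a reward's discounted weight gamma**k drops to 1/2."""
    return dt * math.log(0.5) / math.log(gamma)


def gamma_increasing(trial_length):
    """Discount factor that grows with trial length: 1 - 0.1**(0.03 * length)."""
    return 1.0 - 0.1 ** (0.03 * trial_length)


def gamma_decreasing(trial_length):
    """Discount factor min(1, (length / 50)**(-1/12)), equal to 1 up to 50 steps."""
    if trial_length <= 50:
        return 1.0  # also avoids 0**negative for a zero-length trial
    return (trial_length / 50.0) ** (-1.0 / 12.0)


if __name__ == "__main__":
    for g in (0.9, 0.99, 0.999):
        print(g, round(half_life_seconds(g), 2))
    print(round(gamma_increasing(25), 3), round(gamma_decreasing(500), 3))
```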
Intuitively, only the actions right before failure directly lead to the crash, regardless of trial length, so they should be the only ones strongly discouraged. In practice, neither annealing method improved performance for Policy Gradient, taking more than 45% longer to find a reasonable policy on average, so they were not attempted for Actor-Critic or PPO. A third function, $\gamma = 1 - 0.1^{0.0036\ell + 1}$, which interpolates between γ = 0.9 for $\ell = 0$ and γ = 0.9984 for $\ell = 500$, was also tried, but still failed to improve performance. All three alternative γ functions can be seen in Fig. 7.
For Actor-Critic, the critic network is trying to learn the normalized discounted rewards, an estimate of the value function $V^{\pi_\theta}(s)$, given only the state at each time step. Even though it shares many of its parameters with the policy network defining the distribution each action is sampled from, it effectively faces an impossible task. Given a state 0.5 s before the pole falls and the trial ends, the normalized discounted rewards-to-go associated with that time step will differ depending on the length of the trial. As an example, if the trial lasted 50 time steps, the normalized discounted reward associated with the 25th last state with γ = 0.99 would be $R_{25} \approx 0.105$, and that action would be slightly encouraged. However, if the trial lasted 500 time steps, then $R_{475} \approx -2.37$ and the action would be strongly discouraged, even though it corresponds to the exact same state 25 time steps before failure. The critic has no way of knowing how long
that particular trial is, so it has no reasonable way to accurately estimate the value function. The critic loss will necessarily be high, meaning less weight will be put on minimizing the actor loss.
Discounting but not normalizing the rewards seemingly solves that issue entirely. After discounting, when γ = 0.99, $R_{25} \approx 22.22$ if the trial lasts 50 time steps, and $R_{475} \approx 22.22$ if the trial lasts 500 time steps. Importantly, an identical state that leads to failure in the same amount of time will have the same discounted reward regardless of how long the trial is. Since the critic only has access to the state of the environment, it will predict the same value function given the same state, and is theoretically able to learn the function much better.
As seen in Fig. 8, this is not the case at all. Despite the critic not being able to accurately learn the value function, performance with normalized discounted rewards was superior to performance with unnormalized discounted rewards, especially for larger values of γ. Comparing Table 3 to Actor-Critic's rows in Table 1, we can see that the unnormalized rewards only learned to balance between half and a third as often, usually taking more than twice as long to get there. What the aforementioned argument neglected was that the critic network is not trying to estimate the state-action value function $Q^{\pi_\theta}(s, a)$; that would make the gradients identically 0. Rather, it is trying to estimate the expected return for a given state, of which the normalized discounted reward provides only a sample estimate. Additionally, large values of γ result in considerably larger discounted rewards towards the beginning of a long episode. For example, with γ = 0.999, a trial that lasts 50 time steps has unnormalized $R_0 < 48.8$, but when the trial lasts 500 time steps $R_0 > 393$, more than 8 times as large. Even a small percentage error in the critic's estimate will result in a massive critic loss, which will give it far too much weight during optimization.
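The comparison between unnormalized and normalized rewards-to-go can be reproduced with a short sketch. It assumes a constant reward of 1 per surviving time step, which is consistent with the unnormalized values quoted above but is not stated in this section, and the function names are hypothetical. The exact normalized values depend on the indexing convention and on whether the population or sample standard deviation is used, so only their signs and rough magnitudes should be compared with the figures in the text.

```python
import math


def discounted_rewards_to_go(num_steps, gamma, reward=1.0):
    """R_t = sum over k >= 0 of gamma**k * reward, up to the end of the trial."""
    rewards_to_go = [0.0] * num_steps
    running = 0.0
    for t in reversed(range(num_steps)):
        running = reward + gamma * running
        rewards_to_go[t] = running
    return rewards_to_go


def normalize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    short_trial = discounted_rewards_to_go(50, 0.99)
    long_trial = discounted_rewards_to_go(500, 0.99)
    # Same state, 25 steps before failure: identical unnormalized targets...
    print(short_trial[25], long_trial[475])
    # ...but very different targets after per-trial normalization.
    print(normalize(short_trial)[25], normalize(long_trial)[475])
```

The same helper also shows the scale problem for large γ: with γ = 0.999 the first entry of the 500-step trial is roughly eight times that of the 50-step trial.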
Normalizing the discounted rewards keeps the values in a much smaller range, so the optimization can focus primarily on updating the policy parameters instead of the value parameters.
Fig. 7. Attempted ways of annealing γ over time to improve performance.
Table 3. Training results for unnormalized actor-critic with different γ.
Algorithm | γ | Success % | Average # Trials | Best # Trials
AC unnormalized | 0.9 | 30% | 548 | 374
AC unnormalized | 0.99 | 50% | 603 | 345
AC unnormalized | 0.999 | 30% | 636 | 346
Fig. 8. Discounted rewards vs normalized discounted rewards for actor-critic.
6 Conclusion
Over the course of this paper, we have shown that a multitude of algorithms with a variety of learning rates are able to successfully train a neural network model to balance a virtual inverted pendulum. In each case, the trained models are able to balance a real inverted pendulum, demonstrating that they are robust to sensor noise, motor play, and forced disturbances. Policy Gradient and Actor-Critic were the most reliable, with a discount factor of γ = 0.99 and a high learning rate training fastest among the hyperparameter configurations tested, while Proximal Policy Optimization was able to balance the pendulum quickly but infrequently. Models trained with Actor-Critic were best able to recover from forced disturbances, recovering from angles in excess of 2.5° without saturating the motor, and without specifically being trained to do so. Future work will involve randomizing the domain and dynamics of the trained model in order to make the models more robust to changes in the real pendulum, helping them balance poles of varying length and weight. Since these changes will come at the expense of training time, more efficient implementations like PILCO are also being considered.
Notes and Comments. Funding support for this research was provided by the Center for Research in Scientific Computation.
Appendix
Table 4 contains the model parameter values used for the Equations of Motion in simulation. Some of the values given in the technical specifications of the Quanser User Manual were found to be incorrect, and were instead determined experimentally through parameter identification in [8].
Table 4. Model parameter values used in Eqs. 1–3.
Parameter | Description | Value
$B_p$ | Viscous damping coefficient, as seen at the pendulum axis | 0.0024 N · m · s/rad
$B_{eq}$ | Equivalent viscous damping coefficient as seen at the motor pinion | 5.4 N · m · s/rad
$g$ | Gravitational constant | 9.8 m/s²
$I_p$ | Pendulum moment of inertia, about its center of gravity | 8.539 × 10⁻³ kg · m²
$J_p$ | Pendulum's moment of inertia at its hinge | 3.344 × 10⁻² kg · m²
$J_m$ | Rotor moment of inertia | 3.90 × 10⁻⁷ kg · m²
$K_g$ | Planetary gearbox ratio | 3.71
$K_t$ | Motor torque constant | 0.00767 N · m/A
$K_m$ | Back Electromotive Force (EMF) constant | 0.00767 V · s/rad
$\ell_p$ | Pendulum length from pivot to center of gravity | 0.3302 m
$M$ | Cart mass | 0.94 kg
$M_p$ | Pendulum mass | 0.230 kg
$R_m$ | Motor armature resistance | 2.6 Ω
$r_{mp}$ | Motor pinion radius | 6.35 × 10⁻³ m
References
1. Achiam, J.: Part 3: Intro to Policy Optimization. OpenAI (2018)
2. Amini, A., et al.: Learning robust control policies for end-to-end autonomous driving from data-driven simulation. IEEE Robot. Autom. Lett. 5(2), 1143–1150 (2020)
3. Bonatti, R., Madaan, R., Vineet, V., Scherer, S., Kapoor, A.: Learning visuomotor policies for aerial navigation using cross-modal representations. CoRR (2020)
4. Deisenroth, M., Rasmussen, C.: PILCO: a model-based and data-efficient approach to policy search. In: ICML, pp. 465–472 (2011)
5. Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control. In: ICML (2016)
6. Kalashnikov, D., et al.: Scalable deep reinforcement learning for vision-based robotic manipulation. In: Billard, A., Dragan, A., Peters, J., Morimoto, J. (eds.) Proceedings of the 2nd Conference on Robot Learning, Proceedings of Machine Learning Research, vol. 87, pp. 651–673. PMLR (2018)
7. Karpathy, A.: Deep reinforcement learning: Pong from pixels (2016)
8. Kennedy, E.: Swing-up and stabilization of a single inverted pendulum: real-time implementation. Ph.D. thesis, North Carolina State University (2015)
9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR (2015)
10. OpenAI et al.: Solving Rubik's Cube with a robot hand. CoRR (2019)
11. Osiński, B., et al.: Simulation-based reinforcement learning for real-world autonomous driving. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 6411–6418. IEEE (2020)
12. Riedmiller, M.: Neural reinforcement learning to swing-up and balance a real pole. In: 2005 IEEE International Conference on Systems, Man and Cybernetics, vol. 4, pp. 3191–3196 (2005)
13. Schulman, J., Levine, S., Moritz, P., Jordan, M.I., Abbeel, P.: Trust region policy optimization. CoRR (2015)
14. Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P.: High-dimensional continuous control using generalized advantage estimation. CoRR (2015)
15. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. CoRR (2017)
Configuring Waypoints and Patterns for Autonomous Arduino Robot with GPS and Bluetooth Using an Android App
Gary H. Liao
Whiting School of Engineering, Johns Hopkins University, Baltimore, USA
Abstract. The challenge is to build a low-cost robot as a basis for personal uses, such as mowing, line trimming, or moving items from one place to another, that can be easily configured through a ubiquitous user interface like an Android app to move autonomously from one location to another. This paper describes such a robot and a user interface intended for non-technical users. A user can record a GPS waypoint; when directed, the robot then moves to that location in a relatively straight line, guided by GPS. In addition, a user can record a pattern while remote controlling the robot with their phone; when directed, the robot replays the same pattern without GPS.
Keywords: Autonomous · Robot · GPS · Arduino · Android
1 Introduction
This paper describes a low-cost autonomous utility robot for personal use that can be easily configured and operated using an Android phone and app. GPS-controlled Arduino robots have been used for scientific [1] and other purposes, but this paper focuses on personal robotic uses that require low cost and ease of use, in addition to autonomous movement between points along a non-straight path. An Arduino is a low-cost microcontroller with an ecosystem of available low-cost sensors, which is why it was chosen as the basis for the robot in this paper, though other low-cost microcontrollers might work equally well. Smartphones with apps are ubiquitous in households, and users are comfortable with them. A basic operation for a useful personal robot is to go from one place to another when instructed. This paper describes two ways a user can configure a robot to do so with their phone: A) specifying a previously stored waypoint, and B) replaying a previously recorded and stored pattern.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 304–312, 2022. https://doi.org/10.1007/978-3-030-82199-9_18
1.1 Specifying a Previously Stored Waypoint
Using the Android app, a user can control the robot with the app's remote control buttons. Communication between the robot and app is via Bluetooth. During this time, the robot transmits its GPS coordinates to the app. When the robot reaches a desired destination waypoint, the user can specify a name and store the GPS coordinates in the smartphone. Later, the user can retrieve the stored waypoint by name and instruct the robot to go to that waypoint. The robot rotates until it is oriented in the correct direction to proceed to the waypoint in a straight line and then moves to the waypoint, using GPS to make minor heading adjustments as necessary.
1.2 Replaying a Previously Stored Pattern
Using the Android app, the user can begin recording a remote control session. The user then moves the robot as desired, and the robot logs each motor controller's control signals in a data log unique to that recorded session. When the user stops the remote control session, data logging stops and the file is closed. Later, the user can specify the named pattern and instruct the robot to replay it.
2 System Architecture
The system architecture is shown in Fig. 1, which illustrates a block diagram of the Android app and robot. The layered architecture consists of 3 layers: A) User Interface Layer, B) Robot Control Layer, C) Robot Physical Layer.
2.1 User Interface Layer
The User Interface Layer consists of an Android Phone App. This app was developed using MIT App Inventor, and includes a clock, TinyDB, and Bluetooth. The clock is used to trigger regular events to check for communication with the robot via Bluetooth. TinyDB is used to store the named waypoints and patterns. These stored waypoints and patterns are used to configure the robot for autonomous operation. Bluetooth is used as the communication protocol between the app and robot. The User Interface allows a user to remote control the robot, store a waypoint, move the robot to a waypoint, start and stop a pattern recording, and instruct the robot to replay a stored pattern. The user interface design and programming are described in more detail in Sect. 3, Android App Design and Programming.
2.2 Robot Control Layer
The Robot Control Layer consists of an Arduino Mega 2560, HC-05 Bluetooth module, Adafruit LSM303AGR Compass, Adafruit Ultimate GPS with external active antenna, microSD module and card, power subsystem, and 4 Cytron MD10C motor controllers. The Arduino runs a program described in more detail in Sect. 4, Robot Design and Programming. Communication with the Android app is 2-way and goes through the Bluetooth module. The compass module is used to determine the robot's heading. The GPS module is used to determine the robot's GPS location. The microSD module and card store the named waypoints and patterns. The power subsystem includes a 12 V to 5 V DC converter and terminal blocks to distribute power. The motor controllers use 12 V to power the DC motors and 5 V for the control signals. The Arduino and modules use 5 V.
Fig. 1. Block diagram of layered system architecture.
2.3 Robot Physical Layer
The Robot Physical Layer consists of the chassis, 4 DC motors, a battery, and wheels. This robot is built using Actobotics U-Channel, 6-inch wheels, 4 NeveRest Classic 40 Gearmotors, and a Powersonic PS-12180NB 12 V 18 Ah lead-acid battery, but many other combinations of chassis, battery, motors, and wheels can be substituted.
3 Android App Design and Programming
The Android App is developed using MIT App Inventor. Figure 2 shows the Phone App User Interface with “Remote Control” and “Move to Waypoint” functionality on the left and “Replay Pattern” and additional diagnostic information on the right. The image on the right is simply the App user interface scrolled down to reveal the “Replay Pattern” interface.
3.1 Communication Protocol
The low-level communication protocol is Bluetooth, but the app-level protocol encodes each command as a ‘$’, followed by a command code, then a ‘,’, followed by a command value, and ending with ‘\n’. For example, the command to move to a waypoint named ‘garage’ is: “$mtw,garage\n”. The Command Codes and Values are shown in Table 1.
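The framing described above ('$', command code, ',', command value, '\n') can be sketched as a pair of helper functions. This is an illustrative sketch rather than the app's or the robot's actual code: the function names are hypothetical, and sending an empty value for commands whose value is listed as n/a is an assumption, since the text does not say how those are encoded.

```python
def encode_command(code, value=""):
    """Frame a command as '$<code>,<value>\\n'.

    An empty value for 'n/a' commands is an assumption made for this sketch.
    """
    if "," in code or "\n" in code or "\n" in value:
        raise ValueError("code and value must not contain the framing characters")
    return f"${code},{value}\n"


def parse_command(line):
    """Split one framed command into (code, value); raise on malformed input."""
    if not line.startswith("$") or not line.endswith("\n"):
        raise ValueError("command must start with '$' and end with a newline")
    code, separator, value = line[1:-1].partition(",")
    if not separator or not code:
        raise ValueError("command must contain a code followed by ','")
    return code, value


if __name__ == "__main__":
    frame = encode_command("mtw", "garage")
    print(repr(frame))           # the example frame from the text
    print(parse_command(frame))  # back to code and value
```

Because only the first comma is treated as the separator, a value that itself contains other punctuation, such as a pipe-separated coordinate pair, passes through unchanged.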
Fig. 2. Android app user interface
3.2 Storage
App storage uses TinyDB. TinyDB is a simple MIT App Inventor “database” that stores data on the phone. Each data entry in the database is assigned a unique tag and retrieved using that tag. TinyDB is used to store the name and lat/long for each waypoint and the names of the previously recorded patterns.
3.3 Remote Control
Remote control allows the user to control the speed and direction of the robot. The commands move the robot forward and backward, along diagonals, and rotate it left and right. The speed can be set from 0–100%.
3.4 Move to Waypoint
Move to Waypoint requires the user to first save one or more named waypoints. To do this, the user operates the robot via remote control to move the robot to the desired waypoint. Then the user enters a name for the waypoint and clicks the Save Waypoint button. If the name already exists, an error message is displayed. The user may enter a new name or delete the existing entry. To Move to Waypoint, the user selects from a list of previously saved waypoints, then clicks the Move to Waypoint button. When the Move to Waypoint button is pressed, a Move to Waypoint command is sent to the robot with the value of a pipe-separated lat and long, e.g. 45.12345|-122.12345.
The robot will then execute its internal autonomous Move to Waypoint routine described in Sect. 4, Robot Design and Programming.
3.5 Replay Pattern
The Replay Pattern user interface allows the user to record a remote control session. When the “Start Pattern” button is pressed, the ‘startpat’ command is sent to the robot. The robot begins recording to its local storage, using the name of the pattern as the name of the file. When the “Stop Pattern” button is pressed, the ‘stoppat’ command is sent to the robot. The robot then stops recording and closes the recording file. When the “Replay Pattern” button is pressed after choosing a previously recorded pattern from the Choose Pattern list, the ‘replaypat’ command is sent to the robot with the name of the pattern as the command value. The robot then replays the recorded motor controller control signals.
Table 1. Command codes and values.
Command code | Command value | Description
pwr | 0 or 1 | 0-Off, 1-On
spd | 0–100 | Set Speed 0–100%
fwd | 0–100 | Move Forward, speed 0–100%
bck | 0–100 | Move Backward, speed 0–100%
lft | 0–100 | Rotate Left, speed 0–100%
rgt | 0–100 | Rotate Right, speed 0–100%
fl | 0–100 | Move Front Left, speed 0–100%
bl | 0–100 | Move Back Left, speed 0–100%
fr | 0–100 | Move Front Right, speed 0–100%
br | 0–100 | Move Back Right, speed 0–100%
stp | n/a | Stop. Sets all motors to 0
mtw | Lat|Long | Move to Waypoint
startpat | Pattern name | Start Pattern Recording
stoppat | n/a | Stop Pattern Recording
replaypat | Pattern name | Replay recorded pattern
dec | Declination angle | Set combined Declination Angle and Offset
rc | n/a | Remote Control
4 Robot Design and Programming
The Arduino robot program is developed using Microsoft Visual Studio Code with the PlatformIO IDE extension. The main components of the robot programming are: an Arduino board, Bluetooth module, GPS module, Compass module, microSD module, and four motor controllers, each described below. The program is uploaded to the Arduino and is executed as soon as the Arduino is powered up. The program has two main parts, setup and loop. The setup is executed once at power up, then the loop is executed repeatedly in an endless loop. The program flow is described in Sect. 4.7 below.
4.1 Arduino
The Arduino board is a Mega 2560. This version includes four UART (Universal Asynchronous Receiver Transmitter) serial communication ports, 3 of which are used. Serial Port 0 is used for debugging. Serial Port 1 is used for the Bluetooth module. Serial Port 2 is used for the GPS module. The Mega 2560 also supports the I2C (Inter-Integrated Circuit) protocol, which is used by the compass module, and SPI (Serial Peripheral Interface), which is used by the microSD module.
4.2 Bluetooth
The Bluetooth module is a DSD TECH HC-05 Bluetooth Serial Pass-through Module, connected to Serial Port 1. This module provides 2-way Bluetooth communication between the Phone App and robot. Command Codes and Command Values are sent from the Phone App to the robot. GPS coordinates and debug information are sent to the Phone App. Debug information sent to the App includes the robot heading, the target waypoint course heading, and the distance to the target waypoint.
4.3 GPS
The GPS module is an Adafruit Ultimate GPS with external active antenna, connected to Serial Port 2. Earlier tests used a SIM33EAU GPS Module with a passive antenna. One of the challenges with the GPS modules was getting consistent and accurate readings. The accuracy is approximately 3 m, but spurious outlier readings would sometimes occur. To filter out these outliers, a median filter was used: multiple GPS readings were taken, then sorted using qsort, and the median reading was used.
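The median filter described above can be sketched as follows. The robot's firmware sorts readings with qsort on the Arduino; this Python version is only an illustration of the idea. Taking the median of latitude and longitude independently is an assumption, since the text does not say how two-dimensional readings were ordered, and the function name is hypothetical.

```python
def median_fix(readings):
    """Return a (lat, lon) pair built from the per-coordinate medians.

    readings is a non-empty list of (lat, lon) tuples in decimal degrees.
    For an even number of readings the upper middle element is used.
    """
    if not readings:
        raise ValueError("need at least one GPS reading")
    lats = sorted(lat for lat, _ in readings)
    lons = sorted(lon for _, lon in readings)
    middle = len(readings) // 2
    return lats[middle], lons[middle]


if __name__ == "__main__":
    samples = [
        (45.12345, -122.12345),
        (45.12347, -122.12343),
        (45.20000, -122.50000),  # spurious outlier
        (45.12344, -122.12346),
        (45.12346, -122.12344),
    ]
    print(median_fix(samples))
```

A single outlier moves the sorted extremes but not the middle element, which is why the median is preferred here over a simple average.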
4.4 Compass
The Compass module is an Adafruit LSM303AGR Compass. The compass communicates with the Arduino over I2C and is connected to the SDA (Serial Data line) and SCL (Serial Clock line) pins of the Arduino. The Compass module is very sensitive to magnetic fields on the robot, particularly near the motors and DC converter. To avoid these areas, the compass was placed on an elevated shelf 6 in. above the center of the robot. Compass readings also had to be adjusted for two effects: the mounting orientation of the compass module on the robot, and the magnetic declination at the compass's current location. Magnetic declination is the difference between magnetic north and true north and varies over time and location. The combination of these 2 adjustments was tuned empirically and set from the Phone App User Interface using the ‘dec’ command code.
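A minimal sketch of applying the combined correction is shown below. It assumes the value set through the 'dec' command is simply added to the raw compass heading and the result wrapped into the range 0° to 360°; the text does not state the sign convention, and the function name is hypothetical.

```python
def corrected_heading(raw_heading_deg, combined_offset_deg):
    """Add the combined declination and mounting offset, wrapped to [0, 360)."""
    return (raw_heading_deg + combined_offset_deg) % 360.0
```

For example, a raw reading of 350° with a combined offset of 15° wraps around to 5°.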
4.5 microSD
The microSD module is a HiLetgo MicroSD Card Adapter. The microSD module communicates with the Arduino over SPI and is connected to the SPI pins of the Arduino. A 32 GB microSD card is inserted into the adapter. The card is used for file storage. Each file is a data log recording of the control signals to the 4 motor controllers. The control signals are a PWM (Pulse Width Modulation) signal and a direction signal. PWM is explained in more detail in the Motor Controllers section below. Each file is named using the Pattern Name entered in the Phone App User Interface and sent via the ‘startpat’ command code. The file is a CSV (Comma Separated Value) file. Each row contains 8 comma-separated values corresponding to the front left motor PWM and direction, front right motor PWM and direction, rear right motor PWM and direction, and rear left motor PWM and direction. Each row is written during an update iteration of remote control. On replay of the pattern, the values of the signals are read from the file and sent to each corresponding motor controller on every update iteration.
4.6 Motor Controllers
The motor controllers are Cytron MD10C motor controllers. These are 10 A H-bridge motor controllers and require no heat sink. The control signals for each are a PWM signal and a direction signal. PWM controls the speed of the motor through the pulse width of the signal: the longer the signal is high within each period, the faster the motor turns. The direction signal controls the direction the motor turns. The Arduino Mega 2560 supports PWM output on pins 2–13 and 44–46. The motor controllers are connected to pins 2–9 for the left rear motor PWM and direction, left front PWM and direction, right rear PWM and direction, and right front PWM and direction, respectively. The direction pins do not require PWM, but are connected adjacent to the PWM pins for convenience.
4.7 Program Flow
The program flow includes a setup and loop.
The setup is executed once upon startup and the loop is executed repeatedly in an endless loop. The setup sets the serial port speeds and initializes the GPS module and compass module. The loop checks Bluetooth for any incoming Command Codes and Command Values, then operates in one of 4 modes: Off, Remote Control, Move to Waypoint, Replay Pattern. The purpose of each of these modes is to determine how to set the PWM and direction for each of the 4 motor controllers.
Off Mode. Motor controller PWM is set to 0 for all motor controllers.
Remote Control Mode. Motor controller PWM and direction are set based on the last Command Code. In Remote Control Mode, the Phone App User Interface sends a command when a Remote Control Command button is pressed and sends a Stop when the button is released. The robot therefore continues to move in the direction of the last Remote Control button press, at the speed set by the Speed slider, until the button is released.
If the “Start Pattern Recording” button is pressed, the Mode is set to Remote Control if not already in Remote Control Mode. In addition to setting the PWM and direction for each motor controller, the PWM and direction control signals are logged. If the “Stop Pattern Recording” button is pressed, the pattern data log is closed and recording is stopped.
Move to Waypoint Mode. In Move to Waypoint Mode, the PWM and direction control signals are computed based on the robot GPS coordinates, the robot compass heading, and the waypoint GPS coordinates. Adafruit provides libraries to retrieve the robot heading. Course and distance calculations are performed using TinyGPS++. TinyGPS++ is a small GPS library for Arduino providing universal NMEA parsing based on work by and courtesy of Maarten Lamers. Two specific methods used from this library are “distanceBetween” and “courseTo”. “distanceBetween” calculates the distance between two GPS coordinates. “courseTo” calculates a heading between 2 GPS coordinates, where 0° is a due north heading and 270° is a due west heading. Subtracting the robot heading from the “courseTo” heading gives the degrees the robot must turn. A positive value is a right rotation and a negative value is a left rotation. If a calculated rotation is more than 180°, the value is mapped to a rotation in the other direction of less than 180°. The robot first rotates until the difference is less than a tolerance of 3° and then begins to move forward. The left and right side motor speeds are adjusted slightly based on the magnitude of the difference between the “courseTo” heading and the robot heading, so that the robot veers slightly left or right to correct its course, but otherwise moves in a relatively straight line until the “distanceBetween” is less than 3 m.
Replay Pattern Mode. In Replay Pattern Mode, the PWM and direction control signals are set based on the file named in the Command Value of the “replaypat” Command Code. The same code path is executed as in Remote Control Mode to replicate the timing, and for each update iteration a row is read and replayed. When EOF (End of File) is reached, the robot is stopped and the Mode is set to Off.
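The heading logic of Move to Waypoint Mode described above can be sketched in a few lines. This is a simplified illustration, not the robot's firmware: the course and distance are taken as inputs (on the robot they come from the TinyGPS++ "courseTo" and "distanceBetween" methods), the base speed and steering gain are hypothetical values, and the function is stateless, so it rotates in place whenever the heading error exceeds the tolerance.

```python
def turn_angle(course_deg, heading_deg):
    """Course minus heading, mapped to (-180, 180]; positive means turn right."""
    difference = (course_deg - heading_deg) % 360.0
    if difference > 180.0:
        difference -= 360.0
    return difference


def motor_speeds(course_deg, heading_deg, distance_m,
                 base_speed=50.0, gain=0.5,
                 heading_tolerance_deg=3.0, arrival_radius_m=3.0):
    """Return (left, right) speeds in percent; negative means reverse.

    base_speed and gain are illustrative values, not taken from the paper.
    """
    if distance_m < arrival_radius_m:
        return 0.0, 0.0  # within 3 m of the waypoint: stop
    error = turn_angle(course_deg, heading_deg)
    if abs(error) >= heading_tolerance_deg:
        # Rotate in place towards the waypoint before driving forward.
        return (base_speed, -base_speed) if error > 0 else (-base_speed, base_speed)
    # Drive forward, speeding up one side slightly to veer towards the course.
    adjustment = gain * error
    return base_speed + adjustment, base_speed - adjustment
```

For example, a course of 10° with a heading of 350° yields a 20° right turn rather than a 340° left turn, which is the wrap-around mapping described in the text.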
5 Results
5.1 Remote Control
The Phone App User Interface is easy to use and straightforward. There are clear sections for Remote Control, Move to Waypoint, Record and Replay patterns, and debug information. A future improvement would be to use the phone orientation sensors for remote control rather than buttons. While holding the phone in landscape mode, rotation about the longitudinal axis of the phone could control speed forward and backward. Tilting the phone left and right could control left and right direction. Overall, a phone app is a highly available and convenient way to control a robot.
5.2 Move to Waypoint
The success and usability of Move to Waypoint is highly affected by GPS accuracy. Because GPS accuracy is currently about 3 m at best, applications requiring greater accuracy are not possible. GPS signal reception was also found to be weaker indoors than outdoors. Move to Waypoint is best suited for outdoor use in open spaces where an endpoint accuracy of 3 m is acceptable.
5.3 Replay Pattern
The replayed pattern closely matches what was recorded during remote control. Variations occurred due to wheel slippage and small timing variations. Overall, however, replaying a pattern proved to be very useful for navigating the robot along a non-straight path and was more accurate than Move to Waypoint, bringing the robot to within 3 m of its destination.
5.4 Improvements
Improvements include storing and replaying patterns based on motor encoder values rather than PWM and direction signals. Storing encoder values allows distance to be calculated if the wheel size is configured as a known value in the robot. If distance is used, the replay could be sped up or slowed down and still maintain the same pattern. Another improvement is using the same single interface to configure 2 or more robots as a swarm [2] that moves autonomously between two locations, with collision avoidance strategies [3] so the robots don't collide. Multiple robots would get a job done faster, and a simple interface for configuring autonomous patterns in many robots could make this possible.
5.5 Overall
Using a smartphone to easily configure autonomous patterns for a low-cost Arduino robot proved to be a useful low-cost combination for possible personal robotic uses. Moving a robot autonomously from one place to another is just the beginning. With attachments, additional possible outdoor uses involving repeated patterns include mowing, line trimming, and cleaning.
References
1. Salman, H., Rahman, M.S., Tarek, M.A.Y., Wang, J.: The design and implementation of GPS controlled environment monitoring robotic system based on IoT and ARM. In: 2019 4th International Conference on Control and Robotics Engineering (ICCRE), Nanjing, China, pp. 93–98 (2019). https://doi.org/10.1109/ICCRE.2019.8724268
2. Mohamad Nor, M.H., Ismail, Z.H., Ahmad, M.A.: Broadcast control of multi-robot systems with norm-limited update vector. Int. J. Adv. Rob. Syst. 17(4), 1729881420945958 (2020)
3. Alam, M.S., Rafique, M.U., Khan, M.U.: Mobile robot path planning in static environments using particle swarm optimization (2020). arXiv preprint arXiv:2008.10000
Small Scale Mobile Robot Auto-parking Using Deep Learning, Image Processing, and Kinematics-Based Target Prediction
Mingxin Li and Liya Grace Ni
California Baptist University, Riverside, CA 92504, USA
Abstract. Autonomous parking is a valuable feature in many mobile robot applications. Compared with self-driving automobiles, auto-parking is more challenging for a small-scale robot equipped with only a front camera, because the camera view is limited by the height of the robot and by the narrow Field of View (FOV) of the inexpensive camera. In this research, auto-parking of such a small-scale mobile robot is accomplished in a four-step process: identification of an available parking space using transfer learning based on AlexNet; image processing for the detection of parking space boundary lines; kinematics-based target prediction when the parking space disappears from the camera view partially or completely; and motion control of the robot as it navigates towards the center of the parking space. Results show that an accuracy of 95% has been achieved in identifying available parking spaces. The detection of boundary lines and the prediction of the target have also been successfully implemented in MATLAB. The testing of motion control and image capture for deep learning is performed on a self-built small-scale mobile robot.
Keywords: Autonomous parking · Deep learning · Image processing
1 Introduction
Autonomous parking is a valuable feature in many robot applications, such as tour guide robots, UV sanitizing robots, food delivery robots, and warehouse robots. Auto-parking allows a robot to park in the charging zone and charge itself whenever necessary without human intervention. Depending on the application, other tasks can follow the auto-parking. For instance, a vacuum robot can automatically empty dust at the docking station. Compared with self-driving automobiles, auto-parking is more challenging for a small-scale mobile robot equipped with only a front camera, because the camera view is limited by the height of the robot and by the narrow Field of View (FOV) of the inexpensive camera.
In this research, auto-parking of a small-scale mobile robot with only a front camera is accomplished in a four-step process. Firstly, transfer learning [1] is performed based on AlexNet [2], a popular pre-trained Convolutional Neural Network (CNN). An average success rate of 95% on the identification of empty parking slots has been achieved. Secondly, the image of the detected empty parking slot is processed with edge detection, followed by the computation of parametric representations of the boundary lines using the Hough transform algorithm [3]. Thirdly, the position of the entrance point of an available parking space is predicted from the robot kinematic model as the robot drives closer to the parking space, because the boundary lines may disappear partially or completely from the camera view due to the height and FOV limitations. Lastly, the predicted entrance point is used as the set point for the motion control of the robot until it is replaced by the actual center of the parking space once that becomes visible to the robot again. The linear and angular velocities of the robot chassis center are computed from the error between the current chassis center and the set point. The left and right wheel speeds are obtained using inverse kinematics and sent to the motor driver.
The main advantage of this four-step approach is its minimal sensing requirement. Unlike mapping-based localization and navigation methods, such as self-docking by the Roomba cleaner, the proposed approach needs only an inexpensive front camera rather than LiDAR for range detection. It does not need multiple cameras to build a surround view either. Also, in applications involving a large number of identical small-scale mobile robots, identification of available parking space can be better realized with the proposed approach than with mapping-based methods.
The remainder of this paper is organized as follows: in Sect. 2, related research is summarized, and background knowledge of applied techniques is described. Hardware and software setup is presented in Sect. 3. The four-step process is discussed in Sect. 4, followed by conclusions and future directions in Sect. 5.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 313–322, 2022. https://doi.org/10.1007/978-3-030-82199-9_19
2 Background and Related Research
2.1 Related Research
As self-driving technologies have matured in the automobile industry over recent years, auto-parking has drawn a lot of interest from researchers, and many solutions have been proposed. An algorithm that generates a surround view from images taken by front, back, left, and right cameras was developed in [4]. The parking slot is then detected based on the surround view. Whereas that solution requires four wide-view cameras, this study focuses on an affordable solution for small-scale robots equipped with only an inexpensive front camera. In [5], Google Maps was used to generate the top view of the parking lot, which involves the use of GPS. However, some small-scale robots are confined to indoor environments, which limits the availability and accuracy of the GPS signal. In [6], the front, rear, and right ultrasound sensors on a smart wheeled mobile robot (SWMR) were used to detect the parking space environment, simulate drivers' parking strategy, and complete roadside parking and reverse parking. Similarly, the mobile robot in [7] uses three optical proximity sensors placed at the front and the two sides to detect obstacles and decide between parallel parking and perpendicular parking. The company ROBOTIS has developed an auto-parking application for its educational robot, TurtleBot. The auto-parking feature of TurtleBot is achieved by a combination of Augmented Reality (AR) parking sign detection and LiDAR-based localization [8]. The auto-parking mechanism proposed in this paper only uses a front
camera, instead of ultrasonic sensors, proximity sensors, or a LiDAR sensor. Compared with the above-mentioned approaches, the solution proposed in this paper offers lower cost and simpler construction. However, to achieve reliability and accuracy comparable to those of other solutions, the object recognition step needs further optimization for more complicated surroundings. This solution is also limited to small-scale robots; it is not suitable for cars.
2.2 Transfer Learning with AlexNet
Transfer learning is commonly employed in deep learning applications. A pre-trained network is used as a starting point to learn a new task. Fine-tuning a network with transfer learning is usually much faster and easier than training a network from scratch with randomly initialized weights, and the learned features can be transferred to the new task with a smaller number of training images. The network adopted in this study is AlexNet, a popular pre-trained Convolutional Neural Network (CNN). It has been trained on over a million images and can classify images into 1000 object categories, such as keyboards, coffee mugs, pencils, and many types of animals. The network has therefore learned rich feature representations for a wide range of images. It takes an image as input and outputs a label for the object in the image. The details of applying transfer learning with AlexNet in this research are discussed in Sect. 4.1.
2.3 The Hough Transform
The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing [3]. Its purpose is to find imperfect instances of objects within a certain class of shapes by a voting procedure. The voting is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.
In this research, the boundary lines of an available parking slot are detected using the MATLAB built-in function hough from the Image Processing Toolbox™. The parametric representation of a line is:
$$\rho = x\cos\theta + y\sin\theta, \tag{1}$$
where ρ is the distance from the origin to the line along a vector perpendicular to the line, θ is the angle between the x-axis and this vector, and (x, y) are the coordinates of an arbitrary point on the line, as shown in Fig. 1. The hough function generates a parameter space matrix whose rows and columns correspond to these ρ and θ values respectively.
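The voting procedure behind Eq. (1) can be illustrated with a short pure-Python sketch. It is not the MATLAB hough function used in this research; it is a simplified stand-in that votes over whole-degree angles from −90° to 89° and rounds ρ to the nearest integer, both of which are assumptions made for the example. The edge points are synthetic.

```python
import math


def hough_lines(points, theta_step_deg=1):
    """Accumulate votes in (rho, theta) space for a list of (x, y) edge points."""
    votes = {}
    for x, y in points:
        for theta_deg in range(-90, 90, theta_step_deg):
            theta = math.radians(theta_deg)
            # Eq. (1): every line through (x, y) satisfies rho = x cos(theta) + y sin(theta)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            key = (rho, theta_deg)
            votes[key] = votes.get(key, 0) + 1
    return votes


def strongest_line(votes):
    """Return the (rho, theta_deg) cell that received the most votes."""
    return max(votes, key=votes.get)


# Synthetic horizontal edge: all pixels on the row y = 37.
horizontal = [(x, 37) for x in range(300, 636)]
# Synthetic vertical edge: all pixels on the column x = 100.
vertical = [(100, y) for y in range(0, 51)]

horizontal_peak = strongest_line(hough_lines(horizontal))
vertical_peak = strongest_line(hough_lines(vertical))
```

Points that lie on one straight line all vote for the same (ρ, θ) cell, so that cell becomes the maximum of the accumulator; for the horizontal row the peak is at θ = −90°, and for the vertical column it is at θ = 0°.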
Fig. 1. Parametric representation of a line in Hough transform
3 Hardware and Software Setup
3.1 Hardware Setup
A small-scale differential drive robot, shown in Fig. 2, was built to test the proposed auto-parking method. A Raspberry Pi 4 running the Debian operating system serves as the processing unit. Four DC motors are controlled by an L298N dual H-bridge motor driver and powered by a power bank through a USB power module. A Pi Camera is mounted at the front of the robot. Images acquired by the Pi Camera while the robot drove itself around were included in the dataset for transfer learning.
Fig. 2. Small scale robot built for this research
3.2 Software Setup
A variety of techniques and tools are employed to perform the tasks in the auto-parking process. The motion control and image acquisition modules are implemented
in Python and run on the Raspberry Pi 4. The identification of available parking slots is developed using the functions provided by the Deep Learning Toolbox™ in MATLAB. The boundary lines of the available parking slot are detected and visualized using the Image Processing Toolbox™ in MATLAB. The kinematics-based target prediction and the simulation of the target-tracking navigation are both performed in MATLAB.
4 The Four-Step Auto-parking Process
In this section, the four-step auto-parking process for a small-scale mobile robot is described in detail, and testing results for each step are presented.
4.1 Identification of Available Parking Space
As mentioned earlier, AlexNet is a Convolutional Neural Network (CNN) pre-trained on over a million images that can classify images into 1000 object categories. The early layers of AlexNet learned low-level features such as edges and blobs, and the last layers learned task-specific features. The network takes an image as input and outputs a label for the object in the image. First, AlexNet was loaded into the workspace and its final layers were replaced so that they learn features specific to the new dataset. The network was then trained with a dataset of 150 empty parking space images and 150 occupied parking space images, with the output labels being “Parking Slot” and “Taken Slot”. The dataset was divided into two groups, with 70% for training and the remaining 30% for validation. The network was trained for 6 epochs, where an epoch is a full training cycle on the entire training dataset. With transfer learning, it is not necessary to train for as many epochs as when training from scratch. Validation was performed upon the completion of training, and the validation accuracy can be visualized with the function provided in MATLAB. The MATLAB interface in Fig. 3 shows the six epochs and the accuracy rate, with a validation accuracy of 95.77%. Figure 4 shows eight randomly selected pictures from the validation set with labels indicating the classification results. For this sample, all eight images are correctly labeled. To obtain a more reliable accuracy estimate, this procedure was run 100 times, and the overall accuracy is about 95%. In this test, the classifier only handles two classes: available slots and occupied slots.
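The data handling described above (a 70%/30% split of 300 labeled images and an accuracy averaged over repeated runs) can be sketched in plain Python. This is only an illustration of the bookkeeping, not the MATLAB workflow used in this research; in particular, splitting each class separately (a stratified split) and the function names are assumptions made for the example, and no network is trained here.

```python
import random


def stratified_split(labels, train_fraction, rng):
    """Return (train_idx, val_idx) so that each class keeps the same train/validation ratio."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train_idx += shuffled[:cut]
        val_idx += shuffled[cut:]
    return train_idx, val_idx


def mean_accuracy(run_accuracies):
    """Average the validation accuracy over repeated training runs."""
    return sum(run_accuracies) / len(run_accuracies)


# 150 empty and 150 occupied parking-space images, as in the dataset described above.
labels = ["Parking Slot"] * 150 + ["Taken Slot"] * 150
train_idx, val_idx = stratified_split(labels, 0.7, random.Random(0))
```

With this split, 210 images are used for training and 90 for validation, with 105 training images per class. Repeating the split and training with different random seeds and passing the resulting accuracies to mean_accuracy gives an averaged estimate of the kind reported above.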
A new class of partially occluded slots can be added as a future improvement, together with a corresponding action plan.
4.2 Determination of Parking Slot Boundaries
The original image of an empty parking slot was captured by the Pi Camera mounted at the front of the small-scale robot. A black and white image containing only the edges was generated by applying the built-in Canny edge detection [9] function in MATLAB. Even though the boundaries can be seen in the edge detection result, the robot needs parametric information about the boundary lines, i.e. their ρ and θ values. The built-in hough function in MATLAB provided a data structure that contains the position information
Fig. 3. Training process of transfer learning based on AlexNet
Fig. 4. Validation results
of detected boundary lines. In Fig. 5, (a) shows the original image, (b) shows the black and white image after edge detection, and (c) shows green lines plotted over the original image to highlight the boundaries. The green lines were plotted from the ρ and θ values that received the highest number of votes in the Hough transform. For each line, the starting point, the ending point, and the ρ and θ values are listed in Table 1.
4.3 Kinematics-Based Target Prediction
Once an available parking slot has been identified and its boundary lines determined, the robot is supposed to drive itself towards the center of the parking slot. However, because of its height constraint and narrow FOV, the parking slot partially or completely disappears from the camera view as the robot moves closer. At this point, the robot
Fig. 5. Parking slot boundary detection: (a) original image; (b) black and white image after edge detection; (c) green lines with identified parameters plotted over original image

Table 1. Parameters of identified boundary lines.

Boundary line   Starting point (pixels)   Ending point (pixels)   θ (degrees)   ρ (pixels)
1               [300, 38]                 [635, 38]               −90           −37
2               [290, 42]                 [318, 445]              −4            285
3               [653, 39]                 [755, 396]              −16           616
does not know where to go. The target position, namely the entrance point of the parking slot, can be predicted based on the robot kinematic model [10]. Denote the position and orientation of the differential drive robot as (x, y) and φ respectively. With respect to its local frame, where the x axis is defined as its heading direction, its velocities can be calculated as functions of the wheel speeds and wheel/chassis parameters as follows:
$$\begin{bmatrix}\dot{x}\\ \dot{y}\\ \dot{\varphi}\end{bmatrix} = \begin{bmatrix} v_C\\ 0\\ \omega_C\end{bmatrix} = \begin{bmatrix}\dfrac{r\omega_R + r\omega_L}{2}\\ 0\\ \dfrac{r\omega_R - r\omega_L}{d}\end{bmatrix} \tag{2}$$
where $v_C$ and $\omega_C$ are the linear and angular velocities of the chassis center, $\omega_R$ and $\omega_L$ are the right and left wheel speeds, $r$ is the wheel radius, and $d$ is the distance between the two wheels. The position of the target, i.e. the entrance point of the available parking slot, is fixed with respect to the global frame but keeps changing with respect to the robot local frame. While the parking slot is still within the camera view, the target in the image frame can be estimated from the detected boundary line parameters. When the boundary lines are about to disappear from the camera view, the target position in the image frame is mapped to an initial position $(x_0, y_0)$ with respect to the robot local frame at that moment. Then the target positions $(x_0, y_0), (x_1, y_1), \ldots, (x_t, y_t), (x_{t+1}, y_{t+1}), \ldots$ with respect to the changing local frame are predicted as follows:
$$\begin{bmatrix}x_{t+1}\\ y_{t+1}\end{bmatrix} = \begin{bmatrix}\cos(\omega_C\,\Delta t) & \sin(\omega_C\,\Delta t)\\ -\sin(\omega_C\,\Delta t) & \cos(\omega_C\,\Delta t)\end{bmatrix}\begin{bmatrix}x_t - v_C\,\Delta t\\ y_t\end{bmatrix} \tag{3}$$
where $\Delta t$ is the sampling period.
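The forward kinematics of Eq. (2) and the target update of Eq. (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the MATLAB implementation used in this research; the function names and the numeric values for the wheel radius, wheel separation, and sampling period are assumptions made for the example.

```python
import math


def chassis_velocity(omega_r, omega_l, r, d):
    """Eq. (2): linear and angular velocity of the chassis center from the wheel speeds."""
    v_c = r * (omega_r + omega_l) / 2.0
    w_c = r * (omega_r - omega_l) / d
    return v_c, w_c


def predict_target(x_t, y_t, v_c, w_c, dt):
    """Eq. (3): target position in the robot local frame after one sampling period."""
    c, s = math.cos(w_c * dt), math.sin(w_c * dt)
    shifted_x = x_t - v_c * dt  # the robot advanced along its own x axis
    return c * shifted_x + s * y_t, -s * shifted_x + c * y_t


# Hypothetical robot: 3 cm wheel radius, 15 cm between wheels, 0.1 s sampling period.
v_c, w_c = chassis_velocity(omega_r=10.0, omega_l=10.0, r=0.03, d=0.15)
target = (1.0, 0.2)
for _ in range(10):
    target = predict_target(target[0], target[1], v_c, w_c, dt=0.1)
```

With equal wheel speeds the robot drives straight, so after ten steps the target has moved 0.3 m closer along the local x axis while its lateral offset is unchanged. When the wheel speeds differ, the rotation matrix in Eq. (3) also turns the target around the robot.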
Figure 6 shows the robot detecting the parking space from the origin and then moving to the position p, where it can no longer see the parking space because of the FOV limit. With the proposed method, the robot is still able to track the predicted target position until the entrance point appears in its camera view again. The target prediction and motion planning can be updated at a slower rate than the motion control. In other words, the motion control algorithm receives the actual or predicted target position every 5 s, while the motor speeds are adjusted at a faster rate in order to track the target.
Fig. 6. Kinematics-based target prediction
4.4 Motion Control
The robot uses an L298N dual H-bridge motor driver to control the motors. The two wheels on the same side are wired so that they turn at the same speed. The motion control algorithm is programmed in Python. The desired linear velocity of the robot chassis $v_C$ is computed as the distance between the robot position (the origin of the local frame) and the predicted target position multiplied by a gain $k_v$, while the desired angular velocity $\omega_C$ is obtained as the difference between the current heading angle and the direction pointing towards the target multiplied by another gain $k_\omega$. The left and right wheel speeds are then calculated based on the inverse kinematics model in (4):
$$\begin{bmatrix} v_R\\ v_L\end{bmatrix} = \begin{bmatrix} v_C + \tfrac{1}{2} d\,\omega_C\\ v_C - \tfrac{1}{2} d\,\omega_C\end{bmatrix} \tag{4}$$
where $v_R = r\omega_R$ and $v_L = r\omega_L$ are the linear speeds of the right and left wheels, consistent with (2).
A MATLAB simulation was performed to verify the motion control algorithm. Figure 7 shows the simulation result of the robot moving towards a given target position. Many target positions are shown in the simulation result to demonstrate the trajectory.
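The proportional control law and the inverse kinematics of Eq. (4) can be sketched as follows. This is an illustrative reading of the description above, not the Python code running on the robot; the function name, the gains, and the wheel separation are hypothetical, and the target is assumed to be given in the robot local frame, so the heading error is the bearing of the target.

```python
import math


def control_step(target_x, target_y, k_v, k_w, d):
    """Proportional control in the local frame, then Eq. (4) for the wheel speeds."""
    distance = math.hypot(target_x, target_y)        # position error
    heading_error = math.atan2(target_y, target_x)   # bearing of the target
    v_c = k_v * distance
    w_c = k_w * heading_error
    v_r = v_c + 0.5 * d * w_c
    v_l = v_c - 0.5 * d * w_c
    return v_r, v_l


# Target straight ahead: both sides should turn at the same speed.
ahead = control_step(2.0, 0.0, k_v=0.5, k_w=1.0, d=0.15)
# Target directly to the left: the right side must run faster than the left.
left = control_step(0.0, 1.0, k_v=0.5, k_w=1.0, d=0.15)
```

Averaging the two returned speeds recovers the commanded linear velocity, and dividing their difference by d recovers the commanded angular velocity, which is the relation between Eq. (4) and Eq. (2). On real hardware the speeds would also need to be limited to what the motor driver can deliver, which this sketch omits.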
Fig. 7. MATLAB simulation of motion control
5 Conclusions and Future Directions
In conclusion, the four subtasks of auto-parking for a small-scale robot have been accomplished. By applying transfer learning with AlexNet, the empty parking slot is identified with high accuracy. The parametric information of the boundaries is obtained by line detection with the Hough transform. The target location is predicted based on the robot kinematic model. Simulation results demonstrate smooth navigation to the target location with inverse-kinematics-based motion control. One future direction for this research is to integrate the subsystems into one hardware/software solution. To run the deep learning and image processing on the robot, a more powerful board than the Raspberry Pi 4 is needed, for example the Jetson Nano by Nvidia, which is designed for better performance in machine learning and image processing tasks. Another direction is to improve the robustness of the parking slot identification. For example, the neural network should be trained with more pictures taken from different view angles or with obstacles in the scene. In addition, other pre-trained neural networks such as DenseNet or ResNet can be used to improve the identification accuracy of vacant parking spaces.
References
1. West, J., Ventura, D., Warnick, S.: Spring research presentation: a theoretical foundation for inductive transfer. Brigham Young Univ. Coll. Phys. Math. Sci. 1, 8 (2007)
2. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
3. Duda, R.O., Hart, P.E.: Use of the Hough transformation to detect lines and curves in pictures. Commun. ACM 15(1), 11–15 (1972)
4. Li, L., Zhang, L., Li, X., Liu, X., Shen, Y., Xiong, L.: Vision-based parking-slot detection: a benchmark and a learning-based approach. In: Proceedings of 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, pp. 649–654 (2017)
5. Singh, A., Chandra, A., Priyadarshni, D., Joshi, N.: Self parking car prototype. In: Proceedings of 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, pp. 2251–2254 (2018)
6. Wu, T.-F., Tsai, P.-S., Hu, N.-T., Chen, J.-Y.: Research and implementation of auto parking system based on ultrasonic sensors. In: Proceedings of International Conference on Advanced Materials for Science and Engineering (ICAMSE), Taiwan (2016)
7. Shet, A.J., Killikyatar, A., Kumar, A., Ropmay, F.B.: Autonomous self-parking robot. In: Proceedings of 2018 International Conference on Design Innovations for 3Cs Compute Communicate Control (ICDI3C), Bangalore, India (2018)
8. ROBOTIS e-Manual: TurtleBot3. https://emanual.robotis.com/docs/en/platform/turtlebot3/autonomous_driving/#autonomous-driving
9. Canny, J.: A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell. 6, 679–698 (1986)
10. Beggs, J.S.: Kinematics. Springer-Verlag, Berlin (1983)
Local-Minimum-Free Artificial Potential Field Method for Obstacle Avoidance
Abbes Tahri and Lakhdar Guenfaf
LRPE Laboratory, University of Science and Technology Houari Boumediene (USTHB), BP 32 Alia, 16111 Algiers, Algeria {a.tahri,lakhdar.guenfaf}@usthb.dz
Abstract. In this paper we present a new formulation of the local minimum problem that characterizes the artificial potential field method for obstacle avoidance. Our approach consists of detecting the local-minimum situation by measuring the angle formed by the two force vectors (the repulsive force and the attractive force). We then add a new vector expressed in terms of this angle. The modified APF method ensures a local-minimum-free, obstacle-free path planning algorithm. The developed algorithm is verified through MATLAB simulations. A robot placed at an initial position moves within a 50 × 50 m² area to reach a goal point while avoiding three stationary obstacles. Four different initial configurations are simulated. The robot dynamics are represented by a first-order integrator model. The obtained trajectories show the effectiveness of the proposed algorithm.
Keywords: Obstacle avoidance · Artificial Potential Field (APF) · Path planning
1 Introduction
Obstacle avoidance has been a challenging problem in robot navigation for the past decades. Many techniques have been proposed to provide optimal solutions to the obstacle-free path planning problem in terms of computational complexity, solution optimality, path length, and path completeness. The Artificial Potential Field (APF) method proposed in [1] is one of the earliest and most popular methods in this context. Originally proposed as a real-time, sensor-based obstacle avoidance algorithm, its simplicity and intuitiveness allowed the method to be extended to path planning based on an offline map of the environment. Despite its simplicity and low computational complexity, the APF method suffers from the local minimum problem, which may trap the robot so that it never reaches its final goal. To overcome the local minima issue, several improvements have been developed, such as the Virtual Force Field (VFF) [6], the Vector Field Histogram (VFH) [7], and VFH+ [8].
In this paper, we present a different formulation of the local minimum problem and, based on it, introduce a modification of the original APF method. The new variant that we suggest is shown to resolve the local minimum problem and to ensure an optimal obstacle-free path planning solution. Our theoretical formulation and development are confirmed through MATLAB simulations, which show reassuring performance.
The rest of this paper is organized as follows: in Sect. 2, a short literature review of related works and recent achievements on the topic is presented. Next, in Sect. 3, we review the local-minimum problem that characterizes the APF technique and introduce our modification to overcome this issue. In Sect. 4, we assess the effectiveness of our solution through multiple MATLAB simulations for different scenarios. Finally, we summarize our main remarks and outcomes in Sect. 5.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 323–331, 2022. https://doi.org/10.1007/978-3-030-82199-9_20
2 State of the Art
Mobile robot navigation has become a major focus of robotics research because of its role in many applications, such as robotic rescue systems, outdoor service robots, UAV formations, and autonomous underwater vehicles for pipeline monitoring. The mobile robot navigation problem can be divided into two main tasks: path planning and obstacle avoidance. Several works have tracked the advances in the field and given overviews of obstacle avoidance techniques [2–4]. In [4], a broad overview of the main obstacle avoidance techniques is given. It confirms that VFH and VFH+, presented in [7] and [8] respectively, are good alternatives to the traditional APF presented in [1]. Nevertheless, the computation time grows with the grid size, since both methods rely on the certainty grid approach. Other techniques, such as the Dynamic Window approach (first presented in [9]), the Nearness Diagram, the curvature velocity method, and the elastic band concept, are also covered. The authors of [3] pay particular attention to the methods derived from the artificial potential field method and analyse their performance. They reach similar conclusions about the drawbacks and advantages of those methods. Additionally, they find that the harmonic potential functions presented in [10] perform better at eliminating local minimum configurations, despite their high computation time. Furthermore, the works of [13] and [14] were recently revisited in [11] to develop a sensor-based reactive navigation solution. The suggested technique does not give an optimal solution in terms of computation time, given the complexity of defining the robot's local workspace from generalized Voronoi diagrams. New improvements are therefore still needed.
In the next section, we present a new formulation of the local-minimum problem and suggest a simple and efficient modification of the traditional APF to ensure a local-minimum-free obstacle avoidance algorithm.
3 Problem Formulation
In this section we present the main idea of our work, which aims to guarantee local-minimum-free path planning. First, we recall the formulation of the traditional APF method.
In [1], O. Khatib proposed treating the robot navigation space as an artificial field in which obstacles generate repulsive potentials within a limited area, while the goal point applies an attractive potential. A robot moving in such an area is attracted to its final goal by the attractive force that the goal generates. If the robot enters an obstacle's influence zone, it is “pushed away” by the obstacle's repulsive force. The direction and speed of the robot's movement are determined by the resultant of the repulsive and attractive forces (see Fig. 1).
Fig. 1. Artificial potential field method (figure labels: robot, obstacle, goal point, repulsive force influence zone, repulsive force, attractive force, resultant force)
Consider a 2-D Cartesian system, and let $q = (x, y)^T$, $q_g = (x_g, y_g)^T$, and $q_{obs} = (x_{obs}, y_{obs})^T$ be the coordinates of the robot (represented by its centre of gravity), the goal point, and an obstacle point respectively. The artificial potential function is given by [3]:
$$U(q) = U_{att}(q) + U_{rep}(q) \tag{1}$$
where $U(q)$ is the global artificial potential field, $U_{att}(q)$ is the attractive potential field, and $U_{rep}(q)$ is the repulsive potential field. The attractive potential function is designed to drag the robot to the goal point, and the following expression is suggested:
$$U_{att}(q) = \frac{1}{2} k_a \lVert q - q_g \rVert^2 \tag{2}$$
where $k_a > 0$ is the attractive potential coefficient and $\lVert q - q_g \rVert$ is the Euclidean distance between the robot location and the final goal.
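On the other hand, the repulsive potential function is designed to repel the robot once it enters the obstacle's influence zone, and it vanishes outside this zone:
$$U_{rep}(q) = \begin{cases}\dfrac{1}{2} k_b \left(\dfrac{1}{\lVert q - q_{obs}\rVert} - \dfrac{1}{d_0}\right)^2, & \lVert q - q_{obs}\rVert \le d_0\\[2mm] 0, & \lVert q - q_{obs}\rVert > d_0\end{cases} \tag{3}$$
where $k_b > 0$ is the repulsive potential coefficient, $\lVert q - q_{obs}\rVert$ is the Euclidean distance between the robot location and the obstacle, and $d_0$ is the radius of the obstacle's repulsive field disc. The forces are taken as the negative gradients of the respective potentials, thus:
$$F_{att}(q) = -\nabla U_{att}(q) = -k_a (q - q_g) \tag{4}$$
$$F_{rep}(q) = \begin{cases} k_b \left(\dfrac{1}{\lVert q - q_{obs}\rVert} - \dfrac{1}{d_0}\right) \dfrac{1}{\lVert q - q_{obs}\rVert^2}\, \dfrac{q - q_{obs}}{\lVert q - q_{obs}\rVert}, & \lVert q - q_{obs}\rVert \le d_0\\[2mm] 0, & \lVert q - q_{obs}\rVert > d_0\end{cases} \tag{5}$$
$$F_{sum}(q) = F_{att}(q) + F_{rep}(q) \tag{6}$$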
$F_{sum}$ is the resultant force that determines the direction and speed of the robot.
3.1 Local Minima Problem
The point $q_g$ is supposed to be the only point where:
$$F_{sum}(q_g) = F_{att}(q_g) + F_{rep}(q_g) = F_{att}(q_g) + 0 = 0$$
This ensures that the robot reaches its final goal point. However, in some situations we may have $F_{att}(q) = -F_{rep}(q)$ at some location, which leads to $F_{sum}(q) = 0$ (see Fig. 2). At such a location, $U(q)$ has a local minimum, which traps the robot and prevents it from reaching its goal point.
3.2 The Modified APF
Our main observation, which leads to resolving the local minimum problem, is formulated as follows: “A local minimum configuration only occurs if the two force vectors are collinear and point in opposite directions”. This statement can easily be verified by observing that:
$$F_{att}(q) = -F_{rep}(q) \iff \begin{cases}\lVert F_{att}(q)\rVert = \lVert F_{rep}(q)\rVert\\ \angle\big(F_{att}(q), F_{rep}(q)\big) = (2k+1)\pi,\; k \in \mathbb{Z}\end{cases} \tag{7}$$
Fig. 2. Local minimum configuration (figure labels: robot, obstacle, F_rep, F_att, Q-goal)
The second equation in system (7) gives an intuitive tool for detecting the occurrence of a local minimum by measuring the angle formed by the two vectors, and hence for overcoming the trap configuration. This can be done by adding a third vector, perpendicular to the repulsive force vector, that pushes the robot out of the local minimum location. The new vector has to be expressed in terms of
$$\alpha = \angle\big(F_{att}(q), F_{rep}(q)\big) \tag{8}$$
and also of $F_{rep}(q)$, to ensure that it vanishes outside the obstacle influence zone. For a robot moving in $\mathbb{R}^2$, the modified sum of forces $F_{sum}(q)$ is given as follows:
$$F_{sum}(q) = F_{att}(q) + F_{rep}(q) + \frac{k_\alpha \cos(\alpha)}{1 + \lVert F_{rep}(q)\rVert} \begin{bmatrix} F_{rep\_y}\\ -F_{rep\_x}\end{bmatrix}, \quad k_\alpha > 0 \tag{9}$$
where $F_{rep\_x}$ and $F_{rep\_y}$ are the components of the repulsive force vector along the x-axis and y-axis respectively. In the next section, we present simulation results for an implementation of the presented modified APF.
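To make the correction term concrete, the following Python sketch evaluates the modified resultant for given force vectors. It follows one reading of Eq. (9), in which the extra vector is $(F_{rep\_y}, -F_{rep\_x})$ scaled by $k_\alpha \cos(\alpha) / (1 + \lVert F_{rep} \rVert)$; this reading, the function names, and the numeric values are assumptions made for illustration and are not the authors' MATLAB code.

```python
import math


def angle_cosine(f_att, f_rep):
    """cos(alpha) for the angle alpha between the two force vectors, Eq. (8)."""
    norm_product = math.hypot(*f_att) * math.hypot(*f_rep)
    if norm_product == 0.0:
        return 0.0  # angle undefined when a force is zero; no correction is applied
    dot = f_att[0] * f_rep[0] + f_att[1] * f_rep[1]
    return dot / norm_product


def modified_sum(f_att, f_rep, k_alpha):
    """Classical resultant plus a term perpendicular to F_rep (assumed form of Eq. (9))."""
    scale = k_alpha * angle_cosine(f_att, f_rep) / (1.0 + math.hypot(*f_rep))
    extra = (scale * f_rep[1], -scale * f_rep[0])
    return (f_att[0] + f_rep[0] + extra[0], f_att[1] + f_rep[1] + extra[1])


# Trap configuration: equal norms and opposite directions, so the classical sum is zero.
trapped = modified_sum((2.0, 0.0), (-2.0, 0.0), k_alpha=3.0)
# Outside the influence zone the repulsive force is zero and only attraction remains.
free = modified_sum((1.5, -0.5), (0.0, 0.0), k_alpha=3.0)
```

In the trap configuration the classical resultant vanishes, while the modified resultant keeps a component perpendicular to both forces, which is what moves the robot off the line joining the obstacle and the goal. Outside the influence zone the correction disappears together with the repulsive force.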
4 Simulation Results
In this section we present MATLAB simulation results for the proposed algorithm. Considering a 50 × 50 m² surface, we place one robot at different starting positions. In each case, the vehicle has to reach the goal point while avoiding three stationary obstacles positioned so that a local-minimum configuration occurs. Each obstacle is represented by a red dot, and its influence zone is delimited by a red circle of radius R = 3 m.
We model the robot by a first-order integrator model:
$$\dot{q} = u, \quad u \in \mathbb{R}^2 \tag{10}$$
and the control law is given as the resultant artificial force:
$$u = F_{sum}(q) \tag{11}$$
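The closed loop formed by Eqs. (10) and (11) can be reproduced in a small Python sketch that integrates the model with forward-Euler steps. It is not the MATLAB simulation reported here: the gains, the step size, the single obstacle placed exactly between start and goal, and the summing of repulsive forces over obstacles are all assumptions made for illustration, and the correction term follows the same reading of Eq. (9) as above. Only the influence radius of 3 m is taken from the text.

```python
import math

# Illustrative gains (assumed); D0 = 3 m is the influence radius used in the text.
K_A, K_B, K_ALPHA, D0 = 1.0, 10.0, 2.0, 3.0


def resultant(q, goal, obstacles, modified):
    """Control input u = F_sum(q): Eq. (6), or the assumed form of Eq. (9) if modified."""
    fa = (K_A * (goal[0] - q[0]), K_A * (goal[1] - q[1]))  # Eq. (4)
    fr = [0.0, 0.0]
    for ox, oy in obstacles:  # Eq. (5), summed over obstacles
        dx, dy = q[0] - ox, q[1] - oy
        rho = math.hypot(dx, dy)
        if 0.0 < rho <= D0:
            gain = K_B * (1.0 / rho - 1.0 / D0) / rho ** 3
            fr[0] += gain * dx
            fr[1] += gain * dy
    ux, uy = fa[0] + fr[0], fa[1] + fr[1]
    na, nr = math.hypot(*fa), math.hypot(*fr)
    if modified and na > 0.0 and nr > 0.0:
        cos_alpha = (fa[0] * fr[0] + fa[1] * fr[1]) / (na * nr)
        scale = K_ALPHA * cos_alpha / (1.0 + nr)
        ux += scale * fr[1]
        uy -= scale * fr[0]
    return ux, uy


def simulate(start, goal, obstacles, modified, dt=0.01, steps=3000):
    """Forward-Euler integration of q' = u, Eqs. (10) and (11)."""
    q = start
    path = [q]
    for _ in range(steps):
        ux, uy = resultant(q, goal, obstacles, modified)
        q = (q[0] + dt * ux, q[1] + dt * uy)
        path.append(q)
    return path


goal, obstacles = (10.0, 0.0), [(5.0, 0.0)]
classic = simulate((0.0, 0.0), goal, obstacles, modified=False)
escaped = simulate((0.0, 0.0), goal, obstacles, modified=True)
```

In this collinear scenario the classical resultant never leaves the x axis and the robot stalls in front of the obstacle, whereas the perpendicular term moves the robot off the axis so that it can pass the obstacle and continue to the goal. The behaviour depends on the chosen gains and step size, so the sketch should be read as a qualitative illustration only.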
In Figs. 3, 4, 5 and 6, we plotted the planned trajectory for each case:
Fig. 3. Robot’s trajectory; case 1
We can observe that the vehicle reaches the goal point while avoiding the obstacles and following an optimal trajectory, without being trapped in any local-minimum configuration. As discussed in Sect. 2, the existing recent techniques lack simplicity and involve a high computation time. In contrast, the simulation results show that our method guarantees an optimal solution in terms of computation time, owing to the simplicity of expression (9). Consequently, it is a suitable solution for the online obstacle avoidance problem.
Local-Minimum-Free Artificial Potential Field Method
Fig. 4. Robot’s trajectory; case 2
Fig. 5. Robot’s trajectory; case 3
Fig. 6. Robot’s trajectory; case 4
5 Conclusion

In this paper we presented a modified Artificial Potential Field method that yields a local-minimum-free obstacle avoidance path planning algorithm. Our approach consists of detecting the local-minimum situation by measuring the angle formed by the two force vectors (the repulsive and attractive force vectors). We then add a new vector expressed in terms of this angle. The simulation results have shown the effectiveness of the proposed technique. Owing to the simplicity of expression (9), our method guarantees an optimal solution in terms of computation time, contrary to other methods such as the harmonic potential functions method presented in [10], where the computation time increases rapidly with the grid size, as stated in [3]. Given its simplicity and effectiveness, we believe that our modified APF method presents a suitable solution for the online obstacle avoidance problem. In future work, we intend to extend this technique to dynamic-obstacle avoidance, as part of the global objective of developing an efficient navigation algorithm for a multi-robot system accomplishing load transportation tasks within a dynamic, partially known environment.
References

1. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. Int. J. Robot. Res. 5, 90–98 (1986)
2. Patle, B., Babu, L.G., Pandey, A., Parhi, D., Jagadeesh, A.: A review: on path planning strategies for navigation of mobile robot. Def. Technol. 15, 582–606 (2019)
3. Sabudin, E.N., Omar, R., Melor, C.K., CKAN, H.: Potential field methods and their inherent approaches for path planning. ARPN J. Eng. Appl. Sci. 11(18), 10801–10805 (2016)
4. Kunchev, V., Jain, L., Ivancevic, V., Finn, A.: Path planning and obstacle avoidance for autonomous mobile robots: a review. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds.) KES 2006. LNCS (LNAI), vol. 4252, pp. 537–544. Springer, Heidelberg (2006). https://doi.org/10.1007/11893004_70
5. Minguez, J., Lamiraux, F., Laumond, J.-P.: Motion planning and obstacle avoidance. In: Siciliano, B., Khatib, O. (eds.) Springer Handbook of Robotics, pp. 1177–1202. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32552-1_47
6. Borenstein, J., Koren, Y.: Real-time obstacle avoidance for fast mobile robots. IEEE Trans. Syst. Man Cybern. 19, 1179–1187 (1989)
7. Borenstein, J., Koren, Y.: The vector field histogram-fast obstacle avoidance for mobile robots. IEEE Trans. Robot. Autom. 7, 278–288 (1991)
8. Ulrich, I., Borenstein, J.: VFH+: reliable obstacle avoidance for fast mobile robots. In: Proceedings of 1998 IEEE International Conference on Robotics and Automation (Cat. No. 98CH36146), vol. 2, pp. 1572–1577 (1998)
9. Fox, D., Burgard, W., Thrun, S.: The dynamic window approach to collision avoidance. IEEE Robot. Autom. Mag. 4, 23–33 (1997)
10. Kim, J., Khosla, P.: Real-time obstacle avoidance using harmonic potential functions. IEEE Trans. Robot. Autom. 8, 338–349 (1992)
11. Arslan, O., Koditschek, D.: Sensor-based reactive navigation in unknown convex sphere worlds. Int. J. Robot. Res. 38, 196–223 (2018)
12. Keyu, L., Yonggen, L., Yanchi, Z.: Dynamic obstacle avoidance path planning of UAV based on improved APF. In: 2020 5th International Conference on Communication, Image and Signal Processing (CCISP), pp. 159–163 (2020)
13. Rimon, E., Koditschek, D.E.: Exact robot navigation using artificial potential functions. IEEE Trans. Robot. Autom. 8, 501–518 (1992)
14. Koditschek, D.: Exact robot navigation by means of potential functions: some topological considerations. In: Proceedings of 1987 IEEE International Conference on Robotics and Automation, pp. 1–6. IEEE (1987)
Digital Transformation of Public Service Delivery Processes in a Smart City

Pavel Sitnikov¹,², Evgeniya Dodonova², Evgeniy Dokov¹, Anton Ivaschenko³(B), and Ivan Efanov⁴

¹ SEC “Open Code”, Yarmarochnaya 55, Samara 443001, Russia
² ITMO University, Kronverksky Pr. 49, bldg. A, St. Petersburg 197101, Russia
³ Samara State Technical University, Molodogvardeyskaya 244, Samara 443100, Russia
⁴ Administration of the Governor of the Samara Region, Molodogvardeyskaya 210, Samara 443006, Russia
Abstract. The paper presents a solution for the automated analysis and improvement of public service delivery processes in the course of their digital transformation. The proposed solution is based on specifying processes in Business Process Model and Notation (BPMN) and subsequently digitalizing them on an enterprise content management (ECM) software platform. An original system of efficiency indicators was developed to improve the methodology for evaluating digital processes. The proposed approach was implemented in Samara region for the digital transformation of several processes, including licensing the retail sale of alcoholic beverages, provision of monthly child support, issuance of permits for the use of land owned, and processing of citizens’ appeals by the Ministry of Health. The resulting solution is recommended for decision-making support in the digital transformation of regional management under Smart City development programs.

Keywords: Digital transformation · Business processes · Enterprise content management · Decision-making support
1 Introduction

Digital transformation of public service delivery processes is one of the necessary stages of Smart City development. The main problems in this area stem from the high inertia and inflexibility of traditional processes. Instead of being completely reconstructed and re-engineered on the basis of large-scale implementation of information and communication technologies, most processes are automated without considerable changes. As a result, the expected benefits of digital transformation remain unachieved. To solve this problem, we propose a reasoned system of efficiency indicators for the modernization of service delivery processes at the regional level of management, and their evaluation. The scientific novelty of the study lies in the capability of the proposed system of indicators to reflect both performance and transaction costs, which meets the original goals of digitalization. This paper proposes a software solution for the design, analysis and optimization of public service delivery processes, using an enterprise content management (ECM) software platform as a basis.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 332–343, 2022. https://doi.org/10.1007/978-3-030-82199-9_21
2 State of the Art

Modern trends in the digital transformation of social and economic systems, including the modernization of public service delivery processes, are described in [1, 2]. Their implementation in state and regional management requires “digital thinking”. The main approaches to the study of socio-cultural factors in the economy, as well as key modern trends in empirical research (including the identification of cause-and-effect relationships), are presented in [3].

The deep implementation of computer technologies helped to increase the efficiency of accounting and process control, and integrated information systems started to play a key role in building business processes [4, 5]. With the help of the Internet, modern citizens are provided with almost unlimited opportunities for virtual interaction with public affairs and with each other, including in the area of governmental services. Further steps were closely related to the development of electronic government [6–8], in which a significant proportion of governmental functions is transferred to the information system. Given the current trends in digital transformation, state information systems are becoming open platforms for information interaction between suppliers and consumers of public services.

Interactions in business or public service delivery processes are described and analyzed using Business Process Model and Notation (BPMN) [9, 10]. This notation is fairly common in business engineering and software development and is broadly used for the formalization and documentation of organizational structures, events, activities, gateways, sequence flows, and data models. It is understandable and easy to use for both business and technical users. Process modeling and modernization is a part of e-Government development [11, 12]. Detailed workflow analysis provides the information needed to interconnect public bodies electronically and to organize the collaboration of their business processes so that services are delivered effectively and efficiently.

Taking into account the human factor and the influence of social and economic systems on business processes, BPMN has developed into a subject-oriented approach to business process management (S-BPM), which conceives a process as a collaboration of multiple subjects organized via structured communication [13, 14]. This concept allows process-oriented notations to be used to describe self-organized complex systems and to automate and simulate them using multi-agent technologies [15].

According to the above-mentioned approaches, all digital processes require a software platform for support and management. This platform should include a special component for business process modeling and analysis in order to provide adequate and sufficient feedback. In real applications, such a platform can be instantiated as an enterprise content management (ECM) system [16, 17]. According to the classical definition, ECM extends the concept of content management by adding a timeline for each content item and possibly enforcing processes for its creation, approval and distribution.

Therefore, the problem of the digital transformation of service organization can be studied from the perspective of business process modeling and subsequent optimization, with the ECM system as the basis. In this paper we propose to implement such an approach using the model of an intermediary service provider [18], extended by a new system of key performance indicators.
3 The Formal Model and System Objectives

The formal model of public service delivery processes extends the concept of an intermediary service provider. It was developed to formalize the problem statement of evaluating and optimizing public service delivery processes using BPMN. Let us denote the core BPMN elements by $s_{i,j}$ for the process $p_i = \{s_{i,j}\}$, where $i$ is the process instance number and $j$ is the number of the state. An event is specified as a Boolean variable:

$$e_{i,j,k} = e_{i,j,k}\big(s_{i,j},\, t_{i,j,k}\big) \in \{0, 1\}, \tag{1}$$

where $t_{i,j,k}$ represents the moment of occurrence. From the public service delivery perspective, three types of events have special significance:

• $e_{i,0,0}$ - service request, received from the customer;
• $e_{i,1,1}$ - service delivery process start event;
• $e_{i,N_i,N_i}$ - service delivery process end event.

Consequently, the first process efficiency indicator can be formulated, which is the maximum waiting time:

$$\max_i \big(t_{i,1,1} - t_{i,0,0}\big) \to \min. \tag{2}$$

An activity or task is specified as:

$$a_{i,j,n} = a_{i,j,n}\big(s_{i,j},\, t_{i,j,n},\, w_{i,j,x},\, q_{i,j,y}(t_{i,j,y})\big), \tag{3}$$

where $t_{i,j,n}$ represents the activity duration; $w_{i,j,x}$ points to the responsible executor (involved staff); $q_{i,j,y}$ denotes an external interagency request, if needed; and $t_{i,j,y}$ represents the time interval between the interagency request and the response. Process costs with regard to the required resourcing are estimated by the involved staff:

$$\sum_i \frac{1}{N_i} \sum_x w_{i,j,x} \to \min, \tag{4}$$

the number of interagency requests:

$$\sum_i \frac{1}{N_i} \sum_y q_{i,j,y} \to \min, \tag{5}$$

and the interagency request/response time:

$$\max_i \max_y t_{i,j,y} \to \min. \tag{6}$$
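The indicators (2) and (4) to (6) can be computed directly from records of process instances. The following Python sketch shows one possible way to do so. The record layout (`ProcessInstance`, `Activity`) and all field names are hypothetical, and averaging per process instance is an assumption about how the normalisation in (4) and (5) is meant to be read.

```python
from dataclasses import dataclass, field

@dataclass
class Activity:
    staff: set                                        # ids of responsible executors (w)
    request_days: list = field(default_factory=list)  # response time of each interagency request (q)

@dataclass
class ProcessInstance:
    requested_at: float    # t_{i,0,0}, in minutes
    started_at: float      # t_{i,1,1}, in minutes
    activities: list

def max_waiting_time(instances):
    # Eq. (2): worst delay between the service request and the process start.
    return max(p.started_at - p.requested_at for p in instances)

def mean_involved_staff(instances):
    # Eq. (4), read here as distinct officials per process instance.
    counts = [len(set().union(*(a.staff for a in p.activities))) for p in instances]
    return sum(counts) / len(counts)

def mean_interagency_requests(instances):
    # Eq. (5), read here as interagency requests per process instance.
    counts = [sum(len(a.request_days) for a in p.activities) for p in instances]
    return sum(counts) / len(counts)

def max_interagency_response(instances):
    # Eq. (6): slowest interagency response over all instances and requests.
    return max((d for p in instances for a in p.activities for d in a.request_days),
               default=0.0)

instances = [
    ProcessInstance(0, 15, [Activity({"clerk"}, [5.0]),
                            Activity({"clerk", "expert"}, [2.0, 3.0])]),
    ProcessInstance(10, 15, [Activity({"clerk"}), Activity({"lawyer"}, [1.0])]),
]
print(max_waiting_time(instances), mean_involved_staff(instances),
      mean_interagency_requests(instances), max_interagency_response(instances))
```

Each function returns a single number that an optimization step would try to reduce, which mirrors the "→ min" form of the indicators.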
A gateway is specified as:

$$g_{i,j,m} = g_{i,j,m}\big(c_{i,j,m},\, s_{i,j}\big), \tag{7}$$

where $c_{i,j,m}$ represents the conditions for optional paths. A sequence flow combines events, activities and gateways into a single process and determines the order in which activities are performed:

$$f_{i,j_1,j_2} = f_{i,j_1,j_2}\big(s_{i,j_1},\, s_{i,j_2}\big). \tag{8}$$

The introduction of virtual states for the core elements $s_{i,j}$ and of the dynamic characteristics $t_{i,j,k}$ and $t_{i,j,n}$ allows the problem to be stated in the form of target functions. The service term is defined as:

$$\sum_i \sum_{j=1}^{N_i} t_{i,j,n} \cdot P\big(s_{i,j}\big) \to \min, \tag{9}$$
where the indicator $P(s_{i,j}) = 1$ means that the element $s_{i,j}$ belongs to the critical path. The proposed model allows BPMN to be used to analyze and improve the processes of public service delivery in the spirit of digital transformation. The five process digital transformation efficiency indicators introduced above were supplemented by another five parameters that characterize the share of improved services in the total number of considered applications for public services. As a result, the following original system of key performance indicators was developed (see Table 1). These indicators mainly represent the quality and availability of public services. The values are calculated from the statistics received by the regional government. The frequency of service provision largely depends on its popularity with the consumer. The indicators affecting popularity are presented in Fig. 1.

Table 1. Processes digital transformation efficiency indicators.

| No | Indicator | Parameter type | Unit |
|---|---|---|---|
| 1 | Service delivery term, max | Performance | Days |
| 2 | Involved staff: the number of officials involved in the process and responsible for the provision of the service, per 1 copy of the process | Transaction costs | Staff units |
| 3 | Number of interagency requests | Transaction costs | Units |
| 4 | Interagency request/response time | Performance | Days |
| 5 | Maximum waiting time | Performance | Minutes |
| 6 | Share of automated sub-processes (steps) | Transaction costs | % |
| 7 | Share of cases of the public services provision in violation of the established deadline in the total number of considered applications for public services | Performance | % |
| 8 | Share of complaints of applicants received in the procedure of pre-trial appeal of decisions made in the course of the provision of public services and actions (inaction) of officials of the department, in the total number of applications for public services | Quality and availability | % |
| 9 | Share of violations of the implementation of the regulations, other regulatory legal acts, identified as a result of the control measures | Quality and availability | % |
| 10 | Share of applications for the provision of public services received in electronic form (of the total number of applications received) | Transaction costs | % |
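The service delivery term (indicator 1 in Table 1, expression (9)) sums the durations of the states that lie on the critical path. A minimal sketch follows, assuming that the process is an acyclic graph of states and that the critical path is the longest-duration path from the start state to the end state. The function name, the state names and the durations are hypothetical.

```python
from functools import lru_cache

def service_term(durations, flows, start, end):
    """Return (total duration, states) of the longest start-to-end path."""
    successors = {}
    for src, dst in flows:
        successors.setdefault(src, []).append(dst)

    @lru_cache(maxsize=None)
    def longest(state):
        # Longest remaining duration from `state` to `end`, with its path.
        if state == end:
            return durations[state], (state,)
        best = None
        for nxt in successors.get(state, []):
            tail = longest(nxt)
            if tail is not None and (best is None or tail[0] > best[0]):
                best = tail
        if best is None:
            return None  # dead end that never reaches `end`
        return durations[state] + best[0], (state,) + best[1]

    return longest(start)

# Durations in days for a hypothetical licensing process with two parallel branches.
durations = {"receive": 1, "check": 3, "interagency": 5, "decide": 2, "issue": 1}
flows = [("receive", "check"), ("receive", "interagency"),
         ("check", "decide"), ("interagency", "decide"), ("decide", "issue")]
print(service_term(durations, flows, "receive", "issue"))
```

In this example the interagency branch dominates, so shortening the document check alone would not reduce the service term. Only states on the returned path contribute, which is the role played by the indicator P in (9).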
Fig. 1. The indicators affecting popularity.
High values of demand, availability and awareness of a service can push the consumer to use it. At the same time, accessibility should be considered an infrastructure indicator, while demand and awareness are information indicators. It is worth noting that an increase in demand leads to an increase in popularity regardless of the other indicators. The same can be said for accessibility. Only high awareness combined with low demand and availability does not affect popularity. The values of these indicators can be obtained from various sources, such as social networks, media, open information services, restricted information bases and video analytics.
4 Methodology

Solving the stated problem requires automated calculation and analysis of the process digital transformation efficiency indicators. Analytical software was developed on the basis of the ECM system to describe the processes in BPMN and to automatically analyze their correspondence to the basic criteria of digital transformation. The basic methodological, organizational and technical principles of process design that are used for evaluation and improvement include the following activities:

1. Describe the existing business processes using BPMN 2.0.
2. Implement the formalized processes on the basis of an Enterprise Content Management (ECM) platform, which provides their digitalization and organizes the participants’ negotiation primarily in electronic form.
3. Evaluate the business processes using the new KPI system oriented to the goals of digital transformation and find the weaknesses and bottlenecks.
4. Recommend the corresponding improvements and validate the revised version of the business processes.

The user interface is presented in Fig. 2. Using the standard library of BPMN elements, the user can construct a business process and evaluate it. Issues are visualized by alerts and notifications placed near the corresponding diagram fragments. In this way the ECM system provides decision-making support and motivates the user to improve the process according to the digital transformation goals.

The typical procedure of process development is based on modeling the transfer of documents and information from one participant in the process (employee, department, ministry, etc.) to another. Thus, the process largely controls the actions of people on the basis of rules, regulations and instructions. In fact, however, each participant in a digital process manages data: introduces, changes, supplements, and processes information.

The methodology provides that the process of, for example, issuing a permit does not begin with the receipt of an application, its verification, registration and transfer to the executor. All these are typical steps that will either disappear or become regular automated procedures as all public services are transferred to electronic form. The first stage is the analysis of the final product, a document on the public service. Any document contains constant text and variables: names, addresses, amounts, as well as key parameters that determine whether the requested permission is granted or denied. The ECM system makes it possible to display processes of any complexity with a wide variety of connections and dependencies. Thus, having formed a pool of sub-processes for generating data, the process developer ensures the completion of the final task and determines the exact parameters of a negative answer, i.e. a refusal of the applicant’s request.
Fig. 2. BPMN model of a public service delivery process developed using an ECM system.
Despite the diversity and heterogeneity of the processes that form the products of the authorities, there are blocks of the same type that are repeated in different public services, e.g. full name, address, sets of document packages, data on the availability of benefits, verification of various data, etc. To optimize development, the designer provides the ability to save typical stages for use in various processes. This approach allows flexible processes to be formed from templates by selecting the necessary typical details for various areas of government activity. Focusing on the aspects of process digital transformation, such as data management, should stimulate the developer to design the process in the shortest and most optimal way. The ECM system provides the developer with a self-control tool that calculates the efficiency indicators of the process. Moreover, the efficiency indicators or KPIs should be aimed at the consumer of the service rather than the performer of the business process, so that the process becomes convenient, fast and high-quality primarily for the client.

The ECM system implements the SIPOC (Suppliers, Inputs, Process, Outputs, Customers) model, which is used in Six Sigma and Quality Management theory to define project boundaries and to view processes from above. This model allows the processes to be described in terms of the sequence of actions, the movement of information/goods/services between the stages of the process, and the relationships that arise between different participants as a result of the process. The model makes it possible to trace the business logic of the process at a high but manageable level of abstraction. Therefore, this model can serve as a supplier of material for formalizing business processes in generally accepted notations, e.g. BPMN 2.0.

According to the extended SIPOC model, the ECM system provides several interconnected tables. The first table contains general information about the process: the process name, the process owner, the result of the process, and the maximum allowable process time. The second table describes the process indicators. The number of indicators is determined by the developer of the diagram on the basis of the normative documents regulating the process, or by expert advice. The table contains the following columns: serial number, indicator name, type of indicator, unit, and value. The third table is the modified SIPOC table itself, which contains: step number; sign (Start Event, Task, Gateway); provider; inputs; process; outputs; clients; executor; the next step; indicators; comments; and instrumentation. The “Data” table describes the data that is used in the process (or its steps) or is created (enriched) as a result of the process or its steps. Each step contains information about the input data, which has: a type (e.g. a document); a title (e.g. a statement); the data contained (name, passport data, registration address, etc.); a data source; a channel of receipt (personal appeal, portal of public services, automated information system); and the tool used for data processing (manual, automated, automatic). The output is described similarly. The fifth table contains the names, numbers and dates of the regulatory and local regulatory acts that govern the process. Thus, the ECM system allows us to consider the process not just as a graphical diagram, but as an object that accumulates heterogeneous data for objective analysis, for making managerial decisions and for use in other processes. Each diagram element has a properties panel, filled in by the user according to SIPOC.
Time indicators of the process are displayed in a table located in the lower left corner of the working area and include the minimum, maximum and average time of the process. The minimum and maximum times are calculated as the sums of the corresponding values specified in the properties panel of each element. Information from the tabular SIPOC version can be exported to documents in doc, docx, xls, xlsx and pdf formats, and to schematic images in png and jpeg formats. This makes it possible to release a structured set of documents containing textual descriptions of the process, data and indicators, together with a graphic image of the process.
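The summation of per-element time values can be illustrated with a small Python sketch. The `SipocStep` record keeps only a subset of the SIPOC columns listed above, and the fields `min_days` and `max_days` are hypothetical names for the values entered in the properties panel. How the average time is derived is not specified in the text, so it is left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SipocStep:
    number: int
    sign: str                  # "Start Event", "Task" or "Gateway"
    process: str               # what is done at this step
    executor: str
    next_step: Optional[int]
    min_days: float = 0.0      # hypothetical property-panel fields
    max_days: float = 0.0

def time_indicators(steps):
    # Minimum and maximum process time as sums over the element properties.
    return {
        "min": sum(s.min_days for s in steps),
        "max": sum(s.max_days for s in steps),
    }

steps = [
    SipocStep(1, "Start Event", "Application received", "Portal", 2),
    SipocStep(2, "Task", "Verify documents", "Clerk", 3, 1.0, 3.0),
    SipocStep(3, "Task", "Prepare decision", "Expert", None, 2.0, 5.0),
]
print(time_indicators(steps))
```

A plain sum is appropriate only for a strictly sequential chain of steps; for diagrams with parallel branches a path-based calculation would be needed instead.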
5 Processes Verification and Evaluation

An additional description of business processes according to SIPOC is useful for modeling and optimization. Within this context the following optimization criteria become apparent: the speed/duration of the process, the throughput of the process, the cost of the process, stability (number of errors), controllability of the process, flexibility of the process (availability of route options), and the scalability of the process (growth of the processed volume). The following issues are considered: long processing time, duplication, unnecessary movements, and lack of information. With respect to these optimization criteria, most business processes have a potential for improvement of up to 30%.

Analyzing the process stages in terms of efficiency makes it possible to determine an integral assessment of the developed process as a whole. The integral assessment makes it possible to form a unified development concept, establish a rating of processes, create a competitive development environment, use best practices, etc. However, some factors are difficult to express mathematically although they significantly affect the process, e.g. the complexity of obtaining and processing the information, large amounts of data, and the multivariance of actions and relationships. To take these factors into account, a complexity factor is provided in the ECM system, by which the total sum of indicators for the process is weighted to determine the integral assessment. The value of the weight coefficient for each process should be set collegially, in accordance with regulations based on transparent criteria that are clear to all.

The ECM system implements a post-control tool with which the developers themselves can check the efficiency of the processes they have created. For this purpose, the various typical stages of the processes are assigned normalized numerical values in the form of valid “from – to” forks. Each developed process becomes subject to an automated compliance control procedure, whose purpose is to identify deviations of the indicators of the developed stages from the established norms. A deviation to the “better” side should be analyzed in order to qualify the development as a best practice. The processes are marked by colors: each process is assigned to the “yellow” middle zone, to the “green” zone for the best indicators, or to the “red” zone, for which optimization is needed. The efficiency indicators of business processes form the basis for a unified rating system for analyzing employees’ effectiveness, with hierarchical monitoring up to the level of the regional leadership. To form a unified rating system of business process efficiency, the ECM system implements an automatic assessment of the diagrams created by users for compliance with the rules of the BPMN 2.0 notation.
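The compliance control and the weighted integral assessment described above can be sketched as follows. This is an illustration under stated assumptions rather than the ECM system's actual logic: the function names are hypothetical, the zone boundaries are taken to be the two ends of the "from – to" fork, and the integral assessment is assumed to be the complexity factor multiplied by a plain sum of indicator scores.

```python
def zone(value, low, high, lower_is_better=True):
    """Classify an indicator against its normative "from - to" fork."""
    if low <= value <= high:
        return "yellow"                      # within the established norm
    better = value < low if lower_is_better else value > high
    return "green" if better else "red"      # best practice vs. needs optimization

def integral_assessment(scores, complexity):
    # Assumed form: complexity factor times the plain sum of indicator scores.
    return complexity * sum(scores)

# A service term of 18 days against a hypothetical norm of 20 to 30 days.
print(zone(18, 20, 30))                       # green
print(zone(35, 20, 30))                       # red
print(zone(50, 60, 80, lower_is_better=False))  # red: share of e-applications too low
```

The `lower_is_better` flag is needed because some indicators, such as durations, should decrease, while others, such as the share of electronic applications, should increase.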
The mechanism is activated by pressing the “Check Diagram” button, located on the toolbar after the block of buttons responsible for aligning the elements. This function becomes available immediately after opening. Both ready-made diagrams opened in the designer and diagrams newly created in it can be checked. The list of issues considered includes the following:

• No elements: no item is placed in the work area, or an unnamed item is placed in the workspace;
• Any element other than the start event: adding any element other than the start event and the end event to an empty stage when there are no start and end events on the stage;
• Start event: adding by the user from the toolbox, provided that there are already one or more start events connected to end events in the workspace; multiple start events in the pool/lane; initial event has no name;
• Pool/Lane: adding the expanded pool/lane to the blank work area (not applicable to collapsed pool);
• Task: no name; no input control flows; no output control flows; more than one input flow; more than one output flow;
• Gateway: no name (except for collecting gateways); no input and output control flows; one input and one output flow; no default flow (except parallel gateway); more than one input flow;
• Control flow: connects items from top to bottom or connects items from bottom to top;
• End event: no name; no starting events on the diagram; no control flow connection from start event; located to the left of the start event;
• Data object: no name;
• Storage: no name.

Issue notifications appear when the “Check Diagram” button is pressed, or when an element is moved to the workspace while this button is pressed, and disappear after clicking on the problematic element. After corrections are made according to the recommendations specified in the notification, the warnings are no longer displayed.
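A few of the rules listed above can be expressed as a simple validation routine. The sketch below is not the “Check Diagram” implementation: the diagram representation (a dictionary of elements plus a list of flows), the function name and the message texts are hypothetical, and only a subset of the rules is covered. Unlike the rule list, it flags every unnamed element, including collecting gateways.

```python
def check_diagram(elements, flows):
    """elements: {id: (kind, name)}; flows: list of (source_id, target_id)."""
    if not elements:
        return ["No elements: the work area is empty"]
    issues = []
    incoming = {eid: 0 for eid in elements}
    outgoing = {eid: 0 for eid in elements}
    for src, dst in flows:
        outgoing[src] += 1
        incoming[dst] += 1
    kinds = [kind for kind, _ in elements.values()]
    if "start" not in kinds:
        issues.append("Start event: missing")
    if "end" not in kinds:
        issues.append("End event: missing")
    for eid, (kind, name) in elements.items():
        if not name:
            issues.append(f"{eid}: no name")
        if kind == "task":
            if incoming[eid] != 1:
                issues.append(f"{eid}: task needs exactly one input flow")
            if outgoing[eid] != 1:
                issues.append(f"{eid}: task needs exactly one output flow")
        if kind == "gateway" and incoming[eid] == 1 and outgoing[eid] == 1:
            issues.append(f"{eid}: gateway with one input and one output flow")
    return issues

broken = {"t1": ("task", ""), "g": ("gateway", "Check"), "e": ("end", "Done")}
for issue in check_diagram(broken, [("t1", "g"), ("g", "e")]):
    print(issue)
```

Returning a list of messages keyed by element id matches the described behaviour of showing a notification next to each problematic diagram fragment.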
6 Implementation and Tests

The proposed approach was implemented in Samara region for the description, modeling and analysis of public service delivery processes with regard to the goals of digital transformation. The first group of processes for optimization includes the provision of monthly child support, licensing the retail sale of alcoholic beverages, issuance of permits for the use of land owned, and processing of citizens’ appeals by the Ministry of Health. Figure 3 presents the optimized process of licensing the retail sale of alcoholic beverages as an example. This administrative regulation governs the state service of licensing the retail sale of alcoholic beverages provided by the regional Ministry of Industry and Trade. The regulation was developed in order to improve the quality and accessibility of state services, to create favorable conditions for the participants in relations arising from the licensing of this type of activity, and to establish the order, timing and sequence of administrative procedures in providing public services. It also establishes the procedure for the interaction of the Ministry with legal persons upon the issuance, extension, renewal and early termination of licenses.

After this process was formally described and improved using the ECM system, taking into account the possibilities of digital transformation, its efficiency indicators improved as presented in Table 2. In addition to formalization and decision-making support, the ECM system makes it possible to evaluate the process and check its correspondence to the goals of digital transformation. The presented approach demonstrates the efficiency of a BPMN-based description of AS-IS and TO-BE processes, which is beneficial in terms of visualization and improvement.
Fig. 3. Licensing the retail sale of alcoholic beverages process (fragment).
Table 2. Digital transformation efficiency indicators for initial and optimized processes of licensing the retail sale of alcoholic beverages

| Indicator | Initial processes | Optimized processes |
|---|---|---|
| Service delivery term, days | 30 | 18 |
| Involved staff, staff units | 5 | 5 |
| Number of interagency requests, units | 6 | 6 |
| Interagency request/response time, days | 5 | 3 |
| Maximum waiting time, min | 15 | 5 |
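The relative improvement implied by Table 2 can be worked out directly from its values. The short Python sketch below does this arithmetic; the dictionary simply restates the table, and the helper name `reduction_percent` is hypothetical.

```python
table2 = {
    "Service delivery term, days": (30, 18),
    "Involved staff, staff units": (5, 5),
    "Number of interagency requests, units": (6, 6),
    "Interagency request/response time, days": (5, 3),
    "Maximum waiting time, min": (15, 5),
}

def reduction_percent(initial, optimized):
    # Relative reduction of an indicator with respect to its initial value.
    return 100.0 * (initial - optimized) / initial

for name, (initial, optimized) in table2.items():
    print(f"{name}: {reduction_percent(initial, optimized):.1f}% reduction")
```

The service delivery term and the interagency response time each drop by 40%, the maximum waiting time drops by about two thirds, and the staff and interagency request counts are unchanged.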
7 Conclusion

The results of implementing and trialing the ECM system in practice illustrate the necessity of performing digital transformation rather than formally applying modern information and communication technologies to automate the existing processes. The proposed set of process digital transformation efficiency indicators is recommended for the analysis and evaluation of existing procedures and processes of public service delivery. Study limitations include the lack of feedback from the real sector of the economy, which is expected to be obtained in the next steps of the research. The resulting solution is recommended for decision-making support in the digital transformation of regional management.
References

1. Digital Russia. New Reality. Digital McKinsey, 133 p. https://www.mckinsey.com/ru/our-work/mckinsey-digital (2017)
2. Patel, K., McCarthy, M.P.: Digital transformation: the essentials of e-business leadership, 134 p. KPMG/McGraw-Hill (2000)
3. Auzan, A.A., et al.: Sociocultural factors in economics: milestones and perspectives. Vopr. Ekon. 7, 75–91 (2020)
4. Bourgeois, D.T.: Information systems for business and beyond. The Saylor Academy, p. 163 (2014)
5. Romero, D., Vernadat, F.: Enterprise information systems state of the art: past, present and future trends. Comput. Ind. vol. 79. https://doi.org/10.1016/j.compind.2016.03.001 (2016)
6. Cordella, A., Iannacci, F.: Information systems in the public sector: the e-Government enactment framework. J. Strat. Inf. Syst. 19, 52–66 (2010)
7. Rose, W.R., Grant, G.G.: Critical issues pertaining to the planning and implementation of e-government initiatives. Gov. Inf. Q. 27, 26–33 (2010)
8. Weerakkody, V., El-Haddadeh, R., Sabol, T., Ghoneim, A., Dzupka, P.: E-government implementation strategies in developed and transition economies: a comparative study. Int. J. Inf. Manage. 32, 66–74 (2012)
9. Pantelic, S., Dimitrijevic, S., Kostić, P., Radović, S., Babović, M.: Using BPMN for modeling business processes in e-government – case study. In: The 1st International Conference on Information Society, Technology and Management, ICIST 2011 (2011)
10. Recker, J.: BPMN research: what we know and what we don’t know. In: Mendling, J., Weidlich, M. (eds.) BPMN 2012. LNBIP, vol. 125, pp. 1–7. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33155-8_1
11. Palkovits, S., Wimmer, A.M.: Processes in e-government – a holistic framework for modeling electronic public services. Lect. Notes Comput. Sci. 2739, 213–219 (2003)
12. Scholl, H.J.: E-government: a special case of ICT-enabled business process change. In: Proceedings of the 36th Annual Hawaii International Conference on System Sciences, Big Island, Hawaii, 12 p. (2003)
13. Fleischmann, A., Kannengiesser, U., Schmidt, W., Stary, C.: Subject-oriented modeling and execution of multi-agent business processes. In: Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2013 IEEE/WIC/ACM International Joint Conferences, pp. 138–145 (2013)
14. Fleischmann, A., Schmidt, W., Stary, C. (eds.): S-BPM in the wild. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-17542-3
15. Gorodetskii, V.I.: Self-organization and multiagent systems: I. Models of multiagent self-organization. J. Comput. Syst. Sci. Int. 51(2), 256–281 (2012)
16. Kampffmeyer, U.: ECM enterprise content management, Verlag/Herausgeber Project Consult, 84 p. (2006)
17. Surnin, O.L., Sitnikov, P.V., Ivaschenko, A.V., Ilyasova, N.Yu., Popov, S.B.: Big Data incorporation based on open services provider for distributed enterprises. In: CEUR Workshop Proceedings, vol. 1903, pp. 42–47 (2017)
18. Ivaschenko, A., Lednev, A., Diyazitdinova, A., Sitnikov, P.: Agent-based outsourcing solution for agency service management. In: Bi, Y., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2016. LNNS, vol. 16, pp. 204–215. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-56991-8_16
Prediction of Homicides in Urban Centers: A Machine Learning Approach
José Ribeiro1,2(B), Lair Meneses2, Denis Costa2, Wando Miranda3, and Ronnie Alves1,4
1 Federal University of Pará, Belém, PA, Brazil
2 Federal Institute of Pará, Ananindeua, PA, Brazil
{lair.meneses,denis.costa}@ifpa.edu.br
3 Secretariat of Public Security and Social Defense, Belém, PA, Brazil
4 Vale Institute of Technology, Belém, PA, Brazil
Abstract. The computing community has produced relevant research on machine learning models capable of predicting the occurrence of crimes, analyzing crime contexts, extracting profiles of individuals linked to crime, and analyzing crimes over time. However, models that predict a specific crime, such as homicide, are not commonly found in the current literature. This research presents a machine learning model to predict homicide crimes, using a dataset built from generic data (without dependencies on the study location) based on incident report records for 34 different types of crimes, along with time and space data from crime reports. Experimentally, data from the city of Belém, Pará, Brazil were used. These data were transformed to make the problem generic, enabling the replication of this model in other locations. Analyses were performed with simple and robust algorithms on the created dataset. Statistical tests were performed with 11 different classification methods, and the results concern the prediction of the occurrence and non-occurrence of homicide crimes in the month subsequent to the occurrence of other registered crimes, with 76% accuracy for both classes of the problem, using Random Forest. The results are considered a baseline for the proposed problem.
Keywords: Prediction · Homicide · Crime · Tabular · Classification
1 Introduction
New technologies in smart cities are increasingly part of our daily lives. Several technologies are helping cities to become green, clean, organized, and safe. It is important to note that intelligent machine learning models provide part of these technological advances [1]. With the large accumulation of data by public security institutions (criminological data), researchers have been able to create machine learning models that perform crime prediction [2–5].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): IntelliSys 2021, LNNS 296, pp. 344–361, 2022. https://doi.org/10.1007/978-3-030-82199-9_22
Computational problems in the area of criminology, such as identifying a criminal profile, exploring the context of crime, and predicting crimes, pose interesting challenges for computing. These challenges have enabled research on these themes from different perspectives [6–10]. Criminology data is directly linked to the time and space characteristics of the regions where crimes were recorded [11], which makes it inappropriate to use a machine learning model created and trained for a specific region in another, unknown region. To enable different data exploration studies, cities and regions in several countries have made their crime data available, such as: Ontario – Canada [12], Toronto – Canada [13], England and Wales [14], San Francisco - United States [15], and Boston - United States [16]. In the creation of a machine learning model, for regression or classification, all stages of its development are directly linked to the type and nature of the data's context and the proposed problem [11]. Thus, when working with criminology data, for example in a homicide prediction model, the model's characteristics and development stages are unique and adapted to the specificities of time and space linked to the context of a region. However, part of the model development strategy may be adapted and used for other regions in the same country or elsewhere in the world. For example, [2] and [5] present similar strategies for crime type prediction but use databases from different locations.
The dynamics of how different crimes occur in a city can be explained by different theories in the area of Social Sciences, such as those related to a single person, called the “Theory of Understanding the Motivations of Individual Behavior”, and the “Theory of Associated Epidemiology”, which studies how criminal behavior is distributed and displaced in the time and space of a locality [17, 18]. The way this research seeks to understand the relationship between the different crimes it deals with is inspired by the second theory. Research shows the existence of an interrelation between the numbers of occurrences of different types of crimes, found by comparing historical data of specific single crimes and also of groupings of crimes, such as crimes against the person (consummated murder, attempted homicide, consummated rape, attempted rape, kidnapping, etc.) and crimes against property (theft, armed robbery, theft followed by death, etc.) [19–21]. That is, with data on the numbers of different crimes, it becomes possible to carry out a prediction process for a crime of interest. Data related to the general context of a crime, which may explain different aspects that motivated its occurrence, can assume high dimension because, in addition to time and space [11], data from social networks [22], specific personal conditions of individuals [23], specific contexts of the problem to be solved [24], and even climatic data from the environment [2] can be used. However, we focus on working with time and space data of the occurrence of crimes in order to predict homicides. This research presents a transformation of criminology data, different from those existing in the literature, that allows the creation of a dataset aimed at generically predicting homicide crimes, without dependencies on specific attributes of the city of study.
To standardize the proposal of attributes used by the model (input) presented here, an experimental study was carried out with data from occurrence bulletins (police reports) in the city of Belém do Pará, from the years 2016 to 2019, which can be replicated in other locations. From the elaboration of the proposed model, analysis, and discussions, the main contributions of this research are:
• For the machine learning area, regarding the problem of predicting specific crimes: the data transformations performed to create the proposed tabular database differ from those found in the current literature, since they relate the counts of different types of crimes (independent variables) to the prediction of the occurrence or non-occurrence of homicide crimes in the near future. This dataset was developed to minimize the dependence on attributes characteristic of or specific to the study region, so that it can be more easily replicated in other regions, since this work describes the generic attributes used as inputs of the models;
• For the community that works with criminology data: this study offers a performance baseline of machine learning models of different complexities for the problem of predicting homicide crimes, which can be adapted and replicated in other cities that have criminology data similar to those used in this research.
2 Related Work
This research did not find works on the prediction of homicide crimes using data similar to those used here. However, some studies address the prediction of types of crimes (in general, not just a specific one), which is similar to but still different from what is presented in this study. In this sense, machine learning studies using crime data are important both for the computing community and for society in general. Examples of such research include:
• “Crime Type Prediction”. Description: The research objective is to identify the spatial, temporal, weather, and event features most often associated with specific crime types, using machine learning models based on different algorithms: Logistic Regression, Random Forest, Decision Tree, and XGBoost. The data used were Chicago Crime Data and Historical Hourly Weather Data 2012–2017, all referring to the United States [2];
• “San Francisco Crime Classification”. Description: Proposal of classification machine learning models capable of predicting the type of crime that may occur in the city of San Francisco, United States, from the time and place of the crime in question. The tested algorithms are: Logistic Regression, Naive Bayes, and Random Forest [25];
• “A Prediction Model for Criminal Levels Specialized in Brazilian Cities”. Description: This paper proposes a data mining model for predicting criminal levels in urban geographic areas. The model was proposed to work with Brazilian data (from the city of Fortaleza – CE) between 2007 and 2008, specifically criminal and socioeconomic data. Several algorithms were tested, but the best results were obtained with neural networks [3];
• “Addressing Crime Situation Forecasting Task with Temporal Graph Convolutional Neural Network Approach”. Description: Article on a proposed machine learning model based on a Graph Convolutional Neural Network and a Recurrent Neural Network to capture the spatial and temporal dynamics of crime statistics recorded in the city of San Francisco - California, USA, from 2003 to 2018 [10];
• “Crime Data Analysis and Prediction of Perpetrator Identity Using Machine Learning Approach”. Description: A complete article on analyzing and predicting the profile of perpetrators of crimes using machine learning. In this study, homicide data from the city of San Francisco (United States) from 1981 to 2014 were used [26];
• “Crime Pattern Detection, Analysis & Prediction”. Description: A complete article on analyzing crime data by detecting patterns and making predictions. The data used in this research refer to six different categories of crimes registered over an interval of 14 years (2001–2014) in the city of Thane (India) [9];
• “Predictive Policing Software - PredPol”. Description: Software for police use, focused on the monitoring and analysis of several variables in a micro-region, enabling the prediction of the probability of occurrence of specific crimes with a location and time suggested by the tool. This tool is a success story of the application of intelligent algorithms to criminology data, as it is currently used by security institutions in countries such as the United States of America. Details about the data analyzed by this tool to make predictions are omitted by the developers [27].
The works cited above show different approaches applied to problems in the area of criminology, highlighting methodological strategies based on prediction algorithms, time series analysis, use of spatial data as model input, non-use of personal data of individuals, and use of data from police reports from different cities around the world.
It is important to highlight that in all the studies cited, the machine learning models, directly or indirectly, use crime information related to time and space, as does the research described here. Another important observation is that in the studies [2, 3, 9, 10, 25, 26] the datasets created and used by the machine learning models have attributes that depend on the specific city of study, making the models less replicable to other locations. In this way, this research proposes a machine learning model based on time, space, and the counts of different crimes registered in the city of study, without significant dependencies on variables specific to the locality in question.
3 Crime Understanding
Explaining how crimes arise and are related to each other in an urban center is not an easy task, as there is a large set of data that can be explored in order to seek a coherent explanation of what leads individuals to commit crimes [11]. Research highlights the existence of correlations between high occurrences of crimes and socioeconomic variables in some regions, especially when the group evaluated consists of people in social vulnerability [28, 29]. This is one of the ways to explain how the process of disseminating crimes occurs in urbanized cities. However, such variables have not yet been used at the current maturity of this work. Older studies and theories in the area of Social Sciences explain social behavior linked to crimes. One of them, called Associated Epidemiology (AE), studies how criminal behavior is distributed in space and displaced over time [17]. More recent research [18] explained how one of the aspects of AE can be described through the Theory of Social Disorganization, which is defined as a systemic approach whose main focus is local communities, understood as a system of networks of formal and informal associations regarding friendships, relatives, jobs, cultures, economies, and even crimes that influence humans living in society [30]. The city of Belém, capital of the state of Pará in Brazil and the scenario of this study, has 1,393,399 inhabitants, according to its last survey carried out in 2010, and a human development index of 0.746 [31]. This city has institutions in the security area that carry out various actions aimed at tackling crime. Even so, it presented the third worst homicide rate among all Brazilian capitals (74.3) in a study carried out in 2017 [32]. In this way, the police report records of this city prove to be interesting sources of information for the study carried out here. Inspired by the above, this work applied a series of cleaning, transformation, and pre-processing steps to the database of police reports, allowing an algorithm to generalize different associations between the crimes registered in the city of study, facing the challenge of predicting homicide crimes in a predefined period of time.
4 Machine Learning Approach
4.1 The Data
The data used in this research were provided by the Assistant Secretariat of Intelligence and Criminal Analysis (SIAC) of the state of Pará, Brazil. Such data refer to the police reports registered during the years 2016 to 2018 in the city of Belém, Pará. The raw data can be characterized as transactional tabular data containing information such as the crime's id, occurrence date, registration date, two time fields, type, description, city, neighborhood, registering unit, and 31 other administrative context attributes. In terms of size, the database has 41 attributes and 507,065 instances, where each instance represents a police report registered in the city. Only 4 of the 41 attributes were used to create the new database, mainly because the other attributes were poorly filled in or were not directly related to the crime context. The 4 attributes in question are the crime's occurrence date, type, municipality, and neighborhood. The database has records of more than 500 different types of crimes. Theft, damage in traffic, threat, other atypical facts, bodily injury, embezzlement, and injury stand out among the eight most common crimes in the base, with nomenclatures defined by the Civil and Criminal codes of Brazil [33, 34]. Pre-processing and specific data transformations made it possible to analyze how different crimes are dynamically related to homicides in the city of study, because the occurrence of specific crimes is related to a context of conflict between individuals, and one crime may influence homicides [17, 24, 30].
4.2 Pre-process
The pre-processing procedures applied to the database of the proposed model are presented in the next paragraphs.
Attribute exclusion: 9 sparse attributes (with unregistered data) and the id; 21 attributes not directly related to the crime context; 2 attributes related to personal data of registered individuals; 2 attributes related to the location (street) of the crime, due to inconsistency; and 2 crime georeferencing attributes (latitude and longitude), due to inconsistency.
Record exclusion: records of occurrences in neighborhoods located on small islands or in rural areas, due to high inconsistency; police report instances considered non-crimes; and duplicated records (since a crime can be registered in more than one police report, in this case by different people).
We also removed special characters (such as: @#$%ˆ&*áíóúç?!ºª•§∞¢£™¡) and consolidated the neighborhoods of the city of study. This pre-processing and cleaning reduced the database from 507,065 instances to 453,932; this data loss is accepted in exchange for higher quality and consolidation. Then, the database was transformed from tabular transactional to tabular based on time and space, with minimum granularity equal to a month. At this stage, only neighborhoods that had records of crimes in all months of the years analyzed were considered, aiming to minimize loss of information in specific neighborhoods. Among the pre-processing procedures listed above, the transformation of the tabular transactional database into a tabular database based on time and space was inspired by Online Analytical Processing (OLAP) [35]. This transformation was necessary because the target model needed to predict crimes over time in months.
In this way, the tabular transactional database was transformed into a new tabular database that relates the year, the month, and the neighborhood of the crime to the counts of each of the registered crimes (Table 1). In Table 1, the year and neighborhood columns appear to facilitate the understanding of the tabular dataset's construction. However, these attributes are not used as model inputs. In other words, only the time attribute (month) together with the counts of the various crimes is used as input for the machine learning models developed. Preliminary analyses of the data were carried out to verify the possibility of working with data granularity equal to a day or a week, yet no strong correlations were identified between the model's input attributes and the prediction class (homicide on the next day or homicide in the next week) in either case. For data fairness, it was decided to leave the neighborhood attribute out of the algorithm inputs, as it was identified that this attribute alone was able to map the target class with an accuracy close to 80% for some specific neighborhoods.
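The transformation from one row per police report to one row per year, month, and neighborhood can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation; the record layout, the neighborhood labels, and the function name `monthly_crime_counts` are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical transactional records: (year, month, neighborhood, crime_type).
reports = [
    (2016, 1, "A", "threat"),
    (2016, 1, "A", "theft"),
    (2016, 1, "A", "theft"),
    (2016, 1, "B", "homicide"),
    (2016, 2, "A", "homicide"),
    (2016, 2, "A", "theft"),
]


def monthly_crime_counts(records):
    """Aggregate one-row-per-report data into one row per (year, month, neighborhood)."""
    counts = defaultdict(Counter)
    for year, month, neighborhood, crime_type in records:
        counts[(year, month, neighborhood)][crime_type] += 1

    crime_types = sorted({record[3] for record in records})
    table = []
    for key in sorted(counts):
        year, month, neighborhood = key
        row = {"year": year, "month": month, "neighborhood": neighborhood}
        # A crime type with no report in this cell gets an explicit zero count.
        row.update({f"{crime}_count": counts[key][crime] for crime in crime_types})
        table.append(row)
    return table


table = monthly_crime_counts(reports)
for row in table:
    print(row)
```

Each output row has the same set of count columns, which is what allows the counts to be used later as a fixed-length input vector.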
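Table 1. Illustration of tabular dataset construction.

| Year* | Month | Neighborhood* | Threat count | Theft count | Homicide count | … | Class |
|---|---|---|---|---|---|---|---|
| 2016 | 1 | 1 | 3 | 5 | 5 | … | 1 |
| 2016 | 1 | 2 | 5 | 7 | 0 | … | 0 |
| 2016 | 1 | 3 | 4 | 20 | 4 | … | 1 |
| … | … | … | … | … | … | … | … |
| 2016 | 2 | 1 | 5 | 40 | 5 | … | 1 |
| 2017 | 2 | 2 | 1 | 39 | 4 | … | 1 |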
Note: in Table 1, the Class attribute (with value 0 or 1) defines whether or not a homicide occurred in the month following the date of the instance, taking into account the year, month, and neighborhood of the instance.
In summary, the attributes year and month were used to organize the data, as was the spatial attribute neighborhood. However, both year and neighborhood (marked with a ‘*’ in Table 1) are considered meta-attributes by this research and only participate in data modeling (they are not passed as input to the algorithms). The remaining data, except for the class, refer to the counts of specific crimes (in specific years, months, and neighborhoods). As an example, the first row of Table 1 shows a record of the year 2016, month 1, neighborhood 1, with threat count 3, theft count 5, homicide count 5, and Class 1. The Class has value 1 because there was a homicide in the following month (month 2) in the same neighborhood 1, as can be seen in the penultimate row of the same table, specifically in the homicide count column, which has value 5. If that column had a value of 0, the class in question would be 0. Thus, the class has value 1 only because the homicide count of the following month was greater than 0. The class of the new tabular database was processed to become binary, taking values 0 or 1, where 0 is the absence of homicide and 1 is the existence of homicide, taking into account a specific year, month, and neighborhood. The balance between classes in the dataset was: 970 instances for class 0 (non-homicide) and 1,034 for class 1 (homicide). The new database thus underwent a significant dimensionality reduction, ending with 2,004 instances and 36 attributes: 34 quantitative attributes (counts of different crimes), 1 ordinal attribute referring to the month, and 1 binary class (0 or 1). All 34 crime count attributes went through min-max normalization, obtaining values between 0 and 1 [36]. The month attribute was converted to an integer ranging from 1 to 12 (one value per month of the year).
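The class construction and the min-max normalization described above can be sketched as follows. This is a hedged illustration, not the authors' code: the rows are hypothetical values modeled on Table 1, the names `add_next_month_class` and `min_max` are made up for the example, and two behaviors are assumptions that the text does not state, namely that December is followed by January of the next year and that rows without a following month are dropped.

```python
def add_next_month_class(rows):
    """Label a row 1 if the same neighborhood has a homicide in the following month."""
    homicides = {
        (r["year"], r["month"], r["neighborhood"]): r["homicide_count"] for r in rows
    }
    labeled = []
    for r in rows:
        # Assumption: December rolls over to January of the next year.
        if r["month"] == 12:
            following = (r["year"] + 1, 1, r["neighborhood"])
        else:
            following = (r["year"], r["month"] + 1, r["neighborhood"])
        if following not in homicides:
            continue  # Assumption: no following month means no label can be built.
        labeled.append({**r, "class": int(homicides[following] > 0)})
    return labeled


def min_max(values):
    """Rescale a list of numbers to the interval [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical rows modeled on Table 1 (values are illustrative only).
rows = [
    {"year": 2016, "month": 1, "neighborhood": 1, "theft_count": 5, "homicide_count": 5},
    {"year": 2016, "month": 1, "neighborhood": 2, "theft_count": 7, "homicide_count": 0},
    {"year": 2016, "month": 2, "neighborhood": 1, "theft_count": 40, "homicide_count": 5},
    {"year": 2016, "month": 2, "neighborhood": 2, "theft_count": 39, "homicide_count": 0},
]

labeled = add_next_month_class(rows)
scaled_theft = min_max([r["theft_count"] for r in rows])
print([r["class"] for r in labeled])
print(scaled_theft)
```

In this sketch the class of a row depends only on the homicide count of the next month in the same neighborhood, while the homicide count of the current month remains available as one of the input counts.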
This research did not reduce dimensionality with automatic methods such as feature selection, considering it unnecessary at this moment, since it is desired to obtain the maximum information from a crime context. The 34 crime attributes used in this study are: bodily injury, threat, assault, injury, theft, traffic injury, traffic damage, defamation, homicide, abandonment of the home, vicinal conflicts, marital conflicts, escape from home, rape of a vulnerable person, other atypical facts, vehicle theft, embezzlement, damage, civil damage, slander, family conflicts, drug trafficking, aggression/fight, misappropriation, physical aggression, reception, rape, disappearance of people, attempted murder, noise pollution, other frauds, disobedience, contempt, and disturbances of tranquility. The data used in this research are as reliable as possible, since they were provided by the public security institution of the study city. However, the presence (even if minimal) of noise in the data must be admitted; with this in mind, it is emphasized that the models developed in this research are tolerant of data errors. The data used as inputs for the machine learning models are completely generic, since they are made up of the month variable (1 variable) along with the counts of the different crimes that occurred in a neighborhood in a specific month (34 variables), which makes this methodology of using criminology data for the prediction of homicides easily replicable in other cities that have the same data. More details about the dataset and the analyses of this study are available on GitHub: https://github.com/josesousaribeiro/Pred2Town.
4.3 Analyzed Algorithms
After the database was cleaned, consolidated, pre-processed, transformed, and reduced in dimensionality as previously presented, this research carried out a series of tests with algorithms of different types: lazy learners (represented by K-Nearest Neighbors – KNN [37]), eager learners (represented by Support Vector Machine – SVM [38], Decision Tree – DT [39], Neural Network – NN [40], Naive Bayes – NB [41], Logistic Regression – LR [42]) and ensemble learners (represented by Gradient Boosting – GB [43], Random Forest – RF [44], Extreme Gradient Boosting – XGB [45], Light Gradient Boosting – LGBM [46], and CatBoosting – CB [47]), all implemented in Python.
As seen above, the idea is to create machine learning models using not only robust algorithms, such as ensemble learners, but also simpler and faster learning algorithms, such as lazy and eager learners, seeking to evaluate which algorithms can better exploit the database. An important observation is that only the Decision Tree, K-Nearest Neighbors, Naive Bayes, and Logistic Regression algorithms are considered transparent (highly explainable), a characteristic desired in the context of the proposed homicide prediction problem. The other algorithms are considered black boxes, so they are not self-explanatory. The database was divided into training (70%) and test (30%) sets using stratification by the target class, taking care that records from all neighborhoods and all years existed (in similar proportions) in both training and test data. The tuning process (Table 2) was performed using Grid Search [48] based on cross-validation with 7 folds, varying parameters common to the tested algorithms as well as parameters exclusive to each algorithm, to promote equal variation of parameters (as far as possible) without suppressing the specific differentials of each algorithm. Only the training data were used in this process, aiming at a fairer comparison between models. Table 2 presents the best parameters found by the grid search based on cross-validation with 7 folds, and the metric used to
measure the performance of each fold execution was the Area Under the ROC Curve (AUC) [49].

Table 2. Tuning process description.

| Model | Range of parameters | Best parameters found |
|---|---|---|
| CB | Learning rate: [0.1, 0.5]; Depth: [1, 6, 12]; Iterations: [10, 100, 200]; Grow policy: ['SymmetricTree', 'Depthwise', 'Lossguide']; Bagging temperature: [0, 0.5, 1] | Learning rate: 0.1; Iterations: 200; Grow policy: 'SymmetricTree'; Bagging temperature: 0 |
| DT | Min samples leaf: [1, 10, 20, 40]; Max depth: [1, 6, 12]; Criterion: ['gini', 'entropy']; Splitter: ['best', 'random']; Min samples split: [2, 5, 15, 20, 30] | Min samples leaf: 40; Criterion: 'entropy'; Splitter: 'random'; Min samples split: 2 |
| GB | Max depth: [1, 6, 12]; N. estimators: [10, 100, 200]; Min samples leaf: [1, 10, 20, 40]; Learning rate: [0.1, 0.5]; Loss: ['deviance', 'exponential']; Criterion: ['friedman_mse', 'mse', 'mae']; Max features: ['sqrt', 'log2'] | Max depth: 12; N. estimators: 100; Min samples leaf: 40; Learning rate: 0.1; Loss: 'exponential'; Criterion: 'mae'; Max features: 'sqrt' |
| LGBM | Learning rate: [0.1, 0.5]; Max depth: [1, 6, 12]; Bootstrap: [True, False]; N. estimators: [10, 100, 200]; Min data in leaf: [1, 10, 20, 40]; Boosting type: ['gbdt', 'dart', 'goss', 'rf']; Num. leaves: [31, 100, 200] | Learning rate: 0.1; Max depth: 1; Bootstrap: True; N. estimators: 100; Min data in leaf: 40; Boosting type: 'goss' |
| KNN | Leaf size: [1, 10, 20, 40]; Algorithm: ['ball_tree', 'kd_tree', 'brute']; Metric: ['str', 'callable', 'minkowski']; N. neighbors: [2, 4, 6, 8, 10, 12, 14, 16] | Leaf size: 1; Algorithm: 'ball_tree'; Metric: 'minkowski'; N. neighbors: 8 |
| LR | Solver: ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']; Penalty: ['l1', 'l2']; C: [0.001, 0.008, 0.05, 0.09, 0.1]; Max iter.: [50, 200, 400, 500, 600] | Solver: 'sag'; Penalty: 'l2'; C: 0.1; Max iter.: 50 |
| NB | Var. smoothing: [1e-5, 1e-7, 1e-9, 1e-10, 1e-12] | Var. smoothing: 1e-5 |
| NN | Learning rate: ['constant', 'invscaling', 'adaptive']; Solver: ['lbfgs', 'sgd', 'adam']; Activation: ['identity', 'logistic', 'tanh', 'relu']; Max iter.: [200, 300, 400]; Alpha: [0.0001, 0.0003]; Hidden layer sizes: [1, 2, 3, 4, 5] | Learning rate: 'invscaling'; Solver: 'adam'; Activation: 'tanh'; Max iter.: 300; Hidden layer sizes: 3 |
| RF | Max depth: [1, 6, 12]; Bootstrap: [True, False]; N. estimators: [10, 100, 200]; Min samples leaf: [1, 10, 20, 40]; CCP alpha: [0.0, 0.4]; Criterion: ['gini', 'entropy']; Max features: ['sqrt', 'log2'] | Max depth: 12; Bootstrap: True; N. estimators: 100; Min samples leaf: 1; CCP alpha: 0.0; Criterion: 'gini'; Max features: 'log2' |
| SVM | C: [0.001, 0.01, 0.1, 1, 10]; Kernel: ['linear', 'poly', 'sigmoid']; Shrinking: [True, False]; Degree: [1, 2, 3, 4, 5] | C: 10; Kernel: 'poly'; Shrinking: True; Degree: 1 |
| XGB | Max depth: [1, 6, 12]; N. estimators: [10, 100, 200]; Min samples leaf: [1, 10, 20, 40]; Booster: ['gbtree', 'gblinear', 'dart']; Sampling method: ['uniform', 'gradient_based']; Tree method: ['exact', 'approx', 'hist'] | Max depth: 1; N. estimators: 200; Min samples leaf: 1; Booster: 'gbtree'; Sampling method: 'uniform'; Tree method: 'approx' |
Cross-validation was used at this stage to identify the machine learning models that are most stable with respect to the input data. The AUC evaluation metric was chosen because it takes into account the successes and errors identified in both classes (1 and 0) of the problem. Thus, the AUC measures both successes and errors for homicides and non-homicides, a characteristic that is desirable given the nature of the problem, since predicting a homicide is just as important as predicting a non-homicide.
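The split and tuning procedure described above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions rather than the authors' pipeline: the data are synthetic (generated with `make_classification`), the grid is a small subset of the RF ranges in Table 2 chosen to keep the run short, and the random seeds and the shuffled stratified folds are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in with 35 input columns (illustrative only, not the paper's data).
X, y = make_classification(
    n_samples=400, n_features=35, n_informative=10, random_state=0
)

# 70/30 split, stratified by the target class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Small subset of the RF ranges listed in Table 2.
param_grid = {
    "max_depth": [1, 6, 12],
    "n_estimators": [10, 100],
    "criterion": ["gini", "entropy"],
}

# Grid search with 7-fold cross-validation on the training data only, scored by AUC.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=7, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean cross-validated AUC:", round(search.best_score_, 3))
# With scoring="roc_auc", score() reports AUC on the held-out test data.
print("held-out AUC:", round(search.score(X_test, y_test), 3))
```

Keeping the test data out of the search, as the text describes, means the held-out score is not influenced by the parameter selection.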
5 Discussion
The tests performed with the 11 algorithms were divided into two parts: A) Performance Analysis: comparison of performances based on Accuracy (ACC) and confusion matrices; B) Statistical Analysis: based on the Friedman test and the AUC score.
5.1 Performance Analysis
Accuracy was chosen to measure correctness at this stage of the tests, since it considers both true positives and true negatives in the metric calculation. That is, it considers the two classes of the problem to indicate the best performance. Table 3 shows the test results for each of the listed algorithms (ordered by ACC). The algorithms obtained accuracy values between 0.69 and 0.76. The best model (RF) obtained an accuracy of 0.76, followed by LGBM and XGB with 0.75 each; this difference is minimal and is not considered relevant. Considering the values presented in Table 3, it is difficult to identify which model is better than another, since the accuracy values are very close to each other. For further verification, the confusion matrices of all algorithms are presented in Table 4.
Table 3. Accuracy by model.

| Model | ACC | Model | ACC | Model | ACC |
|---|---|---|---|---|---|
| RF | 0.76 | LR | 0.74 | DT | 0.71 |
| LGBM | 0.75 | SVM | 0.74 | NB | 0.69 |
| XGB | 0.75 | CB | 0.74 | KNN | 0.69 |
| NN | 0.74 | GB | 0.72 | – | – |
In Table 4, it can be noted that the three highest accuracies were obtained by algorithms that balanced class 0 and class 1 (RF, LGBM, and XGB). The NN, LR, SVM, and DT algorithms showed considerable accuracy, but they tend to classify class 0 better than class 1. This is not desirable for this research, since both classes of the problem are considered important (Table 4).

Table 4. Comparison of confusion matrices.

| | RF 0 | RF 1 | LGBM 0 | LGBM 1 | XGB 0 | XGB 1 | NN 0 | NN 1 | LR 0 | LR 1 | SVM 0 | SVM 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 76% | 24% | 76% | 24% | 75% | 25% | 80% | 20% | 82% | 18% | 85% | 15% |
| 1 | 24% | 76% | 25% | 75% | 24% | 76% | 31% | 69% | 33% | 67% | 36% | 64% |

| | CB 0 | CB 1 | GB 0 | GB 1 | DT 0 | DT 1 | NB 0 | NB 1 | KNN 0 | KNN 1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 73% | 27% | 74% | 26% | 78% | 22% | 84% | 16% | 89% | 11% |
| 1 | 26% | 74% | 30% | 70% | 34% | 66% | 46% | 54% | 50% | 50% |
Note: The test class has 602 instances (311 of class 1 and 291 of class 0);
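The relationship between Table 4 and Table 3 can be checked with a short calculation. The sketch below assumes that each row of a confusion matrix in Table 4 is a true class, with percentages normalized per row (the rows sum to 100%, but the paper does not state the orientation explicitly). Under that assumption, the overall accuracy is the per-class success rates weighted by the class counts from the note. Because the percentages are rounded, the reconstruction can only be expected to match Table 3 to within about 0.01.

```python
# Class counts from the note under Table 4.
N_CLASS0, N_CLASS1 = 291, 311

# (success rate on class 0, success rate on class 1), read from the diagonal
# of each matrix in Table 4, assuming rows are true classes.
SUCCESS_RATES = {
    "RF": (0.76, 0.76), "LGBM": (0.76, 0.75), "XGB": (0.75, 0.76),
    "NN": (0.80, 0.69), "LR": (0.82, 0.67), "SVM": (0.85, 0.64),
    "CB": (0.73, 0.74), "GB": (0.74, 0.70), "DT": (0.78, 0.66),
    "NB": (0.84, 0.54), "KNN": (0.89, 0.50),
}

# Accuracies as reported in Table 3.
REPORTED_ACC = {
    "RF": 0.76, "LGBM": 0.75, "XGB": 0.75, "NN": 0.74, "LR": 0.74,
    "SVM": 0.74, "CB": 0.74, "GB": 0.72, "DT": 0.71, "NB": 0.69, "KNN": 0.69,
}


def overall_accuracy(rate0, rate1, n0=N_CLASS0, n1=N_CLASS1):
    """Weight each class's success rate by its number of test instances."""
    return (rate0 * n0 + rate1 * n1) / (n0 + n1)


for model, (rate0, rate1) in SUCCESS_RATES.items():
    acc = overall_accuracy(rate0, rate1)
    print(f"{model:5s} reconstructed={acc:.3f} reported={REPORTED_ACC[model]:.2f}")
```

The calculation also makes the imbalance visible: SVM and RF differ by only 0.02 in accuracy, yet SVM reaches that value with 85% success on class 0 and 64% on class 1, whereas RF reaches 76% on both.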
The results of CB and GB show relevant accuracy, but with slightly lower values than those of RF, LGBM, and XGB. Still concerning Table 4, NB and KNN showed some of the highest success rates in class 0, but also the highest error rates in class 1. Based on the analyses presented in Tables 3 and 4, the RF, LGBM, and XGB algorithms are considered the best classifiers for the homicide prediction problem. However, to identify the significance of the differences between the models analyzed, statistical analyses are performed below.

5.2 Statistical Analysis

A statistical analysis of the 11 machine learning models was carried out to assess two main aspects of the created models: stability and statistical significance.
To assess the stability of the models, cross-validation was run with 7 folds for each model. The outputs of each model were then evaluated with the AUC metric and, finally, a Kernel Density Estimate plot (Gaussian kernel with bandwidth 0.6) of the runs was created to show the distribution of the AUC values obtained across the 7 folds. This analysis is shown in Fig. 1.
Fig. 1. Density of AUC scores for cross-validation runs with 7 folds for RF, LGBM, and XGB.
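A density curve of this kind can be sketched with a plain Gaussian kernel density estimate over the per-fold AUC values. The fold scores below are hypothetical placeholders, not the paper's results. The text does not say whether the bandwidth of 0.6 is an absolute width or a scaling factor; this sketch assumes the latter and multiplies the sample standard deviation by 0.6, which is only one possible reading.

```python
import math


def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical AUC values for the 7 folds of one model (illustration only).
fold_aucs = [0.79, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86]

# Assumption: 0.6 is a scaling factor applied to the spread of the scores.
h = 0.6 * sample_std(fold_aucs)
density = gaussian_kde(fold_aucs, h)

for x in (0.775, 0.800, 0.825, 0.850, 0.875):
    print(f"AUC={x:.3f} density={density(x):.2f}")
```

A model whose fold scores lie closer together produces a taller and narrower curve, which is the sense in which a higher and more concentrated density is read as greater stability.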
Analyzing Fig. 1, the three algorithms present similar stability, since across the 7 runs a similar variation in performance was observed for each model, with AUC values between 0.775 and 0.875. Fig. 1 also shows that RF was more stable than LGBM and XGB, since its density curve is slightly higher (higher values on the y-axis) and more concentrated (narrower interval on the x-axis) than those of the LGBM and XGB models. To analyze the significance of the differences between the classification results of each algorithm in the 7-fold cross-validation runs mentioned above, statistical analyses based on the Friedman test were performed to compare each pair of tested algorithms. In the Friedman test, only p-value values