Intelligent Systems and Applications: Proceedings of the 2020 Intelligent Systems Conference (IntelliSys) Volume 2 [1st ed.] 9783030551865, 9783030551872

The book Intelligent Systems and Applications - Proceedings of the 2020 Intelligent Systems Conference is a remarkable c


English Pages XI, 783 [794] Year 2021



Table of contents :
Front Matter ....Pages i-xi
CapsNet vs CNN: Analysis of the Effects of Varying Feature Spatial Arrangement (Ugenteraan Manogaran, Ya Ping Wong, Boon Yian Ng)....Pages 1-9
Improved 2D Human Pose Tracking Using Optical Flow Analysis (Aleksander Khelvas, Alexander Gilya-Zetinov, Egor Konyagin, Darya Demyanova, Pavel Sorokin, Roman Khafizov)....Pages 10-22
Transferability of Fast Gradient Sign Method (Tamás Muncsan, Attila Kiss)....Pages 23-34
Design of an Automatic System to Determine the Degree of Progression of Diabetic Retinopathy (Hernando González, Carlos Arizmendi, Jessica Aza)....Pages 35-44
Adaptive Attention Mechanism Based Semantic Compositional Network for Video Captioning (Zhaoyu Dong, Xian Zhong, Shuqin Chen, Wenxuan Liu, Qi Cui, Luo Zhong)....Pages 45-55
Estimated Influence of Online Management Tools on Team Management Based on the Research with the Use of the System of Organizational Terms (Olaf Flak)....Pages 56-72
Liveness Detection via Facial Expressions Queue (Bat-Erdene Batsukh)....Pages 73-76
Java Based Application Development for Facial Identification Using OpenCV Library (Askar Boranbayev, Seilkhan Boranbayev, Askar Nurbekov)....Pages 77-85
Challenges in Face Recognition Using Machine Learning Algorithms: Case of Makeup and Occlusions (Natalya Selitskaya, Stanislaw Sielicki, Nikolaos Christou)....Pages 86-102
The Effects of Social Issues and Human Factors on the Reliability of Biometric Systems: A Review (Mohammadreza Azimi, Andrzej Pacut)....Pages 103-110
Towards Semantic Segmentation Using Ratio Unpooling (Duncan Boland, Hossein Malekmohamadi)....Pages 111-123
Adaptive Retraining of Visual Recognition-Model in Human Activity Recognition by Collaborative Humanoid Robots (Vineet Nagrath, Mossaab Hariz, Mounim A. El Yacoubi)....Pages 124-143
A Reasoning Based Model for Anomaly Detection in the Smart City Domain (Patrick Hammer, Tony Lofthouse, Enzo Fenoglio, Hugo Latapie, Pei Wang)....Pages 144-159
Document Similarity from Vector Space Densities (Ilia Rushkin)....Pages 160-171
Food Classification for Inflammation Recognition Through Ingredient Label Analysis: A Real NLP Case Study (Stefano Campese, Davide Pozza)....Pages 172-181
Classification Based Method for Disfluencies Detection in Spontaneous Spoken Tunisian Dialect (Emna Boughariou, Younès Bahou, Lamia Hadrich Belguith)....Pages 182-195
A Comprehensive Methodology for Evaluating Conversation-Based Interfaces to Relational Databases (C-BIRDs) (Majdi Owda, Amani Yousef Owda, Fathi Gasir)....Pages 196-208
Disease Normalization with Graph Embeddings (D. Pujary, C. Thorne, W. Aziz)....Pages 209-217
Quranic Topic Modelling Using Paragraph Vectors (Menwa Alshammeri, Eric Atwell, Mhd Ammar Alsalka)....Pages 218-230
Language Revitalization: A Benchmark for Akan-to-English Machine Translation (Kingsley Nketia Acheampong, Nathaniel Nii Oku Sackey)....Pages 231-244
A Machine Learning Platform for NLP in Big Data (Mauro Mazzei)....Pages 245-259
Recent News Recommender Using Browser’s History (Samer Sawalha, Arafat Awajan)....Pages 260-276
Building a Wikipedia N-GRAM Corpus (Jorge Ramón Fonseca Cacho, Ben Cisneros, Kazem Taghva)....Pages 277-294
Control Interface of an Automatic Continuous Speech Recognition System in Standard Arabic Language (Brahim Fares Zaidi, Malika Boudraa, Sid-Ahmed Selouani, Mohammed Sidi Yakoub, Ghania Hamdani)....Pages 295-303
Emotion Detection Throughout the Speech (Manuel Rodrigues, Dalila Durães, Ricardo Santos, Cesar Analide)....Pages 304-314
Understanding Troll Writing as a Linguistic Phenomenon (Sergei Monakhov)....Pages 315-334
Spatial Sentiment and Perception Analysis of BBC News Articles Using Twitter Posts Mining (Farah Younas, Majdi Owda)....Pages 335-346
Human-Machine Interaction for Improved Cybersecurity Named Entity Recognition Considering Semantic Similarity (Kazuaki Kashihara, Jana Shakarian, Chitta Baral)....Pages 347-361
Predicting University Students’ Public Transport Preferences for Sustainability Improvement (Ali Bakdur, Fumito Masui, Michal Ptaszynski)....Pages 362-376
Membrane Clustering Using the PostgreSQL Database Management System (Tamás Tarczali, Péter Lehotay-Kéry, Attila Kiss)....Pages 377-388
STAR: Spatio-Temporal Prediction of Air Quality Using a Multimodal Approach (Tien-Cuong Bui, Joonyoung Kim, Taewoo Kang, Donghyeon Lee, Junyoung Choi, Insoon Yang, Kyomin Jung, Sang Kyun Cha)....Pages 389-406
Fair Allocation Based Soft Load Shedding (Sarwan Ali, Haris Mansoor, Imdadullah Khan, Naveed Arshad, Safiullah Faizullah, Muhammad Asad Khan)....Pages 407-424
VDENCLUE: An Enhanced Variant of DENCLUE Algorithm (Mariam S. Khader, Ghazi Al-Naymat)....Pages 425-436
Detailed Clustering Based on Gaussian Mixture Models (Nikita Andriyanov, Alexander Tashlinsky, Vitaly Dementiev)....Pages 437-448
Smartphone Applications Developed to Collect Mobility Data: A Review and SWOT Analysis (Cristina Pronello, Pinky Kumawat)....Pages 449-467
A Novel Approach for Heart Disease Prediction Using Genetic Algorithm and Ensemble Classification (Indu Yekkala, Sunanda Dixit)....Pages 468-489
An Improved Algorithm for Fast K-Word Proximity Search Based on Multi-component Key Indexes (Alexander B. Veretennikov)....Pages 490-510
A Feedback Integrated Web-Based Multi-Criteria Group Decision Support Model for Contractor Selection Using Fuzzy Analytic Hierarchy Process (Abimbola H. Afolayan, Bolanle A. Ojokoh, Adebayo O. Adetunmbi)....Pages 511-528
AIS Ship Trajectory Clustering Based on Convolutional Auto-encoder (Taizheng Wang, Chunyang Ye, Hui Zhou, Mingwang Ou, Bo Cheng)....Pages 529-546
An Improved Q-Learning Algorithm for Path Planning in Maze Environments (Shimin Gu, Guojun Mao)....Pages 547-557
Automatic Classification of Web News: A Systematic Mapping Study (Mauricio Pandolfi-González, Christian Quesada-López, Alexandra Martínez, Marcelo Jenkins)....Pages 558-574
Big Data Clustering Using MapReduce Framework: A Review (Mariam S. Khader, Ghazi Al-Naymat)....Pages 575-593
A Text Extraction-Based Smart Knowledge Graph Composition for Integrating Lessons Learned During the Microchip Design (Hasan Abu Rasheed, Christian Weber, Johannes Zenkert, Peter Czerner, Roland Krumm, Madjid Fathi)....Pages 594-610
Clustering Approach to Topic Modeling in Users Dialogue (E. Feldina, O. Makhnytkina)....Pages 611-617
Knowledge-Based Model for Formal Representation of Complex System Visual Models (Andrey I. Vlasov, Ludmila V. Juravleva, Vadim A. Shakhnov)....Pages 618-632
Data Mining Solutions for Direct Marketing Campaign (Torubein Fawei, Duke T. J. Ludera)....Pages 633-645
A Review of Phishing URL Detection Using Machine Learning Classifiers (Sajjad Jalil, Muhammad Usman)....Pages 646-665
Data Mining and Machine Learning Techniques for Bank Customers Segmentation: A Systematic Mapping Study (Maricel Monge, Christian Quesada-López, Alexandra Martínez, Marcelo Jenkins)....Pages 666-684
Back to the Past to Charter the Vinyl Electronic Market: A Data Mining Approach (Sara Lousão, Pedro Ramos, Sérgio Moro)....Pages 685-692
Learning a Generalized Matrix from Multi-graphs Topologies Towards Microservices Recommendations (Ilias Tsoumas, Chrysostomos Symvoulidis, Dimosthenis Kyriazis)....Pages 693-702
Big Data in Smart Infrastructure (Will Serrano)....Pages 703-732
Academic Articles Recommendation Using Concept-Based Representation (Dina Mohamed, Ayman El-Kilany, Hoda M. O. Mokhtar)....Pages 733-744
Self-organising Urban Traffic Control on Micro-level Using Reinforcement Learning and Agent-Based Modelling (Stefan Bosse)....Pages 745-764
The Adoption of Electronic Administration by Citizens: Case of Morocco (Fadwa Satry, Ez-zohra Belkadi)....Pages 765-779
Back Matter ....Pages 781-783

Advances in Intelligent Systems and Computing 1251

Kohei Arai Supriya Kapoor Rahul Bhatia Editors

Intelligent Systems and Applications Proceedings of the 2020 Intelligent Systems Conference (IntelliSys) Volume 2


Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at



Editors Kohei Arai Saga University Saga, Japan

Supriya Kapoor The Science and Information (SAI) Organization Bradford, West Yorkshire, UK

Rahul Bhatia The Science and Information (SAI) Organization Bradford, West Yorkshire, UK

ISSN 2194-5357
ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-3-030-55186-5
ISBN 978-3-030-55187-2 (eBook)

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Editor’s Preface

This book contains the scientific contributions included in the program of the Intelligent Systems Conference (IntelliSys) 2020, which was held during September 3–4, 2020, as a virtual conference. The Intelligent Systems Conference is a prestigious annual conference on areas of intelligent systems and artificial intelligence and their applications to the real world. This conference not only presented state-of-the-art methods and valuable experience from researchers in the related research areas, but also provided the audience with a vision of further development in the fields.

We have gathered a multi-disciplinary group of contributions from both research and practice to discuss how intelligent systems are today architected, modeled, constructed, tested and applied in various domains. The aim was to further increase the body of knowledge in this specific area by providing a forum to exchange ideas and discuss results.

The program committee of IntelliSys 2020 represented 25 countries, and authors submitted 545 papers from 50+ countries. This certainly attests to the widespread, international importance of the theme of the conference. Each paper was reviewed on the basis of originality, novelty and rigor. After the reviews, 214 papers were accepted for presentation, of which 177 are published in the proceedings.

The conference would truly not function without the contributions and support received from authors, participants, keynote speakers, program committee members, session chairs, organizing committee members, steering committee members and others in their various roles. Their valuable support, suggestions, dedicated commitment and hard work have made IntelliSys 2020 successful. We warmly thank and greatly appreciate the contributions, and we kindly invite all to continue to contribute to future IntelliSys conferences.




It has been a great honor to serve as the General Chair for IntelliSys 2020 and to work with the conference team. We believe this event will certainly help further disseminate new ideas and inspire more international collaborations.

Kind Regards,
Kohei Arai
Conference Chair



CapsNet vs CNN: Analysis of the Effects of Varying Feature Spatial Arrangement

Ugenteraan Manogaran, Ya Ping Wong, and Boon Yian Ng

Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia

Abstract. Despite its success in recent years, the convolutional neural network (CNN) has a major limitation: the inability to retain spatial relationships between learned features in deeper layers. The capsule network with dynamic routing (CapsNet) was introduced in 2017 with the speculation that it can overcome this limitation. In our research, we created a suitable collection of datasets and implemented a simple CNN model and a CapsNet model of similar complexity to test this speculation. Experimental results show that both the implemented CNN and CapsNet models are able to capture the spatial relationships between learned features. Counterintuitively, our experiments show that our CNN model outperforms our CapsNet model on our datasets. This implies that the speculation does not seem to be entirely correct. This might be because our datasets are too simple and therefore require only a simple CNN model. We further recommend that future research be conducted using deeper models and more complex datasets to test the speculation.

Keywords: CapsNet · Convolutional Neural Network · Spatial relationship


Ever since Krizhevsky et al. [1] demonstrated the outstanding performance of a convolutional neural network (CNN) model on ImageNet, CNN has become the center of attraction for computer vision researchers to solve problems such as image segmentation, object detection, object localization, image classification, and image retrieval. Some of the well-known CNN models are AlexNet [1], GoogLeNet [2], VGG-16 [3], YOLO [4], and RCNN [5].

A significant advantage of CNN is its ability to maintain translation invariance for feature detection [6]. This means that the position in the input image of an object known to the CNN does not affect its performance: the CNN is able to recognize the object regardless of its position in the image. This is achieved by the use of pooling layers, which are also responsible for reducing the size of feature maps.

© Springer Nature Switzerland AG 2021
K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 1–9, 2021.



However, this advantage of CNN inevitably leads to its drawbacks. Two major drawbacks are the lack of rotational invariance [8] and the failure to retain spatial relationships between features [7]. The lack of rotational invariance causes a CNN to produce false negatives when an object known to the network is rotated beyond a certain extent. The failure to retain spatial relationships between features, on the other hand, causes the network to produce false positives [9]. Even though CNNs achieve state-of-the-art results on many challenges despite these drawbacks, the drawbacks can become serious concerns in applications such as security systems.

To overcome the lack of rotational invariance, augmenting the training data with rotations became a standard practice in the training of CNN models [10]. However, this also increases the training time tremendously. Therefore, to address the drawbacks of CNN, Sabour et al. [11] proposed a novel neural network architecture known as capsule network with dynamic routing (CapsNet). Unlike CNN, CapsNet produces outputs in the form of vectors instead of scalars. This allows CapsNet to retain the properties of an object such as rotation, skewness, and thickness. It has also been explicitly speculated in a number of research papers [9–11] that CapsNet would be able to retain the spatial relationships between features, in contrast to CNN. In other words, an object with features in the wrong places, such as a face with eyes below the nose instead of above it, would still be a face to a CNN, whereas CapsNet is speculated to be able to identify it as a non-face.

To the best of our knowledge, no prior work has presented comprehensive experiments and analysis on this speculation. In this paper, we present experiments and analysis that test both CapsNet and CNN on this speculation. We generated our own dataset for the purpose of our experiment and implemented a CapsNet and a CNN model. The models were evaluated on the generated dataset in such a way that the speculation is tested.
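The link between translation invariance and the speculated false positives can be illustrated with a deliberately extreme toy sketch. It assumes perfect feature detectors followed by global max pooling, which is a simplification chosen for clarity and not the models used in this paper (real CNNs pool over local regions and keep some coarse position information). All names below (`make_image`, `detect`, `pooled_features`, and the symbols `E`, `N`, `M` for eye, nose, and mouth) are hypothetical.

```python
# Toy illustration only: perfect feature detectors + global max pooling.
# This is an assumed simplification, not the architecture used in the paper.

def make_image(height, width, parts):
    """Build a 2D grid of symbols; `parts` maps (row, col) -> symbol."""
    image = [["." for _ in range(width)] for _ in range(height)]
    for (row, col), symbol in parts.items():
        image[row][col] = symbol
    return image

def detect(image, symbol):
    """Stand-in for a convolutional feature map: 1.0 where the feature fires."""
    return [[1.0 if cell == symbol else 0.0 for cell in row] for row in image]

def global_max_pool(feature_map):
    """Keep only the strongest activation and discard where it occurred."""
    return max(max(row) for row in feature_map)

def pooled_features(image):
    """One pooled scalar per feature, in the order (eye, nose, mouth)."""
    return tuple(global_max_pool(detect(image, s)) for s in ("E", "N", "M"))

# A face: eyes above the nose, mouth below it.
face = make_image(5, 5, {(1, 1): "E", (1, 3): "E", (2, 2): "N", (3, 2): "M"})
# The same face shifted down by one row.
shifted_face = make_image(5, 5, {(2, 1): "E", (2, 3): "E", (3, 2): "N", (4, 2): "M"})
# The same parts rearranged: eyes below the nose, mouth on top.
scrambled = make_image(5, 5, {(3, 1): "E", (3, 3): "E", (1, 2): "N", (0, 2): "M"})

print(pooled_features(face))
print(pooled_features(shifted_face))
print(pooled_features(scrambled))
```

In this sketch all three images yield the same pooled features, so any classifier placed on top of them cannot separate the face from the scrambled arrangement. The same operation that makes the response independent of position also discards the relative arrangement of the parts.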


Related Works

The concept of CNN was first proposed by LeCun et al. [12] in 1989. However, due to the lack of computational power and of available datasets, it was not until recent years that researchers were able to develop feasible models using modern high-performance computers. One notable breakthrough was the work of Krizhevsky et al. [1], which achieved state-of-the-art performance in the ImageNet challenge [13] in 2012. Since then, numerous studies have been conducted to develop more advanced CNN models that are used successfully in real-world applications such as speech recognition [14], gait recognition [15], steering controls in self-driving cars [16], human crowd detection [17], and medical image segmentation [18].

Despite successful demonstrations of CNN models, one of the pioneers, Geoffrey Hinton, argued that the current CNNs “are misguided in what they are trying to achieve” [19], due to the use of pooling layers for subsampling in CNN models. The models lose the ability to compute precise spatial relationships between learned features in the deeper layers. When a pooling layer is used between convolutional layers, only the most active neuron in a local region of a feature map is retained while the rest of the neurons are disregarded. Disregarding these neurons causes the loss of spatial information about the features. Furthermore, due to the use of scalars instead of vectors, properties of features such as orientation, thickness, and skewness are lost. Therefore, Hinton et al. [19] proposed to group neurons together as vectors and use them to represent the features of an object. These vectors are called capsules.

In 2017, Hinton and his team [11] proposed an architecture called capsule network with dynamic routing (CapsNet) that performed better than CNN on the MNIST dataset. It achieved state-of-the-art results on MNIST with a test error of only 0.25%. A CapsNet model that achieved 99.23% accuracy on the expanded MNIST test set was able to reach an accuracy of 79% on the affNIST test set, while a CNN that achieved 99.22% accuracy on the expanded MNIST test set only achieved 66% accuracy on the affNIST test set. This indicates that CapsNet is more robust to affine transformations. We have implemented CapsNet based on the original research paper [11], and the reconstruction network described in that paper shows that CapsNet preserves the properties of the features well, as shown in Fig. 1.

Fig. 1. Original (Top Row) vs Reconstructed (Bottom Row) images.

Following the success of CapsNet on MNIST, studies have been conducted to push the capability of CapsNet. LaLonde and Bagci [20] proposed a novel architecture called SegCaps, which is based on CapsNet, to perform image segmentation. SegCaps outperformed previous state-of-the-art networks on the LUNA16 subset of the LIDC-IDRI database even though it had fewer parameters. In another study [7], a CapsNet model outperformed a CNN model of similar complexity in a human action recognition task on the KTH and UCF-sports datasets.

One of the speculated properties of a CapsNet model is its ability to retain spatial relationships between learned features, unlike a CNN model [9–11]. In other words, the relative positions of features are insignificant to a CNN model. This causes a CNN model to produce false positives, such as labelling an image of a face with eyes below the nose as a face. In contrast to CNN, CapsNet is speculated to be able to avoid such false positives. However, to date, no studies have been conducted to test this speculation. In this paper, we seek to test this speculation to gain deeper insights into CapsNet.



We implemented a CNN model and a CapsNet model for this study. In general, a CNN model consists of several convolutional layers with pooling layers between them, followed by fully-connected layers, as explained in Sect. 3.2. A CapsNet model consists of several convolutional layers without any pooling layers, followed by a primary capsule layer and a digit capsule layer, as explained in Sect. 3.3. Both models were designed to have the same number of layers so that they are comparable.

To test the speculation, we need a dataset with two classes of images that contain the same features but with different spatial arrangements in each class. Training our models directly on such a dataset may not yield any insight, as the models will learn to identify the shape of entire objects instead of the distinct features as intended. Therefore, we prepared two groups of datasets: the first group contains images of only the distinct features, while the second group contains objects formed by composing those features. Our models are first trained on the dataset from Group 1. Once this training is completed, the weights of the convolutional layers in both models are frozen, while the weights of the remaining layers are re-trained on the dataset from Group 2. This ensures that our models learn the distinct features first before learning to identify the objects using the learned features. This strategy is known as transfer learning.

Below we describe in detail the dataset generation, the testing of the convolutional neural network model, and the testing of the capsule network with dynamic routing model. Since our datasets consist only of objects with simple features, relatively simple models should be sufficient to achieve good accuracy in the evaluations.

3.1 Dataset Generation

Our dataset consists of two groups. Figure 2 shows samples from Group 1, which contains images of arrows and non-arrows. Figure 3 shows samples from Group 2, which contains images of equilateral triangles and rectangles. Each image is 64 × 64 pixels. We chose to use generated images because real-life images contain too much ambiguity. Furthermore, simple polygon objects were chosen because they are mathematically well-defined. This enables us to test different ideas on how to design our experiments. Table 1 shows the organization of our datasets.

Fig. 2. Samples from Group 1: (a) Arrows, (b) Non-Arrows

Fig. 3. Samples from Group 2: (a) Triangles, (b) Rectangles

Table 1. Organization of our datasets

Dataset   Description                                                  Subset           Number of images
Group 1   Contains images of arrows and non-arrows                     Training Set 1   500
                                                                       Training Set 2   1000
                                                                       Training Set 3   2000
                                                                       Testing Set      2000
Group 2   Contains images of equilateral triangles and rectangles      Training Set 1   500
                                                                       Training Set 2   1000
                                                                       Training Set 3   2000
                                                                       Testing Set      2000

3.2 Convolutional Neural Network (CNN)

We implemented a CNN model in TensorFlow with 3 convolutional layers and 2 fully-connected layers. A max pooling layer follows each convolutional layer. The Rectified Linear Unit (ReLU) was used as the activation function on every layer except the output layer. Dropout was also applied to prevent the model from overfitting. As described in the methodology above, we first trained the CNN model on the dataset from Group 1. Once the model was trained, we re-trained the weights of the fully-connected layers on the dataset from Group 2 while keeping the weights of the convolutional layers frozen. After each training stage, the trained model was evaluated on the testing set of the respective group.

3.3 Capsule Network with Dynamic Routing (CapsNet)

Our CapsNet model was also implemented in TensorFlow. We implemented 3 convolutional layers, 1 primary capsule layer and 1 digit capsule layer. The architecture is similar to that of the original paper [11], except that we added an extra convolutional layer and used 16-D capsules in the primary capsule layer and 32-D capsules in the digit capsule layer. We used the activation function proposed in that paper. To prevent overfitting, a reconstruction network [11] was used. No pooling layers were used. As with the CNN, we first trained the CapsNet model on the dataset from Group 1. Once the model was trained, we re-trained the weights of the primary capsule layer and the digit capsule layer on the dataset from Group 2 while keeping the weights of the convolutional layers frozen. The trained model was evaluated on the testing set of the respective group after each training stage.
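The activation function proposed in [11] is the squashing non-linearity, which keeps the direction of a capsule vector s and maps its length into the range [0, 1) via v = (|s|^2 / (1 + |s|^2)) · s / |s|. The following is a minimal NumPy sketch of that function, not the TensorFlow code used in our experiments; the function name, the `eps` constant and the sample vectors are assumptions made for illustration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Squashing non-linearity from Sabour et al. [11].

    Keeps the direction of each capsule vector and maps its length into [0, 1).
    `eps` is an assumed small constant that avoids division by zero.
    """
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

# Hypothetical batch of three 16-D capsule outputs with lengths 0.1, 1 and 10.
rng = np.random.default_rng(0)
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)
capsules = np.stack([0.1 * direction, 1.0 * direction, 10.0 * direction])

lengths = np.linalg.norm(squash(capsules), axis=-1)
```

Short vectors are shrunk towards zero and long vectors approach unit length, so the length of a capsule can be read as the probability that the feature it represents is present.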


Experimental Results and Discussion

The training and evaluation of the models were performed on a workstation running Ubuntu 16.04, equipped with 16 GB of RAM, a Core i7-6700K processor, and two NVIDIA GTX 1080 Ti GPUs. The models were trained using the training subsets and evaluated on their respective testing sets. The evaluation results in terms of accuracy (Acc), precision (Prec), recall (Rec) and F1-score (F1) for both models are shown in Table 2 below.

Table 2. (a) Evaluation Results for CapsNet. (b) Evaluation Results for CNN. All values are in %.

(a) CapsNet
Dataset                              Subset     Acc    Prec   Rec    F1
Group 1 (Triangles vs Rectangles)    Subset 1   88.4   88.7   87.2   87.9
Group 1, Subset 2 and Subset 3:      91 90.7 90.4 92.8 87.6 90.1
Group 2 (Arrows vs Non-Arrows)       Subset 1   67.6   70.6   59.7   64.7
                                     Subset 2   77.8   86.6   62.1   72.3
                                     Subset 3   80.4   83.9   75.1   79.3

(b) CNN
Dataset                              Subset     Acc    Prec   Rec    F1
Group 1 (Triangles vs Rectangles)    Subset 1   98.5   99.4   97.1   98.2
                                     Subset 2   99.3   99.5   98.9   99.2
                                     Subset 3   99.6   99.8   99.2   99.6
Group 2, Subset 1 to Subset 3:       87.2 90.9 95.8 95.2 95.8 95.5 96.6 96.8 92.5 94.6
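For reference, the four metrics reported in Table 2 can be computed from the counts of a binary confusion matrix. The sketch below assumes the standard definitions; the function name and the counts are hypothetical and are not results from our experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 (as fractions) from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a 2000-image testing set (not taken from the paper).
acc, prec, rec, f1 = classification_metrics(tp=900, fp=100, fn=100, tn=900)

# Consistency check on one reported row: precision 70.6% and recall 59.7%
# give an F1-score of about 64.7%.
f1_check = 2 * 0.706 * 0.597 / (0.706 + 0.597)
```

Because F1 is the harmonic mean of precision and recall, a large gap between the two pulls F1 towards the lower value.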



All the images were shuffled within their respective sets and normalized before being used for training and evaluation. From Table 2(a), it is evident that CapsNet is able to determine whether a given image contains an arrow or a non-arrow by computing the spatial relationship between the learned features. Table 2(b) shows that CNN achieved near-perfect accuracies. This is likely because the generated datasets do not contain any real-world noise.

Based on the speculation stated earlier, we expected the CNN model to perform worse than CapsNet, but the results show that CNN actually performed better. The use of pooling layers between the convolutional layers should cause a loss of spatial information about the features in a CNN, so we expected our CNN model to perform poorly to at least some degree because of its 3 pooling layers. This is not the case. We chose a CNN model with only 3 pooling layers because of the simplicity of the datasets, and it may be that this model is not deep enough for the loss to matter. The results indicate that retaining the spatial relationship between features is not a serious issue for a relatively shallow model such as one with only 3 pooling layers. However, it remains an open question whether a deeper CNN model would perform well on a more complex dataset.

In our experiment, the objects in the images are formed by composing simple features. There is only one equilateral triangle and one rectangle in every image. Given the success of CNN, identifying such simple generated objects without real-world noise is a rather trivial task for CNN. This could be another reason for the high accuracy that the CNN models achieved in this experiment despite the use of pooling layers. Our implementations are publicly available on GitHub.


Conclusions and Future Work

In this work, we have designed an experiment to test the speculation that CapsNet is able to retain the spatial relationship between features better than CNN. To carry out the experiment, we generated our own datasets. In our results, both the shallow CNN and CapsNet models showed the capability to retain the spatial relationship between features. However, the speculation that CapsNet retains the spatial relationship between features better than CNN does not seem to hold for shallow models on simple datasets. It is still uncertain whether this speculation holds for deeper models on more complex or noisy datasets. Considering that CNN has been developed extensively since its invention in 1989 [12], it is possible that our experiment was too simple for CNN. CapsNet, on the other hand, is still at a rudimentary stage, and the fact that its performance level is close to that of CNN in this experiment suggests that CapsNet has great potential. Future research in this area should consider using more complex features to represent the objects in the datasets, as well as deeper models, in order to further understand the capabilities and limitations of these models. Gaining deeper insights into how these models retain the spatial relationship between features will better guide future developments.

Acknowledgment. The authors are grateful to the Ministry of Higher Education, Malaysia and Multimedia University for the financial support provided by the Fundamental Research Grant Scheme (MMUE/150030) and MMU Internal Grant Scheme (MMUI/170110).

References

1. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
2. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
3. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
4. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
5. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
6. Nair, P., Doshi, R., Keselj, S.: Pushing the limits of capsule networks. Technical note (2018)
7. Algamdi, A.M., Sanchez, V., Li, C.T.: Learning temporal information from spatial information using CapsNets for human action recognition. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3867–3871 (2019)
8. Xi, E., Bing, S., Jin, Y.: Capsule network performance on complex data. arXiv preprint arXiv:1712.03480 (2017)
9. Xiang, C., Zhang, L., Tang, Y., Zou, W., Xu, C.: MS-CapsNet: a novel multi-scale capsule network. IEEE Signal Process. Lett. 25(12), 1850–1854 (2018)
10. Chidester, B., Do, M.N., Ma, J.: Rotation equivariance and invariance in convolutional neural networks. arXiv preprint arXiv:1805.12301 (2018)
11. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: Advances in Neural Information Processing Systems, pp. 3856–3866 (2017)
12. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)
13. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition (2009)
14. Palaz, D., Magimai-Doss, M., Collobert, R.: Analysis of CNN-based speech recognition system using raw speech as input. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)
15. Zhang, C., Liu, W., Ma, H., Fu, H.: Siamese neural network based gait recognition for human identification. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2832–2836 (2016)
16. Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Zhang, X.: End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016)
17. Tzelepi, M., Tefas, A.: Human crowd detection for drone flight safety using convolutional neural networks. In: 25th European Signal Processing Conference (EUSIPCO), pp. 743–747. IEEE (2017)
18. Milletari, F., Navab, N., Ahmadi, S.A.: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: IEEE Fourth International Conference on 3D Vision (3DV), pp. 565–571 (2016)
19. Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: Proceedings of the 21st International Conference on Artificial Neural Networks, Part I, pp. 44–51 (2011)
20. LaLonde, R., Bagci, U.: Capsules for object segmentation. arXiv preprint arXiv:1804.04241 (2018)

Improved 2D Human Pose Tracking Using Optical Flow Analysis

Aleksander Khelvas (1), Alexander Gilya-Zetinov (1), Egor Konyagin (2), Darya Demyanova (1), Pavel Sorokin (1), and Roman Khafizov (1)

(1) Moscow Institute of Physics and Technologies, Dolgoprudnii, Russian Federation, [email protected]
(2) Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, Moscow, Russian Federation

Abstract. In this paper, we propose a novel human body pose refinement method that relies on an existing single-frame pose detector and uses an optical flow algorithm in order to increase quality of output trajectories. First, a pose estimation algorithm such as OpenPose is applied and the error of keypoint position measurement is calculated. Then, the velocity of each keypoint in frame coordinate space is estimated by an optical flow algorithm, and results are merged through a Kalman filter. The resulting trajectories for a set of experimental videos were calculated and evaluated by metrics, which showed a positive impact of optical flow velocity estimations. Our algorithm may be used as a preliminary step to further joint trajectory processing, such as action recognition.

Keywords: Video processing · Human pose detection · Skeleton motion


Human motion tracking is an important application of machine vision algorithms that can be used for many business purposes. The most popular tasks include distributed video surveillance systems, solutions for digital marketing, and solutions for human tracking in an industrial environment. This task can be solved at different levels of detail. The high-level approach is object detection, in which the position of a human as a whole object is extracted and its bounding box in 2D or 3D space is estimated. A more interesting approach is to detect a human pose in motion. This task is more complicated because a human pose has substantially more dimensions than a bounding box.

Recent advances in deep learning have resulted in efficient single-frame pose tracking algorithms, such as [6,14]. By applying them sequentially to a video stream, a set of trajectories for joints may be obtained. However, since these algorithms usually analyze input frames independently, the obtained trajectories usually have various artifacts, such as discontinuities or missing points. In the reported research, we solve the task of enhancing the obtained joint trajectories for multiple persons in a scene by leveraging temporal information using an optical flow algorithm.

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 10–22, 2021.


Related Work

The task of retrieving pose dynamics for all persons in a video may be considered a variant of the multiple object tracking (MOT) task, where the considered objects are not persons but individual pose keypoints. There are two major paradigms in the field of MOT: detection-based tracking and detection-free tracking [11]. In the first case, a machine vision algorithm capable of detecting individual objects is applied to every frame separately, and the individual detections are then linked into trajectories. The second approach has no detection algorithm and instead relies on temporal changes in the video stream to detect objects. With the development of efficient real-time object detection algorithms in recent years, the detection-based approach has become dominant in the literature. However, independent analysis of video frames inevitably loses the information conveyed by temporal changes in the video. This information may be relevant to object detection and could help improve tracker performance.

Various approaches have been suggested to combine individual-frame and temporal features. For example, [12] proposed combining temporal and spatial features by adding a recurrent temporal component to a convolutional neural network (CNN) designed to detect objects in a single frame. The outputs of the object detection network for sequential frames were fed into a recurrent neural network (RNN), and the resulting architecture was trained to predict refined tracking locations. In [1], a tracker using prior information about possible person pose dynamics is proposed. This information is modelled as a hierarchical Gaussian process latent variable model and makes it possible to impose some temporal coherency on the detected articulations. In [17], a method leveraging optical flow for pose tracking is proposed. The velocities obtained from flow data are used to generate the expected coordinates of a pose in the next frame. The predicted coordinates are later used to form tracks by greedy matching.

Our research is based on OpenPose, proposed in [3], as the body pose detector. It is a real-time solution capable of detecting the 2D poses of multiple people in an image. It uses a non-parametric representation, referred to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. This bottom-up system achieves high accuracy and real-time performance regardless of the number of people in the image.





First, let us define several frames of reference (FoR) for our research, which are shown in Fig. 1.

Fig. 1. Frames of references for 2D skeletons parameters calculation

U, V: this frame of reference is associated with a virtual or real motionless camera.
Ucf, Vcf: this frame of reference is associated with frames in the video. If the camera is motionless, this FoR is the same for all frames in the video. This is the common case for video surveillance systems used for security or marketing.
Upfk, Vpfk: this frame of reference is associated with object k detected in frame f. We will not use the index f when processing video of scenes viewed by a motionless camera.



Our goal is to propose a novel algorithm for robust tracking of multiple person poses in the video stream by leveraging both temporal and spatial features of the data. To achieve this, we combine predictions done by a single-frame person/pose detection algorithm (such as OpenPose and YOLO) with Optical Flow - based estimations through a Kalman filter. The complete algorithm is described below and shown in Fig. 2.



Fig. 2. Full algorithm for 2D skeleton model calculation and filtration

1. Video preliminary processing step produces a set of frames with normalized brightness/contrast and calculates Gf values.
2. Object detection step provides a set of bounding boxes for each person detected by YOLO or some other object detection algorithm.
3. Pose detection ROI generation step provides a set of input frame regions for further pose detection.
4. 2D pose estimation and person identification step computes a set of vectors $\vec{B}^{fp} = \{u_1^{fp}, v_1^{fp}, u_2^{fp}, v_2^{fp}, \ldots, u_N^{fp}, v_N^{fp}\}$, where N = 25 is the number of joints for the selected model of the human body (the BODY-25 model provided by the OpenPose solution).
5. Optical flow calculation step applies an optical flow estimation algorithm to the input frame, producing pixel velocity vectors for every joint position returned by the pose detector.
6. Kalman filtration step calculates the time series of filtered movement vectors $\hat{\vec{B}}^{fp} = \{\hat{u}_1^{fp}, \hat{v}_1^{fp}, \hat{u}_2^{fp}, \hat{v}_2^{fp}, \ldots, \hat{u}_N^{fp}, \hat{v}_N^{fp}\}$.

Let us discuss each step in detail. After performing the necessary source-specific video pre-processing, the next step is to extract poses from single video frames where possible. We selected an OpenPose-based solution as the human pose detector. OpenPose is a multi-person real-time pose keypoint detection algorithm, initially presented in [3]. An open-source implementation of OpenPose is used, providing pose keypoints in the BODY-25 format. A reduced model example with 18 keypoints is shown in Fig. 3.

Fig. 3. Keypoints of a human pose model used in OpenPose (CMU-PerceptualComputing-Lab, 2017)

However, direct application of the OpenPose library to a high-resolution 4K video stream would not work. Since the algorithm's memory requirements grow linearly with the input image area, and about 3 GB is consumed at the default resolution of 656 × 656, a distributed video processing system would be needed. Downscaling the input image results in a drastic loss of detected pose quality. We solve this problem by splitting the image into a set of overlapping regions and invoking the detector on these regions in parallel, combining the detection results afterwards.

We can substantially boost the algorithm's efficiency if the crowd in the video is sparse, which is often the case for video surveillance systems. Instead of processing the whole input frame, we employ an object detection algorithm to detect persons first and then build a set of regions that cover all persons' bounding boxes. We used a YOLO (You Only Look Once)-based solution, as it performs fast detection of objects in 4K images [13]. It was observed that YOLO object detections are also useful for eliminating false positives generated by OpenPose.

For every frame in the input frame sequence, the algorithm first applies a YOLO-based object detector trained on the COCO dataset [10]. The result of YOLO processing is a set of bounding boxes, with top left and bottom right corners defined in the Ocf, Ucf, Vcf FoR. A set of regions fully covering these rectangles is generated, with a resolution matching the selected input resolution of the OpenPose network. The result of video processing in the detection stage is a list of persons' bounding boxes for each frame. For each detected object we have the bounding box coordinates u1, v1, u2, v2, the detection confidence, and, if a skeleton was successfully matched with the YOLO object, the OpenPose keypoint data (the vector $\vec{B}^{fp}$). Additionally, we calculate an approximate standard deviation of the coordinates of each keypoint by integrating over the part heatmaps returned by OpenPose. These values are later used as input for the Kalman filter as a measurement error estimate.

To further refine poses extracted from single frames, the algorithm uses an optical flow solution. Optical flow is a technique used in various fields of computer vision to find the displacement of individual pixels between two video frames. It is usually expressed as a 2D vector field $\{v_{flow} = (dx, dy)^T\}$ defined for every pixel of the initial frame $I_n(x, y)$. The corresponding pixel in the next frame is $I_{n+m}(x + dx, y + dy)$. Many different approaches to optical flow calculation have been proposed in the literature. In our work, we use several open-source optical flow implementations provided by the OpenCV library. The first one, called Dense Inverse Search, is presented in [7]. It belongs to the family of 'dense optical flow' algorithms and is notable for its low computational complexity while preserving good scores in standard optical flow benchmarks. The other one is called DeepFlow and was presented in [15]. An example soccer game frame used to visualize the optical flow calculated by the two algorithms is presented in Fig. 4.

Fig. 4. The example of used soccer game frame for optical flow visualization calculated by two different algorithms

For Optical Flow visualization we use the HSV model. Hue represents the motion vector angle and saturation encodes the motion vector length.
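The HSV encoding described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code used to produce the figures: the function name is hypothetical, the saturation is normalized by the largest vector length in the frame, and the value channel is fixed at 1 because the text does not specify it.

```python
import numpy as np

def flow_to_hsv(flow):
    """Map a dense flow field of shape (H, W, 2) to an HSV image with channels in [0, 1].

    Hue encodes the motion vector angle and saturation encodes its length,
    normalized by the largest length in the frame. Value is fixed at 1
    (an assumption; the value channel is not specified in the text).
    """
    dx, dy = flow[..., 0], flow[..., 1]
    angle = np.arctan2(dy, dx)                 # in [-pi, pi]
    hue = (angle + np.pi) / (2.0 * np.pi)      # rescaled to [0, 1]
    length = np.hypot(dx, dy)
    max_len = length.max()
    saturation = length / max_len if max_len > 0 else np.zeros_like(length)
    value = np.ones_like(length)
    return np.stack([hue, saturation, value], axis=-1)

# Hypothetical 2x2 flow field: one static pixel and three moving ones.
flow = np.array([[[0.0, 0.0], [1.0, 0.0]],
                 [[0.0, 2.0], [-3.0, 0.0]]])
hsv = flow_to_hsv(flow)
```

Static pixels get zero saturation and therefore appear white, while the fastest pixels in the frame are fully saturated.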



The result of Dense-Inverse-Search algorithm for Optical Flow calculation is presented in Fig. 5.

Fig. 5. Example of optical flow, calculated by Dense-Inverse-Search algorithm

The result of DeepFlow algorithm for Optical Flow calculation is presented in Fig. 6.

Fig. 6. Example of optical flow, calculated by DeepFlow algorithm

By applying the selected algorithm to every video frame in the input stream and taking the pixel velocity estimates at the keypoints generated by OpenPose, we obtain a new velocity measurement. We also need to build trajectories from individual detections in order to match detected poses belonging to the same person in different frames [2,9]. To combine the pose keypoint measurements generated by OpenPose with the corresponding pixel velocities estimated through optical flow, we use a Kalman filter.
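Taking pixel velocity estimates at keypoint positions can be sketched as a lookup into a dense flow field. The sketch below assumes the flow is already available as an array of shape (H, W, 2), for example from a dense optical flow routine, and uses a nearest-pixel lookup; the function name, the `frame_dt` parameter and the sample data are hypothetical.

```python
import numpy as np

def keypoint_velocities(flow, keypoints, frame_dt=1.0):
    """Read pixel velocities from a dense flow field at keypoint positions.

    flow      : array of shape (H, W, 2) with per-pixel displacement (dx, dy)
    keypoints : array of shape (N, 2) with (u, v) pixel coordinates
    frame_dt  : time between the two frames (assumed unit; 1.0 means per frame)
    """
    h, w = flow.shape[:2]
    # Nearest-pixel lookup, clipped to the image (an assumed, simple choice).
    cols = np.clip(np.rint(keypoints[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(keypoints[:, 1]).astype(int), 0, h - 1)
    return flow[rows, cols] / frame_dt

# Hypothetical flow field in which every pixel moves 2 px right and 1 px down.
flow = np.zeros((4, 6, 2))
flow[..., 0] = 2.0
flow[..., 1] = 1.0
keypoints = np.array([[0.4, 0.2], [5.0, 3.0], [7.5, -1.0]])  # last one lies outside
velocities = keypoint_velocities(flow, keypoints)
```

Averaging the flow over a small window around each keypoint would be a natural alternative to the single-pixel lookup when the flow field is noisy.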



The Kalman filter was first proposed in 1960 [5] and has become the de facto industry standard in tasks related to the fusion of measurements performed by sensors of different types. Its application requires the specification of a motion model for the modeled object. There are several common motion models used when the model of real motion is difficult or impossible to formalize [8]. Popular choices for the 2D case include a constant Cartesian velocity model and a polar velocity non-linear model employing an extended Kalman filter [2]. Alternative and more complex models can be implemented when 3D pose information is available, but their assessment lies beyond the scope of this work.

In our experiments we used the constant Cartesian velocity model applied independently to joint coordinates in the 2D video frame FoR. In this instance, the state vector consists of 4 components representing the estimated position and velocity of a pose keypoint: $s(t) = (\hat{u}_t, \hat{v}_t, \dot{\hat{u}}_t, \dot{\hat{v}}_t)^T$. The state at the moment t + 1 is linked to the state values at the moment t by the following equation:

\[
s(t+1) =
\begin{pmatrix} 1 & 0 & \delta t & 0 \\ 0 & 1 & 0 & \delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \hat{u}_t \\ \hat{v}_t \\ \dot{\hat{u}}_t \\ \dot{\hat{v}}_t \end{pmatrix}
+
\begin{pmatrix} 0.5\,\delta t^2 & 0 \\ 0 & 0.5\,\delta t^2 \\ \delta t & 0 \\ 0 & \delta t \end{pmatrix}
\begin{pmatrix} \nu_x(t) \\ \nu_y(t) \end{pmatrix}
\tag{1}
\]

and the process noise covariance matrix:

\[
\begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \bar{a}_u & 0 \\ 0 & 0 & 0 & \bar{a}_v \end{pmatrix}
\]
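A minimal NumPy sketch of a Kalman filter with the constant Cartesian velocity model is given below. It uses the transition matrix of Eq. (1) and the diagonal process noise covariance stated above. Several parts are assumptions made for illustration only: the measurement matrix `H` (position taken from the pose detector and velocity from optical flow, both observed directly), the measurement variances in `R`, the initial state and covariance, and the noise-free synthetic trajectory.

```python
import numpy as np

def make_model(dt, accel_var_u, accel_var_v):
    """Constant Cartesian velocity model for the state (u, v, du/dt, dv/dt)."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    # Process noise covariance as given in the text: diag(0, 0, a_u, a_v).
    Q = np.diag([0.0, 0.0, accel_var_u, accel_var_v])
    return F, Q

def predict(s, P, F, Q):
    """Kalman prediction step: propagate the state and its covariance."""
    return F @ s, F @ P @ F.T + Q

def update(s, P, z, H, R):
    """Standard Kalman measurement update."""
    y = z - H @ s                        # innovation
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    return s + K @ y, (np.eye(len(s)) - K @ H) @ P

F, Q = make_model(dt=1.0, accel_var_u=0.5, accel_var_v=0.5)
H = np.eye(4)                            # assumed: position and velocity both measured
R = np.diag([4.0, 4.0, 1.0, 1.0])        # hypothetical measurement variances

s = np.zeros(4)                          # assumed initial state
P = np.eye(4) * 100.0                    # assumed large initial uncertainty
true_velocity = np.array([2.0, -1.0])
for t in range(1, 21):
    s, P = predict(s, P, F, Q)
    # Noise-free synthetic measurement of a keypoint moving at constant velocity.
    z = np.concatenate([true_velocity * t, true_velocity])
    s, P = update(s, P, z, H, R)
```

Running the filter without optical flow corresponds to dropping the two velocity rows from `H`, `R` and `z`, so that only the detected position is used in the update.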


For every experiment configuration, two independent trajectory estimation passes were performed - with and without optical flow velocity measurements.



To evaluate the performance of the proposed solution for different applications, we prepared a set of video fragments:

1. a fragment of a 4K soccer video broadcast
2. a video from an indoor surveillance system in a supermarket
3. a video from an outdoor surveillance system

The soccer match video had an additional pre-processing step: the fan stands were cut. Figure 7 presents skeletons for the soccer and supermarket cases. Examples of filtered trajectories are presented in Figs. 8 and 9.

Fig. 7. Obtained execution results for the soccer and indoor scenes. Yellow rectangles correspond to different OpenPose calls, green rectangles are objects detected by YOLO.

Fig. 8. (a) Detected and filtered U coordinate trajectories of an ankle of a walking person on an outdoor CCTV camera. (b) Detected and filtered U coordinate trajectories of a knee of a walking person on an outdoor CCTV camera.

Figure 8(a) presents detected and filtered U coordinate trajectories of an ankle of a walking person captured by an outdoor CCTV camera. The periodically increased difference between the filtered and detected values is caused by poor Kalman model prediction of footsteps. Figure 8(b) presents detected and filtered U coordinate trajectories of a knee of a walking person captured by an outdoor CCTV camera. Short bursts of noise are caused by OpenPose detection errors occurring due to overlapping with other limbs. Figure 9(a) presents detected and filtered U coordinate trajectories of a wrist of a walking person with partial visibility. Rough line sections are intervals of time with missing OpenPose detections, resulting in constant velocity for the filtered trajectories. Figure 9(b) presents detected and filtered U coordinate trajectories of an elbow of a soccer player.



Fig. 9. (a) Detected and filtered U coordinate trajectories of a wrist of a walking person with partial visibility. (b) Detected and filtered U coordinate trajectories of an elbow of a soccer player.

It should be noted that the filter using optical flow measurements demonstrates more stable behavior in parts of the video with frequent OpenPose detection errors. To evaluate the improvement, we introduced a quality metric. To confirm the validity of a Kalman filter application in the absence of known ground truth, a typical approach is to extract standardized residuals and check whether they follow a normal distribution with zero mean and constant variance [4]. However, in our case the Kalman model is highly approximate, and this method cannot be used for robust estimation of the impact of velocity measurements. Instead, we calculate a simpler metric that comes from a rather intuitive idea: the predicted values provided by a 'better' filter should, on average, differ less from the actual measured values. This may not hold if the prediction error correlates with the measurement error at the next step, but in our case the studied velocity measurement has a different source of errors than the measurements provided by a single-frame detector. With these assumptions, we calculated the average squared difference between predicted and measured coordinates for the Kalman filter with and without optical flow velocity measurements:

\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left| \vec{B}^{fp} - \hat{\vec{B}}^{fp} \right|^2
\]

where the Kalman output prediction is taken before adjusting for the measurement at that step. The sum is taken over all detected keypoints in the video with an associated trajectory. The results for different types of video are presented in Table 1. It may be concluded that both studied algorithms have a similar positive impact on the error, with Dense Inverse Search being slightly more effective.
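The quality metric above can be sketched in a few lines of NumPy. The function name and the sample coordinates are hypothetical; the inputs are assumed to be the filter predictions taken before the measurement update and the corresponding measured keypoint coordinates.

```python
import numpy as np

def mean_squared_prediction_error(predicted, measured):
    """Average squared distance between predicted and measured keypoints.

    predicted, measured : arrays of shape (N, 2) holding (u, v) coordinates for
    the N detected keypoints that have an associated trajectory. `predicted`
    holds the filter prediction taken before the measurement update.
    """
    diff = np.asarray(measured, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Hypothetical values for three keypoints (not data from the paper).
measured = np.array([[10.0, 5.0], [20.0, 8.0], [31.0, 12.0]])
predicted = np.array([[9.0, 5.0], [20.0, 10.0], [28.0, 16.0]])
sigma2 = mean_squared_prediction_error(predicted, measured)
```

Computing this value once for the filter with optical flow input and once without it gives the two numbers that are compared for each video.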


Table 1. Results of quality metrics in the absence of ground truth

Algorithm                                σ², soccer    σ², outdoor    σ², PoseTrack
OpenPose + Kalman Filter
OpenPose + Kalman Filter + DIS
OpenPose + Kalman Filter + DeepFlow      9.06



To further validate the performance of the proposed method, we used annotated videos from the public pose tracking dataset PoseTrack2018 [16]. This dataset provides means to quantify the performance of algorithms in two different tasks: multi-person pose estimation, through the mean Average Precision (mAP) metric, and multiple object tracking, through the Multiple Object Tracking Accuracy (MOTA) metric. Effective multiple object tracking methods usually rely on global trajectory analysis, which is necessary to process videos with crowds and a large amount of occlusion. Since our method uses only local temporal data and does not aim to improve results in these cases, we used the mAP metric only, solving a multi-person multi-frame pose estimation (but not tracking) task. For the calculation of PCKh and mAP values, we used the code provided in the evaluation repository referenced by the official PoseTrack site. The results for different algorithms are provided in Table 2.

Table 2. Results obtained on the PoseTrack dataset

Algorithm    Precision    Recall
No filtering
Kalman Filter
Kalman Filter + DeepFlow 79.5% 58.9% 52.7%
Kalman Filter + DIS

The OpenPose network in our case was not trained on the PoseTrack dataset, so the results should not be compared to its public leaderboard. In addition, the dataset used a different pose model with 18 keypoints per pose, which could have had a negative impact on the metric values. Also, the mAP metric itself is not well suited to estimating filtering quality, since it classifies each predicted keypoint in a binary way, checking whether or not it is closer to the ground truth than some threshold. Nonetheless, there is an improvement in the experiments that used Optical Flow as an additional Kalman input. The performance degradation of the filter without velocity measurements may be attributed to a lag due to trajectory smoothing, which was partially compensated by optical flow. It was also observed that tracking errors cause notable disruptions in filtered trajectories that are absent in the unfiltered case, which is another source of errors.
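The binary nature of threshold-based keypoint metrics can be seen in a small sketch. The function below is a hypothetical illustration of a PCKh-style score (the name `pckh` and the toy coordinates are assumptions, and this is not the PoseTrack evaluation code): a keypoint counts as correct when its distance to the ground truth is within a fraction `alpha` of the head size, so a prediction that moves closer to the ground truth without crossing the threshold leaves the score unchanged.

```python
import numpy as np


def pckh(predicted, ground_truth, head_sizes, alpha=0.5):
    """Fraction of keypoints lying within alpha * head size of the ground truth."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    head_sizes = np.asarray(head_sizes, dtype=float)
    distances = np.linalg.norm(predicted - ground_truth, axis=-1)
    return float(np.mean(distances <= alpha * head_sizes))


ground_truth = [[3.0, 4.0], [10.0, 1.0]]
head_sizes = [8.0, 8.0]                      # threshold is 4 pixels per keypoint

# First keypoint is 5 px away (wrong), second is 1 px away (correct).
raw = pckh([[0.0, 0.0], [10.0, 0.0]], ground_truth, head_sizes)
# Filtering halves the error of the second keypoint, but the score is unchanged.
refined = pckh([[0.0, 0.0], [10.0, 0.5]], ground_truth, head_sizes)
print(raw, refined)
```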




Discussion and Conclusions

In this paper, we presented a method for refining the output of an arbitrary single-frame multi-person pose detection algorithm by combining it with an optical flow-based velocity estimator. We showed with openly available implementations that it is possible to improve pose estimation quality through Kalman filtering, both with annotated and unannotated data. However, the improvement achieved is relatively minor. We speculate that it may be developed further, taking the following points into account.

The camera-space constant-velocity 2D Kalman model we used is one of the simplest models possible. Some of its limitations were shown in the results section: for example, it produces poor results at moments when limb acceleration is higher than usual, such as feet stepping on the ground or a ball strike. It also does not take into account kinematic restrictions pertaining to humans, such as limb length and angle restrictions. Replacing it with a more precise human body model may improve Kalman filter performance and the overall result.

There is a large number of modern optical flow algorithms, including recent ones based on trainable convolutional networks, which were not tested. Their parameters may also be fine-tuned to better capture individual limb movements. The optical flow algorithms we used were observed to predict human movement as that of a whole object in some cases, depending on parameters, environment and image quality.

The filter performance in our experiments greatly depends on the object tracker, since Kalman filtering of wrong tracks tends to magnify errors even more. While missing limb detections for several frames can be handled well by the Kalman filter, detection and tracking artifacts that affect the whole pose (e.g. a track swap during occlusion, or merging of several poses into one) usually disturb the output trajectory to a significant degree. For this reason, the method can be further improved by using Kalman predictions earlier, at the pose generation step (e.g. when building a pose from heatmaps and PAFs in the case of the OpenPose detector).

In the future, we plan to validate the algorithm performance further using available open source pose tracking datasets and to compare different optical flow algorithms. We also intend to validate the performance of more complex Kalman models.

Acknowledgment. The authors thank XIMEA corp. CEO Max Larin for providing XIMEA cameras during the summer of 2019 for data collection and experiments with the adaptive video surveillance system. We would also like to thank the MRTech directors Igor Dvoretskiy and Aleksandr Kiselev for the numerous and extensive discussions about video system architecture and video processing solutions, and Fyodor Serzhenko from FastVideo corp. for comments about GPU usage for on-board video processing. Boris Karapetyan provided us with the skeleton visualization software components. The reported study was funded by the RFBR, project number 19-29-09090.



References

1. Andriluka, M., Roth, S., Schiele, B.: People-tracking-by-detection and people-detection-by-tracking. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
2. Antonucci, A., Magnago, V., Palopoli, L., Fontanelli, D.: Performance assessment of a people tracker for social robots. In: 2019 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), pp. 1–6. IEEE (2019)
3. Cao, Z., Hidalgo, G., Simon, T., Wei, S.E., Sheikh, Y.: OpenPose: realtime multi-person 2D pose estimation using part affinity fields. arXiv:1812.08008 (2018)
4. Jalles, J.T.: Structural time series models and the Kalman filter: a concise review (2009)
5. Kalman, R.E., et al.: A new approach to linear filtering and prediction problems. J. Basic Eng. 82(1), 35–45 (1960)
6. Kocabas, M., Karagoz, S., Akbas, E.: MultiPoseNet: fast multi-person pose estimation using pose residual network. In: Proceedings of the 15th European Conference, Munich, Germany, 8–14 September 2018, Part XI, pp. 437–453 (2018). https://doi.org/10.1007/978-3-030-01252-6_26
7. Kroeger, T., Timofte, R., Dai, D., Van Gool, L.: Fast optical flow using dense inverse search. In: European Conference on Computer Vision, pp. 471–488. Springer (2016)
8. Li, X.R., Jilkov, V.P.: Survey of maneuvering target tracking. Part I. Dynamic models. IEEE Trans. Aerosp. Electron. Syst. 39(4), 1333–1364 (2003)
9. Lin, L., Lu, Y., Pan, Y., Chen, X.: Integrating graph partitioning and matching for trajectory analysis in video surveillance. IEEE Trans. Image Process. 21(12), 4844–4857 (2012)
10. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: European Conference on Computer Vision, pp. 740–755. Springer (2014)
11. Luo, W., Xing, J., Milan, A., Zhang, X., Liu, W., Zhao, X., Kim, T.K.: Multiple object tracking: a literature review. arXiv preprint arXiv:1409.7618 (2014)
12. Ning, G., Zhang, Z., Huang, C., Ren, X., Wang, H., Cai, C., He, Z.: Spatially supervised recurrent convolutional neural networks for visual object tracking. In: 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–4. IEEE (2017)
13. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
14. Xiu, Y., Li, J., Wang, H., Fang, Y., Lu, C.: Pose flow: efficient online pose tracking. In: British Machine Vision Conference (BMVC). arXiv:1802.00977 (2018)
15. Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: large displacement optical flow with deep matching. In: Proceedings of the IEEE International Conference on Computer Vision 2013, pp. 1385–1392 (2013)
16. Andriluka, M., Iqbal, U., Insafutdinov, E., Pishchulin, L., Milan, A., Gall, J., Schiele, B.: PoseTrack: a benchmark for human pose estimation and tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5167–5176 (2018)
17. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481 (2018)

Transferability of Fast Gradient Sign Method

Tamás Muncsan and Attila Kiss

Department of Information Systems, Eötvös Loránd Tudományegyetem, Budapest, Hungary

Abstract. Image classification in computer vision is a process which classifies an image depending on its content. While classifying an object is trivial for humans, robust image classification is still a challenge in computer vision applications. The robustness of such image classification models in real world applications is a major concern. Adversarial examples are specialized inputs created with the purpose of confusing a classifier, resulting in the misclassification of a given input. Some of these adversarial examples are indistinguishable to humans, but the classifier can still be tricked into outputting the wrong class. In some cases adversarial examples can be transferred: an adversarial example crafted on a target model fools another model too. In this paper we evaluate the transferability of adversarial examples crafted with Fast Gradient Sign Method across models available in the open source Tensorflow machine learning platform (using ResNetV2, DenseNet, MobileNetV2 and InceptionV3).

Keywords: Fast Gradient Sign Method · Adversarial attack · Adversarial example transferability · Tensorflow · Image classification



A Convolutional Neural Network (CNN) is a powerful artificial neural network technique. These networks can be used in different fields of computer science: recommender systems, natural language processing, computer vision. In the case of image classification, a CNN is trained on a large dataset of images. CNNs extract features from images while training on the dataset. After training, CNNs can detect features with the help of multiple hidden convolutional layers. With each layer, the complexity of the features is increased. Image classification in computer vision is a process which classifies an image depending on its content. In general, a CNN takes an image as input, creates a feature map, uses a ReLU function to increase non-linearity, applies a pooling layer to each feature map, flattens the pooled images into one long vector and feeds this vector into a fully connected artificial neural network which outputs the scores of the classes.

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 23–34, 2021.



According to [1], state-of-the-art machine learning models can be manipulated by adversarial examples. Such examples are inputs intentionally crafted to cause the image classification model to produce an incorrect output. Generally, machine learning models accept inputs as numeric vectors. Adversarial attacks are algorithms which produce specific inputs that are able to get the wrong result from the model. Adversarial attacks can be categorized into two groups based on their knowledge of the target model. An attack which uses information about the architecture, design or implementation of the model is called a white-box attack. In the case of black-box attacks, no information is given about the target model. Based on the model's output, an adversarial attack can be targeted or non-targeted. A targeted attack aims to get a particular classification output for the crafted input, whereas a non-targeted attack aims to get an incorrect output regardless of the classification. In this work we test the transferability of the Fast Gradient Sign Method [1]. We attempt this on four different open source models pre-trained on the ImageNet dataset: InceptionV3, MobileNetV2, ResNet50V2 and DenseNet201.


Related Work

In the field of adversarial robustness of machine learning models there are two well-known frameworks: IBM’s Adversarial Robustness Toolbox and Google’s Cleverhans. These frameworks were not used in our experiment, yet they were a useful reference for building our implementation of the image classifier and the FGSM attack.

2.1 Adversarial Robustness Toolbox (ART)

IBM’s Adversarial Robustness Toolbox [2] is open-source software written in Python that focuses on adversarial machine learning. It supports multiple classifiers from different popular machine learning libraries such as Tensorflow, Keras and PyTorch. This library provides implementations of adversarial defences and attacks. The library is built upon a modular architecture, which we used as a reference. It consists of modules such as classifiers, attacks, defences, detection and metrics. The classifiers module can be used for the integration of machine learning models of various libraries. The attacks module contains the implementation of multiple adversarial attacks. At the time of this writing, there are 21 adversarial attacks implemented in this library.

FGSM in ART. The implementation of Fast Gradient Sign Method extends the attack to other norms; therefore, in the library it is called Fast Gradient Method. In this extension of the attack, a minimum perturbation is determined for which the class of the generated adversarial example is not equal to the class of the original image. This extended attack works with two extra parameters: $\epsilon_{step}$ and $\epsilon_{max}$. It consecutively performs the traditional FGSM attack with strength $\epsilon = k \cdot \epsilon_{step}$ for $k = 1, 2, 3, \ldots$ until the attack is successful. If $k \cdot \epsilon_{step} > \epsilon_{max}$, the attack has failed. In the traditional FGSM attack, by contrast, the degree of the perturbation (the value of ε) is fixed, and for a small perturbation the attack can be unsuccessful.
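The search for a minimum perturbation described above can be sketched as a simple loop. This is an illustrative sketch only and does not reproduce ART's API: the function name `minimal_perturbation_attack` and the callables `attack` and `is_adversarial` are hypothetical stand-ins for a fixed-strength attack and a misclassification check.

```python
def minimal_perturbation_attack(x, attack, is_adversarial, eps_step, eps_max):
    """Try eps = k * eps_step for k = 1, 2, 3, ... while eps <= eps_max.

    attack(x, eps)     -- hypothetical callable returning a perturbed input
    is_adversarial(x)  -- hypothetical callable, True if x is misclassified
    Returns (eps, adversarial_example) on success, or None if the attack failed.
    """
    k = 1
    while k * eps_step <= eps_max:
        eps = k * eps_step
        candidate = attack(x, eps)
        if is_adversarial(candidate):
            return eps, candidate
        k += 1
    return None


# Toy stand-ins: the "attack" shifts a scalar input, and the "classifier"
# changes its decision once the input exceeds 0.25.
def toy_attack(x, eps):
    return x + eps


def toy_is_adversarial(x):
    return x > 0.25


result = minimal_perturbation_attack(0.0, toy_attack, toy_is_adversarial,
                                     eps_step=0.1, eps_max=1.0)
failed = minimal_perturbation_attack(0.0, toy_attack, toy_is_adversarial,
                                     eps_step=0.1, eps_max=0.2)
print(result, failed)
```

The first call stops at the third step, the smallest multiple of `eps_step` that flips the toy decision; the second call exhausts its budget and reports failure by returning `None`.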


2.2 Cleverhans

Cleverhans [3] is Google’s open-source Python library which provides adversarial attack and adversarial training methods. The library uses TensorFlow to accelerate graph computations performed by many machine learning models. Developers can use Cleverhans to create robust machine learning models by applying adversarial training. The implemented adversarial attacks have two purposes. The first purpose is adversarial training, which requires the generation of adversarial examples during the training procedure. The second purpose is the standardization of adversarial attacks, so that the accuracy of the models in the adversarial setting can be compared. Like IBM’s ART, Cleverhans has a modular architecture. The two most important modules are the attacks and the model module. The attacks module contains the implementation of the Attack class, which is an interface used by all implemented attacks. These adversarial example crafting algorithms take the model and the input image as input and return the adversarial example. The model module contains the implementation of the Model class. All models need to implement this class to be compatible with the implemented attacks.

FGSM in Cleverhans. Fast Gradient Sign Method is implemented in the traditional way in Cleverhans. The adversarial example is computed as follows:

$$\vec{x}^{*} \leftarrow \vec{x} + \epsilon \cdot \nabla_{\vec{x}} J(f, \theta, \vec{x})$$

where $\vec{x}^{*}$ is the generated adversarial example, $\vec{x}$ is the input image and ε is the parameter which defines the degree of the perturbation. For large ε values the likelihood that the adversarial example will be misclassified by the classifier is greater, but the perturbation is easier for humans to detect.



We focus on examining the transferability among state-of-the-art Tensorflow models pre-trained on the ImageNet dataset. In this section, we describe the essential concepts: the adversarial attack, the examined models, the frameworks used, the measurement types and the evaluated dataset.

3.1

The ImageNet database is a widely used dataset among researchers in the field of computer vision. The database is organized according to the hierarchical structure of WordNet. Each node of the structure is represented by thousands of images. According to [4], in 2009 ImageNet had 12 subtrees with 5247 synsets and 3.2 million images in total. The database has grown since then; the last statistical analysis of the dataset is from 2010.

3.2 Image Quality Metrics

The quality of images can degrade during image processing. The cause of the degradation can be blurring, noise, ringing, etc. In our case the degradation is caused by adding a specific noise to the image. Our goal is to measure the quality of the adversarial examples in a way that correlates well with the subjective perception of quality by a human observer. Image quality metrics can also be used to compare image processing algorithms. As we have the original images available, we can use them as references to compare the quality of the adversarial examples. Using a reference image to compute the quality of an image is called a full-reference quality metric. If a reference image without degradation were not available, we would need to use a no-reference quality metric. Usually, no-reference quality metric algorithms are based on statistics.

Full-Reference Image Quality Metrics. Full-reference metrics compare the input image against the original image with no perturbation. We considered using the following metrics:

Mean-Squared Error (MSE). MSE [5] measures the average squared difference between the adversarial and the original pixel values. This metric does not correlate well with the human perception of quality. The formula of MSE is:

$$MSE = \frac{1}{m \cdot n} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i, j) - K(i, j)]^2$$

Peak Signal-to-Noise Ratio (PSNR). PSNR [6] is derived from the mean-squared error. This metric is simple to calculate but, just like the MSE, might not align well with perceived quality. The equation of PSNR is:

$$PSNR = 20 \cdot \log_{10}(MAX_I) - 10 \cdot \log_{10}(MSE)$$

Mean Structural Similarity Index (MSSIM). MSSIM [7] combines the structure, contrast and luminance of the image into a quality score. The MSSIM quality score correlates well with human perception because the human visual system is good at observing the structure of an image. In our experiment, MSSIM is used to represent the quality of the adversarial examples. We compute the SSIM for every adversarial example compared to the original image, as shown in Fig. 1, and compute the mean value for every given epsilon.
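The two simpler full-reference metrics above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `mse` and `psnr` are hypothetical, the images are taken to be arrays of equal shape, and `max_value=255` assumes 8-bit pixel values for the $MAX_I$ term.

```python
import numpy as np


def mse(reference, distorted):
    """Mean squared difference between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    distorted = np.asarray(distorted, dtype=np.float64)
    return float(np.mean((reference - distorted) ** 2))


def psnr(reference, distorted, max_value=255.0):
    """PSNR in dB; max_value is the largest possible pixel value (assumed 8-bit)."""
    error = mse(reference, distorted)
    if error == 0.0:
        return float("inf")   # identical images: no noise at all
    return float(20.0 * np.log10(max_value) - 10.0 * np.log10(error))


original = np.zeros((4, 4))
adversarial = np.full((4, 4), 5.0)   # every pixel shifted by 5 grey levels
print(mse(original, adversarial), psnr(original, adversarial))
```

A uniform shift of 5 grey levels gives an MSE of 25 regardless of where the change occurs in the image, which is one reason these metrics track perceived quality less well than a structural measure.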



Fig. 1. Structural Similarity Index of an Adversarial Example compared to the original image.



Tensorflow. Tensorflow [8] is a widely used open-source software library, most commonly used in the field of machine learning. It consists of state-of-the-art models, tools and libraries that let researchers take ideas from conception to implementation. We used it to implement the generation of adversarial examples and to test these examples on the aforementioned pre-trained image classification models.

Scikit-Image. Scikit-image [9] is an open-source Python library which implements a collection of image processing algorithms. It consists of multiple modules which can be used for image filtering, morphological operations, color space conversion, etc. The most important module of this library for us is the measurement module. In our experiment, scikit-image is used to calculate the Mean Structural Similarity Index between the adversarial examples and the original images.

3.4 Image Classification Models

The problem of image classification can be described by four steps. Firstly, we collect a large amount of images which will be the training dataset and associate each one with a label from the set of classes. Then, the classifier uses this training dataset to generate a model which can recognize each one of the classes. Using a new set of images we evaluate the quality of the model by comparing the true labels of the images with the predicted ones. The following models were used in our experiment:


MobileNetV2. The MobileNetV2 [10] architecture was introduced in 2018. This neural network architecture was designed for use on mobile devices and in resource-constrained environments. The network improved on state-of-the-art computer vision models by notably reducing memory usage and the number of operations while not just retaining but increasing accuracy. The MobileNetV2 model's performance was measured on ImageNet classification, COCO object detection and VOC image segmentation.

InceptionV3. The Inception network was an important milestone in the development of CNN classifiers. The main hallmark of this architecture is the improved utilization of the computing resources inside the network. It used a lot of tricks to push performance in terms of speed and accuracy. Its constant evolution led to the creation of several versions of the network (InceptionV1, InceptionV2 [11], InceptionV3 [12]).

ResNet50V2. ResNet [13] is short for Residual Network, and the 50 at the end refers to the number of layers. In the field of image classification, deep convolutional neural networks have led to a series of breakthroughs. At the beginning there was a trend to give these networks more and more layers. With deeper and deeper networks, more complex tasks could be solved and accuracy increased, but in the long run it turned out that this had multiple drawbacks, such as difficulties in training and vanishing gradients. The ResNetV2 architecture was made to solve these problems.

DenseNet201. The Dense Convolutional Network (DenseNet [14]) architecture was introduced at the end of 2016. The curiosity of this network architecture is that it has $\frac{L \cdot (L+1)}{2}$ direct connections between the layers, in contrast with traditional convolutional networks, which have L connections, one between each layer and the next (where L is the number of layers). DenseNets solve the problem of vanishing gradients, reduce the number of parameters, and encourage feature reuse. DenseNet201 is a neural network pre-trained on the ImageNet dataset and has 201 layers.

3.5 Fast Gradient Sign Method

As demonstrated in [1], the advantages of the Fast Gradient Sign Method (FGSM) are its reliability and speed. Its disadvantage is that reliably successful attacks require more noticeable perturbations. We use the following equation to produce the adversarial example:

$$x_{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(x, y_{true}))$$

In this equation, illustrated in Fig. 2, $x_{adv}$ is the calculated adversarial image, $x$ is the original image, $\nabla_x J$ is the gradient of the loss function with respect to the input, and $y_{true}$ is the true label of the image. The coefficient ε controls how much the input image is perturbed.
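The update rule can be made concrete with a small NumPy sketch. It is not the experiment's TensorFlow code: as an assumption, the classifier here is a toy logistic regression with hypothetical weights `w` and bias `b`, for which the gradient of the cross-entropy loss with respect to the input has a closed form, and the result is clipped to the range [0, 1] on the assumption that inputs are normalized pixel values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_gradient(x, y_true, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. the input x
    for a logistic-regression model p = sigmoid(w . x + b)."""
    return (sigmoid(np.dot(w, x) + b) - y_true) * w


def fgsm(x, y_true, w, b, eps):
    """x_adv = x + eps * sign(grad_x J(x, y_true)), clipped to [0, 1]."""
    gradient = loss_gradient(x, y_true, w, b)
    return np.clip(x + eps * np.sign(gradient), 0.0, 1.0)


# Hypothetical toy model and input (three "pixels"), true class = 1.
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.6, 0.3, 0.5])
y_true = 1.0

x_small = fgsm(x, y_true, w, b, eps=0.1)   # still classified as class 1
x_large = fgsm(x, y_true, w, b, eps=0.2)   # crosses the decision boundary
print(sigmoid(np.dot(w, x) + b),
      sigmoid(np.dot(w, x_small) + b),
      sigmoid(np.dot(w, x_large) + b))
```

Every input component moves by exactly ε in the direction that increases the loss, so the perturbation is bounded in the maximum norm. In this toy case the smaller ε lowers the confidence without changing the decision, while the larger one flips it.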



Fig. 2. Demonstration of Fast Gradient Sign Method [1] applied to GoogLeNet.



The details and the procedure of this experiment are as follows. In the setup phase we gathered 3000 images from the ImageNet dataset. We selected only images that were classified correctly by all of the models. These images were passed to the preprocessing functions of the given models. The preprocessing function center-cropped the images to a dimension of 224 × 224 × 3 in the case of MobileNetV2, DenseNet201 and ResNet50V2, and to a dimension of 299 × 299 × 3 in the case of InceptionV3. For every preprocessed image we crafted an adversarial example on every model with the following values of epsilon:

$$\epsilon = 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1$$

In total, we generated 120000 adversarial examples (3000 images × 4 models × 10 values of ε). Every adversarial example for a given ε value was fed to every image classification model again, to test whether the adversarial attack was successful and whether it was transferable. We calculated the percentage of successful adversarial attacks for every adversarial example crafted on the models. This way, we can measure the success rate of the attack and the transferability of the adversarial examples. At the end of the experiment, for every ε value we calculated the average mean structural similarity index of the adversarial example compared to the original image to represent the quality degradation of the images.

4.1 Transferability of Adversarial Examples

When observing the success rate of the adversarial attacks on a given model, it is obvious that for a higher value of ε (Tables 5, 6, 7, 8, 9, 10) the attack is more successful. For lower ε values (Tables 1, 2, 3, 4) the ranking is not different from the summarized results. The attacks could not reach a success rate of 100% even on the target models. The results show that the MobileNetV2 model is the most prone to transferable adversarial examples. The ranking is, in order: MobileNetV2, InceptionV3, DenseNet201, ResNet50V2.
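The success rates discussed above can be tabulated with a short helper. The sketch below is a hypothetical illustration, not the authors' code: the function name `transfer_success_rates`, the nested-dictionary layout and the toy labels are assumptions. Because all source images were classified correctly beforehand, an attack is counted as successful whenever the predicted label differs from the true label.

```python
def transfer_success_rates(true_labels, predictions):
    """Percentage of adversarial examples that are misclassified.

    predictions[crafted_on][tested_on] is the list of labels that model
    `tested_on` assigns to the examples crafted on model `crafted_on`.
    """
    rates = {}
    for crafted_on, per_target in predictions.items():
        rates[crafted_on] = {}
        for tested_on, predicted in per_target.items():
            fooled = sum(p != t for p, t in zip(predicted, true_labels))
            rates[crafted_on][tested_on] = 100.0 * fooled / len(true_labels)
    return rates


# Toy data: four images, examples crafted on hypothetical model "A".
true_labels = [0, 1, 2, 3]
predictions = {
    "A": {
        "A": [1, 2, 3, 0],   # white-box case: every example fools model A
        "B": [0, 1, 0, 3],   # transfer case: one of four fools model B
    }
}
rates = transfer_success_rates(true_labels, predictions)
print(rates)
```

The diagonal entries (crafted on and tested on the same model) give the plain success rate of the attack, and the off-diagonal entries give the transferability.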



The two oldest model architectures, ResNet and DenseNet, performed well against adversarial examples crafted on different models. Nevertheless, it is not surprising that the MobileNetV2 architecture, which is optimized for resource-constrained environments, came out as the most vulnerable model. In the case of the InceptionV3 model, the adversarial examples crafted on the other models had an original dimension of 224 × 224 × 3, which needed to be transformed to 299 × 299 × 3. As proposed in [15], input transformations could be used as adversarial defenses. Contrary to our predictions, this input transformation was not enough to reduce the number of successful transferable adversarial attacks.

Table 1. Epsilon = 0.01
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.33%
29.50% 29.33%
99.43% 27.57%
23.63% 99.73%
36.47% 37.73%


Table 2. Epsilon = 0.02
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.43%
51.70% 51.50%
99.67% 44.83%
40.87% 99.90%
50.80% 52.83%

Table 3. Epsilon = 0.03
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.53%
64.70% 65.53%
99.63% 56.60%
53.90% 99.83%
60.77% 61.73%


Table 4. Epsilon = 0.04
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.57%
74.13% 74.57%
99.53% 64.00%
64.03% 99.80%
66.37% 68.17%

Table 5. Epsilon = 0.05
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.50%
81.53% 80.63%
99.40% 69.60%
71.10% 99.80%
70.73% 72.57%

Table 6. Epsilon = 0.06
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.37%
87.20% 85.73%
99.43% 74.07%
76.67% 99.80%
75.10% 75.30%

Table 7. Epsilon = 0.07
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.33%
90.30% 89.50%
99.30% 78.50%
81.27% 99.80%
77.37% 78.33%




Table 8. Epsilon = 0.08
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.33%
93.57% 92.23%
99.23% 81.73%
84.33% 99.70%
80.17% 80.90%

Table 9. Epsilon = 0.09
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.43%
95.20% 93.87%
99.23% 84.37%
88.63% 71.07%
86.50% 99.60%
82.20% 83.43%

Table 10. Epsilon = 0.1
Tested on
Crafted on MobileNet ResNet DenseNet Inception
MobileNet 99.63%
96.23% 95.37%
99.13% 86.57%
89.37% 99.57%
84.53% 85.37%


Quality Degradation

The only reason to measure the quality of the adversarial examples compared to the original images is to show how easily humans could detect perturbations in an image. The level of quality degradation also correlates with the success of the adversarial attack. A low structural similarity index value substantially increases the chances of perturbing the features of the original class. According to the results shown in Table 11, the most similar adversarial examples can be crafted on the InceptionV3 model, followed, in order, by DenseNet201, ResNet50V2 and MobileNetV2. Note that the difference between these values is minimal.



Table 11. Average mean structural similarity index
Epsilon
Crafted on MobileNet ResNet DenseNet Inception
96.33% 96.33%
88.06% 88.06%
79.00% 79.00%
70.78% 70.78%
63.79% 63.79%
57.87% 57.87%
52.88% 52.88%
48.56% 48.56%
44.84% 44.84%
41.53% 41.53%


Future Work

Based on this experiment, there are still a handful of state-of-the-art models which could be examined from a transferability point of view, with multiple attacks rather than just one. During our research we found multiple machine learning platforms and libraries focusing on adversarial attacks and defenses, but none of them focused on testing the transferability of adversarial examples. Based on the architecture of IBM’s ART or Google’s Cleverhans, it would be useful to implement a module aimed at adversarial transferability.



We have conducted an analysis of the transferability of the Fast Gradient Sign Method attack across multiple open source image classification models. We have shown that, using one of the simplest adversarial attacks in a black-box setting, we can achieve a transferability success rate of 19.40% with small perturbations and 96.23% with noticeable perturbations. As many commercial applications rely on these architectures, this is a major security risk in safety-critical environments. The lesson for system designers is to strengthen their models with defense methods, such as adversarial training or adversarial example detection.

Acknowledgment. The project has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).



References

1. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
2. Nicolae, M.-I., et al.: Adversarial Robustness Toolbox v0.4.0. arXiv preprint arXiv:1807.01069 (2018)
3. Papernot, N., et al.: cleverhans v2.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768 10 (2016)
4. Deng, J., et al.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE (2009)
5. Wikipedia contributors: Mean squared error. In Wikipedia, The Free Encyclopedia, 8 January 2020. squared error&oldid=934796877. 13 Jan 2020
6. Johnson, D.H.: Signal-to-noise ratio. Scholarpedia 1(12), 2088 (2006)
7. Wang, Z., et al.: Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
8. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016) (2016)
9. Van der Walt, S., et al.: scikit-image: image processing in Python. PeerJ 2, e453 (2014)
10. Sandler, M., et al.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
11. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
12. Szegedy, C., et al.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
13. He, K., et al.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. Springer, Cham (2016)
14. Huang, G., et al.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
15. Guo, C., et al.: Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117 (2017)

Design of an Automatic System to Determine the Degree of Progression of Diabetic Retinopathy

Hernando González, Carlos Arizmendi, and Jessica Aza

Universidad Autónoma de Bucaramanga, Bucaramanga, Colombia
{hgonzalez7,carizmendi,jaza}

Abstract. This paper proposes the analysis and detection of diabetic retinopathy using artificial vision techniques, such as filtering, transforms, edge detection and segmentation, applied to color fundus images in order to recognize and categorize microaneurysms, hemorrhages and exudates. The algorithms were validated with the DIARETDB database. Descriptors are extracted from the processed images for the design of two classifiers, the first based on support vector machines and the second on neural networks.

Keywords: Diabetic retinopathy · Microaneurysm · Hemorrhages · Exudates · Neural network · Support Vector Machine

1 Introduction

Diabetic retinopathy is one of the most common complications of diabetes. It is the third leading cause of irreversible blindness in the world and the leading cause among people of productive age (16 to 64 years); people who have had type 1 or type 2 diabetes for more than 10 years have a 50% to 60% probability of developing it. Health professionals specialized in ophthalmology identify the condition by physical observation of the eye, using methods such as the slit lamp examination and an eye fundus camera with dilated pupils [1]. These forms of diagnosis are time-consuming and often require fluorescein angiography or optical coherence tomography to confirm the degree of diabetic retinopathy. Because the condition often causes no discomfort and its symptoms are perceived only after it has advanced considerably, the medical assessment is delayed. This delay contributes to the growing incidence of diabetic retinopathy, which is why early detection and long-term monitoring are so important for preventing and controlling damage to the eye and stopping the loss of vision.

For timely detection, research has therefore trended towards methodologies that apply filtering, segmentation and edge detection techniques to retinal fundus images in order to establish the degree of diabetic retinopathy. The three major changes that can occur in the retina due to diabetic retinopathy are exudates, hemorrhages and microaneurysms. The presence of one or more of these lesions in the retina indicates diabetic retinopathy. Exudate detection methods include both supervised and unsupervised techniques; in [2, 3], an unsupervised methodology is used to detect hard exudates. Lachure et al. [4] treat retinal microaneurysms, hemorrhages, exudates and cotton wool spots as the abnormalities to be found in fundus images. After preprocessing, morphological operations are performed to find microaneurysms, and features such as the GLCM (gray-level co-occurrence matrix, used for texture analysis) and structural features are extracted for classification. Their SVM classifier is reported to reach 100% and 90% sensitivity.

The rest of this paper is organized as follows. Section 2 discusses the different methods for processing the images of the DIARETDB database, Sect. 3 discusses the design of the classifier and Sect. 4 concludes the paper.

© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 35–44, 2021.

2 Digital Image Processing

To determine the presence of diabetic retinopathy, the characteristics that distinguish it must be specified. In this paper, the microaneurysms, hemorrhages and exudates of the fundus image are taken as the discriminating features. Microaneurysms are located within the inner nuclear layer of the retina and are observed as small, rounded red spots with well-defined smooth edges. Intraretinal hemorrhages are produced by the rupture of microaneurysms, capillaries or venules, and their shape depends on their location in the layers of the retina. They can be deep (red, small and rounded, with irregular edges, located in the middle layers of the retina) or superficial (elongated or flame-shaped, located in the nerve fiber layer). Exudates correspond to lipids extravasated from retinal vessels whose permeability is increased. They are divided into hard and soft (cottony) exudates. Hard exudates are white or yellowish (waxy) deposits of lipids and lipoproteins of variable size, with irregular but sharp borders. They appear isolated or grouped, in the form of a star, a ring (partial or complete) or compact plaques. Soft exudates are round or oval, white or yellowish cottony deposits with imprecise edges, located superficially in the nerve fiber layer and caused by capillary occlusion.

2.1 DIARETDB Database

The DIARETDB (Standard Diabetic Retinopathy Database) database was compiled in 2007 as part of the IMAGERET project. It comprises two sub-databases, DIARETDB0 and DIARETDB1. The present work uses DIARETDB1 (diabetic retinopathy database 1), which consists of 89 retinal images taken at the Kuopio University Hospital. The images are available in PNG format, with dimensions of 1500 × 1152 pixels and a 50° field of view (FOV).
The database was analyzed and classified according to the criteria established in the Early Treatment Diabetic Retinopathy Study (ETDRS). The classification criteria are:
– Without retinopathy. There are no signs.
– Very mild. Only microaneurysms.
– Mild. Any or all of the following: microaneurysms, retinal hemorrhages, exudates.
– Moderate. Severe retinal hemorrhages, vascular anomalies, venous ligation.
– Severe. Rule 4-2-1, that is, one or more of: severe retinal hemorrhages in 4 quadrants, venous thickening in 2 or more quadrants, intraretinal microvascular abnormalities in one or more quadrants.
– Very severe. Two or more of the criteria for severe.

From the database, 77 images were selected and classified into three categories: 35 very mild, 26 mild and 16 moderate. The proposed classification system consists of five phases: preprocessing of the image; extraction of blood vessels, optical disc and fovea; segmentation of exudates; segmentation of microaneurysms and hemorrhages; and, finally, classification of the images. Figure 1 shows the proposed methodology, which begins with the preprocessing of the image. Preprocessing yields three layers. The first, called dark lesions, consists of microaneurysms, hemorrhages, blood vessels and the fovea. The second, blood vessels, contains only the blood vessels. The third, exudate lesions, includes the exudates and the optical disc. The green component (G) is favorable for distinguishing dark lesions from exudate lesions.

Fig. 1. Methodology’s diagram to segment classification characteristics of the image

2.2 Blood Vessels

The methodology is based on the work of Wang [5] and Shahbeig [6], which uses morphological processes to segment the blood vessels.
– G channel with adaptive CLAHE. One of the challenges is the illumination of the ocular fundus. To mitigate this problem, the histogram is equalized according to the light exposure of each image. To remove some of the noise present in the image, a 3 × 3 median filter is applied.
– Intensity adjustment. An intensity adjustment is applied to the resulting image to bring out as many blood vessels as possible.
– Regional minimum. It is computed in order to eliminate, by morphological reconstruction, the possible microaneurysms that appear in the image, since they are considered noise at this stage of the segmentation.
– Morphological reconstruction. This is a morphological transformation that uses dilation (1) or erosion (2) to fit one image within the edges of another. The process uses a binary marker and a mask. The marker is dilated or eroded as many times as necessary until its pixels match those of the mask. Then an AND operation (for dilation) or an OR operation (for erosion) is applied to find the pixels that fit inside the mask. In (1) and (2), A is the image, B is the structuring element, which is reflected about its origin and translated across the image, and z is the position at which B is placed.

$$A \oplus B = \{\, z \mid (\hat{B})_z \cap A \neq \varnothing \,\} \qquad (1)$$

$$A \ominus B = \{\, z \mid (\hat{B})_z \subseteq A \,\} \qquad (2)$$

– Bottom-hat. Another morphological operation, the bottom-hat transform (4), is applied to the resulting image to improve the contrast of the dark pixels, especially the blood vessels. It consists of a morphological closing (3), which starts with a dilation using a binary structuring element, followed by an erosion with the same element. The original image is then subtracted from the result of the closing. In this case, the structuring element is a disk of radius 20.

$$A \bullet B = (A \oplus B) \ominus B \qquad (3)$$

$$T_{\mathrm{bh}}(A) = (A \bullet B) - A \qquad (4)$$

– Contrast and illumination adjustment. The contrast is adjusted so that a greater number of vessel filaments stand out for segmentation.
– Gauss filter. Essentially a way to smooth the image in order to eliminate noise. In two dimensions, the contours of the filter surface are concentric circles centered on the origin. In (5), x and y are the distances from the origin along each axis and σ is the standard deviation of the Gaussian distribution.

$$G(x, y) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} \qquad (5)$$


– Binarization. The image is thresholded: according to the chosen parameter, each pixel is set to one or zero depending on its intensity. Figure 2 shows the complete process.

Fig. 2. Segmentation process of blood vessels. a) Regional minimum. b) Morphological reconstruction. c) Adaptive equalization of the histogram with limited contrast (CLAHE). d) Bottom-hat transformation. e) Image with intensity adjustment. f) Binarized image.

2.3 Optical Disk and Fovea

The main reason to segment the fovea is that its shape and intensity are easily confused with those of a retinal hemorrhage, which affects the classification. The same happens with the optical disc and the exudates. To find the optical disk, a Sobel filter is applied to Fig. 2(f). The resulting image is divided into five sections using masks created from the bottom of the image. The goal is to find the section with the fewest elements corresponding to blood vessels; the optical disc lies in that region. This region is binarized, the centroid of the largest object is taken, and a dilation is applied in the x and y directions. The object with the largest area is then selected; its centroid serves as the marker for a reconstruction by dilation.

Fig. 3. Optical disk a) Region with minimum blood vessels. b) Binarized image. c) Segmented image

The fovea is one of the elements present in the ocular fundus and is considered one of the most important for monitoring diabetic retinopathy, but it is not used in the proposed classification with this database. Even so, it must be removed so that it is not mixed up with hemorrhages and microaneurysms. Adarsh [7] proposed locating the fovea through parameters that delimit the area in which it lies, as indicated in Fig. 3. Starting from the centroid of the optical disk, a distance of 2 to 3 optical disc diameters is measured to define the search region, as illustrated in Fig. 4. The fovea lies 10° below the centroid of the optic disc.

Fig. 4. Illustration of the distance between the optical disc and the fovea

2.4 Microaneurysms and Hemorrhages

In terms of luminance, microaneurysms and hemorrhages lie in the same plane as the blood vessels. The steps are:
– Illumination adjustment. A luminance adjustment is applied to widen the difference between dark and light pixels.
– Regional minimum and morphological reconstruction. As in the blood vessel methodology, a morphological reconstruction is performed using the result of the regional minimum as the mask.
– CLAHE. Adaptive histogram equalization is applied to increase the contrast of the image and to reveal the microaneurysms in the ocular fundus.
– Pre-lesion segmentation. The dark lesions are segmented, that is, blood vessels, microaneurysms and hemorrhages.
– Fovea and vessel elimination. The blood vessel and fovea layers are subtracted from the image resulting from the previous step. Only the possible microaneurysms and hemorrhages remain.
– Dilation and morphological reconstruction. The image resulting from the previous step still contains elements that are not microaneurysms or hemorrhages. To eliminate these pixels, another morphological reconstruction is carried out.
– Binarization. Finally, the output image of the morphological reconstruction is binarized. Objects of between 6 and 20 pixels with a roundness greater than 0.75 are classified as microaneurysms; the rest are hemorrhages. Figure 5 shows the results of this process.

Fig. 5. Microaneurysms and hemorrhages. a) Image with intensity adjustment. b) Morphological reconstruction. c) Adaptive equalization of the histogram with limited contrast (CLAHE). d) Pre-lesion segmentation. e) Binary image with microaneurysms. f) Binary image with hemorrhages.

2.5 Exudates

Visually, the exudates are clearly distinguishable from the ocular background: they are yellowish or white and vary in size, shape and location. There are several methodologies to segment exudates. Sinthanayothin detected diabetic retinopathy with a region growing algorithm [8], Usher detected exudate candidates using a combination of adaptive intensity contrast thresholds [9], and Zheng detected exudates using thresholding and a region growing algorithm [10]. Based on these methodologies, the proposed approach uses morphological operations; Fig. 6 shows its results.
– Backlight adjustment. The green channel of the RGB image is used because it has the highest contrast and favors the exudates. The image is then passed through a median filter to smooth the background, and the result is subtracted from the original image in order to obtain a uniform background. Finally, a luminance adjustment is applied to widen the difference between dark and light pixels.
– Adapted Otsu binarization. As a preliminary step, the pixels in the upper 20% of the intensity range of the image from the previous step are segmented. To do this, a histogram analysis is made based on the mean and the standard deviation of the image. The image is then binarized according to (6), where M(I) is the mean of the image, STD(I) is its standard deviation and TH(I) is its Otsu threshold.

$$Th = \frac{0.8\, M(I) + STD(I) + TH(I)}{255} \qquad (6)$$

– Optical disc removal. A morphological reconstruction is applied and the optical disc is removed. The new image contains only the exudates.
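The closing and bottom-hat operations of Eqs. (3) and (4) in Sect. 2.2, which underlie the morphological steps used throughout Sect. 2, can be illustrated with a minimal sketch. It is not the authors' Matlab implementation: it assumes a grayscale image stored as a list of rows and, for brevity, replaces the disk of radius 20 with a small square window. All function names are hypothetical.

```python
# Sketch only: a small square window stands in for the disk-shaped
# structuring element of radius 20 described in Sect. 2.2 (an assumption).

def _window_op(image, half, reduce_fn):
    """Apply reduce_fn (max or min) over a square window, clipped at the borders."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            values = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            row.append(reduce_fn(values))
        out.append(row)
    return out


def dilate(image, half=1):
    """Grayscale dilation with a flat square element (local maximum)."""
    return _window_op(image, half, max)


def erode(image, half=1):
    """Grayscale erosion with a flat square element (local minimum)."""
    return _window_op(image, half, min)


def closing(image, half=1):
    """Eq. (3): dilation followed by erosion."""
    return erode(dilate(image, half), half)


def bottom_hat(image, half=1):
    """Eq. (4): closing minus the original image; dark details become bright."""
    closed = closing(image, half)
    return [[c - a for c, a in zip(crow, arow)] for crow, arow in zip(closed, image)]


# A bright 5 x 5 background with a single dark pixel in the center.
image = [[10] * 5 for _ in range(5)]
image[2][2] = 2
result = bottom_hat(image)
```

In this toy image the closing fills the dark pixel, so the bottom-hat output is non-zero only at that position, which is the behavior the paper relies on to enhance dark vessels.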



Fig. 6. Exudates. a) Backlight adjustment. b) Adapted Otsu binarization. c) Optical disc removal.
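The size and roundness criterion stated in the binarization step of Sect. 2.4 can be sketched as follows. The paper does not define roundness, so this sketch assumes the common definition 4πA/P² (area A, perimeter P), and it assumes that the bounds of 6 and 20 pixels are inclusive; the function names are hypothetical.

```python
import math


def roundness(area, perimeter):
    """Assumed definition: 4*pi*area / perimeter**2 (1.0 for a perfect circle)."""
    return 4.0 * math.pi * area / (perimeter ** 2)


def classify_lesion(area, perimeter):
    """Label a segmented dark object using the criterion of Sect. 2.4."""
    # Assumption: the 6 and 20 pixel bounds are inclusive.
    if 6 <= area <= 20 and roundness(area, perimeter) > 0.75:
        return "microaneurysm"
    return "hemorrhage"


small_round = classify_lesion(area=12, perimeter=13)     # compact, small object
large_object = classify_lesion(area=150, perimeter=50)   # too large
small_ragged = classify_lesion(area=12, perimeter=20)    # small but not round
```

Under these assumptions only the first object satisfies both conditions; the other two fall into the hemorrhage class.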

3 Classifier Design

The DIARETDB database contains images in which experts have marked hemorrhages, exudates and microaneurysms. For the classification, the number of microaneurysms and the areas of hemorrhages and exudates in the 77 images are used as features; 70% of the images are used for training and 30% for testing. The test group consists of 23 images, of which twelve belong to class 1, five to class 2 and six to class 3.

3.1 Support Vector Machine (SVM)

Support vector machines (SVMs) are a set of related supervised learning methods used for classification and regression, and belong to the family of generalized linear classifiers. In other words, an SVM is a classification and regression tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding overfitting. SVMs can be defined as systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. To design the SVM classifier, the input data must be normalized to the range from 0 to 1. Once the inputs are normalized, the classifier is trained using the Matlab SVM toolbox.

3.2 Artificial Neural Network (ANN)

A neural network is an information processing system consisting of elements called nodes or neurons, organized in layers and joined by weighted connections. The neural network has two phases, one for training and the other for testing. Training is carried out iteratively, adjusting the weights from the input values with the aim of minimizing the error between the response of the network and the desired output. As the network is trained, it may suffer from overfitting and cease to generalize what it has learned to new cases. Therefore, when a multilayer perceptron is used, a randomly selected group of data is set aside for validation. The classifier was designed with the Matlab Neural Networks toolbox. The network obtained, with an error of less than 5%, has three hidden layers: the first with 25 neurons, the second with 40 and the third with 70.
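The data preparation described in this section (scaling each feature to the range from 0 to 1 and splitting the 77 images 70/30) can be sketched as below. This is an illustration under stated assumptions, not the authors' Matlab code: min-max scaling per feature is assumed as the normalization, the split is a simple seeded shuffle, and the feature values are made up for the example.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle with a fixed seed and split into training and test groups."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical feature rows: (microaneurysm count, hemorrhage area, exudate area).
features = [[0, 10, 5], [5, 20, 5], [10, 40, 5]]
normalized = min_max_normalize(features)

# With 77 images, a 70/30 split gives 54 training and 23 test images.
train_ids, test_ids = split_train_test(list(range(77)))
```

The 23 test images produced by this split size agree with the test group size reported above; the class balance of the split is not controlled in this sketch.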



3.3 Results

The results obtained when testing the database with the SVM and ANN classifiers are favorable. The confusion matrices for the test group show a classification level of 87% for the SVM classifier and 90.9% for the ANN classifier. Figures 7 and 8 show the confusion matrices of the SVM and ANN classifiers, respectively.

Fig. 7. Confusion matrix of SVM classifier

Fig. 8. Confusion matrix of ANN classifier
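For readers unfamiliar with how a single classification level is read off a confusion matrix, the following sketch computes the overall accuracy as the sum of the diagonal divided by the total number of samples. The matrix below is a toy example invented for illustration; it does not reproduce the values of Fig. 7 or Fig. 8.

```python
def overall_accuracy(confusion):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Toy 3-class matrix (rows: true class, columns: predicted class); not the paper's data.
toy_matrix = [
    [5, 1, 0],
    [0, 4, 1],
    [1, 0, 3],
]
toy_accuracy = overall_accuracy(toy_matrix)
```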



4 Conclusion

A great variety of databases is available for examining and verifying segmentation methods, but many of them include multiple pathologies, which affects whether an image can be classified with respect to diabetic retinopathy. One of the challenges of the present work was to keep the images coherent in terms of lighting, noise and contrast. Although techniques were used to mitigate this difficulty, the result was not as expected in some images. The processes used were mostly morphological because, in the experiments, they generated less distortion, noise and loss of detail, which would otherwise have hindered the definition of parameters. The results show that the ANN classifier performs better than the SVM classifier; one reason is that the network performs better on large datasets, although its computational cost is higher.

Acknowledgment. The authors thank the IMAGERET project, Lappeenranta University of Technology, Finland, for providing a public database for diabetic retinopathy with expert ground-truth markings.

References
1. Saine, P.J., Tyler, M.E.: Ophthalmic Photography: Retinal Photography, Angiography, and Electronic Imaging, 2nd edn. Butterworth-Heinemann Medical (2002). ISBN 0-7506-7372-9
2. Sanchez, C.I., Hornero, R., Lopez, M.I., Aboy, M., Poza, J., Abásolo, D.: A novel automatic image processing algorithm for detection of hard exudates based on retinal image analysis. Med. Eng. Phys. 30, 350–357 (2008)
3. Sopharak, A., Uyyanonvara, B., Barman, S.: Automatic exudate detection from nondilated diabetic retinopathy retinal images using fuzzy c-means clustering. Sensors 9(3), 2148–2161 (2009)
4. Lachure, J., Deorankar, A.V., Lachure, S., Gupta, S., Jadhav, R.: Diabetic retinopathy using morphological operations and machine learning. In: IEEE International Advance Computing Conference (IACC) (2015)
5. Wang, H., Hsu, W., Goh, K., Lee, M.: An effective approach to detect lesions in color retinal images. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000, June 2000
6. Shahbeig, S.: Automatic and quick blood vessels extraction algorithm in retinal images. IET Image Process. 7(4) (2013)
7. Adarsh, P.: A novel approach for diagnosis and severity grading of diabetic maculopathy. In: Conference: Advances in Computing, Communications and Informatics (ICACCI) (2013)
8. Sinthanayothin, C., Kongbunkiat, V., Phoojaruenchanachai, S., Singalavanija, A.: Automated screening system for diabetic retinopathy. In: Proceedings of the 3rd International Symposium on Image and Signal Processing and Analysis, pp. 915–920 (2003). ISPA.2003.1296409
9. Usher, D., Dumskyj, M., Himaga, M., Williamson, T.H., Nussey, S., Boyce, J.: Automated detection of diabetic retinopathy in digital retinal images: a tool for diabetic retinopathy screening. Diabet. Med. 21(1), 84–90 (2004)
10. Zheng, L., Opas, C., Krishnan, S.M.: Automatic image analysis of fundus photograph. In: Proceedings of the 19th International Conference on Engineering in Medicine and Biology, vol. 2, pp. 524–525 (1997)

Adaptive Attention Mechanism Based Semantic Compositional Network for Video Captioning

Zhaoyu Dong1, Xian Zhong1,2(B), Shuqin Chen1, Wenxuan Liu1, Qi Cui3, and Luo Zhong1,2

1 School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
[email protected]
2 Hubei Key Lab of Transportation Internet of Things, Wuhan University of Technology, Wuhan, China
3 China Construction Third Bureau Green Industry Investment Co., Ltd., Wuhan, China

Abstract. The video captioning task is to generate text that describes the content of a video. To generate proper descriptions, many researchers have begun to add explicit semantic information to the caption generation process. However, in some existing methods the semantic information plays a smaller and smaller role as decoding proceeds. In addition, decoders apply temporal attention to all generated words, both visual and non-visual, which can produce inaccurate or even wrong results. To overcome these limitations, 1) we detect visual features from each video frame to compose semantic tags and introduce a semantic compositional network in the decoding stage, using the probability of each semantic object as an additional parameter in the long short-term memory (LSTM) so that the semantic tags play a stronger role; and 2) we combine two levels of LSTM with a temporal attention mechanism and an adaptive attention mechanism, respectively. On this basis we propose an adaptive attention mechanism based semantic compositional network (AASCNet) for video captioning. Specifically, the framework uses the temporal attention mechanism to select specific visual features for predicting the next word, and the adaptive attention mechanism to determine whether the prediction depends on visual features or on context information. Extensive experiments on the MSVD video captioning dataset demonstrate the effectiveness of our method compared with state-of-the-art approaches.

Keywords: Video captioning · Semantic Compositional Network · Temporal and Adaptive attention mechanism




1 Introduction

Recently, image and video retrieval [1] have made great progress, and the task of video captioning has gradually begun to attract attention. Video captioning is the task of generating text from sequence data. Some previous work focuses on describing images in natural language [2–5]; video captioning is probably more challenging, because a video contains not only spatial features but also temporal correlations. Moreover, the objects in the video and the interactions between them need to be understood. To this end, several areas of computer science, such as computer vision and natural language processing, are working together to meet the challenge. Video captioning has high research value, with applications such as answering questions about images and assisting people with visual impairments.

The encoder-decoder framework is currently widely used in video description tasks. Some methods provide the semantic concept to the decoder only as an initialization step. After several time steps, the impact of the semantic concept becomes smaller and smaller, so the integration of semantic concepts into the LSTM-based decoding process is limited. In addition, visual attention models for video captioning use video frame features at every time step without explicitly considering the relationship between the predicted words and the visual information. For example, some objects and actions are obviously visual words, whereas function words such as "the", "a" and "of" are non-visual, and using visual information to predict them can even be misleading.

To solve these problems, we combine a high-level adaptive attention mechanism [6] with a semantic compositional network [7] (AASCNet) for video captioning. The overall process of our method is shown in Fig. 1. First, we use a CNN to extract video features and the related semantic tags. The tag probabilities are then used as parameters of the LSTM to compute the output of the intermediate hidden layer, and the temporal attention mechanism assigns weights to the visual information to obtain the most relevant visual information at every time step. Finally, the adaptive attention mechanism, combined with a multilayer perceptron (MLP) layer, is used to generate visual words based on visual information only. Non-visual words are generated using natural language templates.

Fig. 1. The AASCNet model uses a convolutional neural network (CNN) to extract video features and semantic tags during the encoding phase. The decoding phase combines an LSTM with a temporal attention mechanism and one with an adaptive attention mechanism.

In this paper, our contributions are threefold:
– To ensure that semantic tags are applied throughout the entire LSTM process, we use a semantic compositional network in the decoding stage.
– To apply attention only to visual words, we further combine the hierarchical LSTM with an adaptive attention mechanism in the decoding stage.
– We propose a novel adaptive attention mechanism based on a semantic compositional network (AASCNet). Extensive results on a benchmark dataset show that the problems above are well addressed by the proposed compositional structure and that considerable benefits for video captioning are achieved.

Related work is introduced in the second section, the proposed method in the third section, and the details of the experiments and the analysis of the results in the fourth section; the final section summarizes the paper.

© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 45–55, 2021.


2 Related Work

Early video captioning relied on language-model methods [8–11], which generate a description from preset templates and specific syntax rules. The structure of such descriptions is rigid because of the template and does not match everyday language. Recently, the mainstream approach to video captioning has been the encoder-decoder architecture. Generally, the encoder uses a CNN to generate a semantic representation of the video frames. For a given video sequence, these representations are further fused to exploit the temporal dynamics of the video and produce a video representation. The decoder usually uses an LSTM or a gated recurrent unit (GRU), which generates sentence fragments one by one and combines them into a sentence. The authors of [12] present the first sequence learning model for video-to-text generation. The model learns a latent "meaning" state and a fluent grammar model of the language at the same time, and the results are verified on a YouTube clips dataset. In [13], an attention mechanism is introduced to assign a weight to each frame's features, which are then fused according to the attention weights to exploit the temporal structure of the video. In [10], a transmission unit is proposed that extracts the semantic attributes of the image and the mean-pooled visual feature of the video and adds them to the video representation as supplementary information, further improving the performance of the video captioning model. In [11], a multimodal memory network, M3-VC for short, is proposed that models long-term visual-textual dependence and guides visual attention. Other previous work has also modified the encoder-decoder framework, for example by enhancing the visual features, by using attention to obtain higher-level visual features, or by introducing additional units. Here, our idea is to combine semantic tags with an adaptive attention mechanism, which is expected to further improve the quality of video captions.



On the basis of the semantic compositional network, we combine an adaptive attention mechanism with the traditional temporal attention mechanism. The semantic compositional network brings semantic tags and video features into the decoding process, and a two-level LSTM with attention guides the decoding at the same time. The first level selects a specific frame to predict the related words, and the second level ensures that visual information is used only to predict visual words. Specifically, the hidden state of each time step is first calculated by combining each frame-level feature with the semantic tags in the semantic compositional network. Then, the context vector is obtained through the first-level LSTM with the temporal attention mechanism, and the adjusted context vector is obtained through the second-level LSTM with the adaptive attention mechanism. Finally, the adjusted context vector and the hidden state are used together for word generation.

3.1


The encoding phase includes the detection of semantic tags and the extraction of video features. To detect tags from video frames, we first need to obtain the semantic labels from the reference descriptions. Following [14], we use the M most frequent words in the reference descriptions as tags, including common objects, actions and connecting words. Inspired by [15,16], we train a multi-label classification model to detect the semantic tags in a given video. We select K images as training samples, and $y_i = [y_{i1}, \ldots, y_{iM}] \in \{0, 1\}^M$ is the tag vector of the i-th image: if tag m appears in the descriptive text of the video frame, then $y_{im} = 1$; otherwise $y_{im} = 0$. We denote the feature vector of the video frame and the vector of semantic tag probabilities for the i-th image as $v_i$ and $s_i$, respectively. We define the loss function as:

$$\frac{1}{K}\sum_{i=1}^{K}\sum_{m=1}^{M}\left[\, y_{im}\log s_{im} + (1 - y_{im})\log(1 - s_{im}) \,\right] \qquad (1)$$

where $s_i = \sigma(f(v_i))$ is an M-dimensional vector with $s_i = [s_{i1}, \ldots, s_{iM}]$, $\sigma(\cdot)$ is the logistic sigmoid function and $f(\cdot)$ is implemented as an MLP.

Video feature extraction is used to generate feature vectors that represent the video and capture the relevant visual information for the decoder.
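The multi-label tag objective in Eq. (1) can be evaluated with a short sketch. It assumes that the tag probabilities $s_{im}$ have already been produced by the sigmoid output of the MLP and lie strictly between 0 and 1; the function name and the sample values are hypothetical. As printed, the expression is a log-likelihood and is never positive, so a training loop would presumably minimize its negative (the usual binary cross-entropy); that reading is an assumption.

```python
import math


def tag_objective(y, s):
    """Eq. (1): (1/K) * sum_i sum_m [y_im*log(s_im) + (1 - y_im)*log(1 - s_im)]."""
    k = len(y)  # number of training images K
    total = 0.0
    for y_i, s_i in zip(y, s):
        for y_im, s_im in zip(y_i, s_i):
            total += y_im * math.log(s_im) + (1 - y_im) * math.log(1.0 - s_im)
    return total / k


# Two images (K = 2), two tags (M = 2); values are illustrative only.
y = [[1, 0], [0, 1]]
uninformative = [[0.5, 0.5], [0.5, 0.5]]
confident = [[0.9, 0.1], [0.1, 0.9]]

value_uninformative = tag_objective(y, uninformative)
value_confident = tag_objective(y, confident)
```

Predictions that agree with the tag vectors give a value closer to zero than uninformative ones, which is the direction training is expected to push.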





Recurrent neural networks (RNNs) are generally used for processing sequence data. An RNN takes into account the association between the elements of a sequence at every time step and is widely used in speech recognition, machine translation and text generation. However, training an RNN suffers from the problems of long-term dependence and vanishing gradients. To overcome them, Hochreiter et al. [25] improved the structure of the traditional RNN and proposed the LSTM network. The LSTM unit consists of three gates (input gate $i_t$, output gate $o_t$ and forget gate $f_t$), a memory unit $m_t$ and an output unit. A gate is a way of selectively passing information: it maps its input to a value between 0 and 1, and this value controls the amount of information transmitted. When the gate outputs 1, the gate is fully open and all information passes; when it outputs 0, the gate is fully closed and no information passes. The specific definitions are as follows:

$$i_t = \delta(W_i y_t + U_i h_{t-1} + b_i) \tag{2}$$
$$f_t = \delta(W_f y_t + U_f h_{t-1} + b_f) \tag{3}$$
$$o_t = \delta(W_o y_t + U_o h_{t-1} + b_o) \tag{4}$$
$$g_t = \varphi(W_g y_t + U_g h_{t-1} + b_g) \tag{5}$$
$$m_t = f_t \odot m_{t-1} + i_t \odot g_t \tag{6}$$
$$h_t = o_t \odot \varphi(m_t) \tag{7}$$

where $W$ and $U$ are the transformation matrices and $b$ the bias vectors learned during training, $y_t$ is the input vector of the LSTM unit at time t, $\delta$ is the sigmoid activation function, which maps a real number to the range (0, 1), $\varphi$ is the tanh function, and $\odot$ is the element-wise product with the gate value. $m_t$ represents the long-term memory and $h_t$ the short-term memory.

3.2.1 Semantic Compositional Network

The decoder of the semantic compositional network uses an LSTM unit in which each weight matrix of the traditional LSTM is expanded into a set of weight matrices according to the tags. The final result depends on the tags and on the probability of their presence in each video frame. The model lets the tags participate in the calculation at every time step of the LSTM, thus further enhancing performance. The model feeds both the visual features V and the semantic concept vector S into the LSTM, which is expected to provide richer video content for the caption generation process. At the same time, the whole visual context V is used to initialize the state of the LSTM, and the K groups of LSTM parameters are combined, weighted by the semantic concept vector S, to generate a caption.
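As an illustration of the gate equations (2)-(7), the following NumPy sketch performs LSTM steps on random data. It assumes the standard cell update, in which the forget gate scales the previous memory $m_{t-1}$; the parameter dictionary, the sizes and the helper names are hypothetical and not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(y_t, h_prev, m_prev, p):
    """One LSTM step: three sigmoid gates, a tanh candidate, memory and output."""
    i = sigmoid(p["Wi"] @ y_t + p["Ui"] @ h_prev + p["bi"])   # input gate
    f = sigmoid(p["Wf"] @ y_t + p["Uf"] @ h_prev + p["bf"])   # forget gate
    o = sigmoid(p["Wo"] @ y_t + p["Uo"] @ h_prev + p["bo"])   # output gate
    g = np.tanh(p["Wg"] @ y_t + p["Ug"] @ h_prev + p["bg"])   # candidate
    m = f * m_prev + i * g                                    # long-term memory
    h = o * np.tanh(m)                                        # short-term memory
    return h, m

def init_params(n_in, n_hid, rng):
    """Small random weights and zero biases for the four transformations."""
    p = {}
    for gate in "ifog":
        p["W" + gate] = rng.normal(scale=0.1, size=(n_hid, n_in))
        p["U" + gate] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        p["b" + gate] = np.zeros(n_hid)
    return p

rng = np.random.default_rng(0)
params = init_params(n_in=3, n_hid=4, rng=rng)
h, m = np.zeros(4), np.zeros(4)
for y_t in rng.normal(size=(5, 3)):      # a toy sequence of five input vectors
    h, m = lstm_step(y_t, h, m, params)
```

Because the output is a gate value in (0, 1) multiplied by a tanh, every component of h stays strictly inside (-1, 1), whatever the sequence length.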


Z. Dong et al.


Two-Level LSTM with Temporal Attention and Adaptive Attention

We add an adaptive attention mechanism to the traditional temporal attention mechanism and use a two-level LSTM with attention as the decoder. The first-level LSTM, with temporal attention, selects the video frames associated with the word being generated at the current moment. The second-level LSTM, with adaptive attention, guides the model to use visual features to generate visual words and sentence context information to generate non-visual words. The first-level LSTM decodes the video frame features. It takes the current word $y_t$, the output of the previous moment $h_{t-1}$ and the previous memory unit $m_{t-1}$ as input to obtain the output $h_t$ at time t:

$$h_1, m_1 = [W^h; W^c]\,\mathrm{Mean}(v_i) \tag{8}$$
$$h_t, m_t = \mathrm{LSTM}(y_t, h_{t-1}, m_{t-1}) \tag{9}$$

where $y_t$ represents the word features of a single word, $v_i$ represents the set of all visual features, and $W^h$ and $W^c$ are learnable parameters.

Fig. 2. Structure of two-level attention mechanism.

The structure of this first-level LSTM with the temporal attention mechanism is shown in Fig. 2. To cope with the variability of video length, we calculate the average value of the features in the video and use the hidden state $h_t$ generated in the semantic compositional network as the input of the model at each time step. We define the context vector $c_t$ as follows:

$$c_t = \psi(h_t, V)$$

where $\psi(\cdot)$ represents the temporal attention model and $c_t$ is the resulting intermediate vector. However, this method compresses all visual features into a single vector, neglecting the temporal relationship between visual features. To solve this problem, we calculate a dynamically weighted sum of the temporal feature vectors according to the attention weights $\alpha_i^t$. We calculate $c_t$ from the visual features $v_i$ and the attention weights $\alpha_i^t$ at each moment as follows:

$$c_t = \frac{1}{n}\sum_{i=1}^{n} v_i\,\alpha_i^t$$

where $\sum_{i=1}^{n} \alpha_i^t = 1$ at time step t. The second-level LSTM takes the output $h_t$ of the first-level LSTM unit, its own output at the previous moment $\hat{h}_{t-1}$ and its previous memory unit $\hat{m}_{t-1}$ as input to obtain the output $\hat{h}_t$ at time t (the hat marks second-level quantities):

$$\hat{h}_t, \hat{m}_t = \mathrm{LSTM}(h_t, \hat{h}_{t-1}, \hat{m}_{t-1})$$


This second-level LSTM is based on the adaptive attention mechanism, and its structure is illustrated in Fig. 2. The adjusted context vector $\hat{c}_t$ calculated by the adaptive attention mechanism is defined as follows (the hat marks quantities produced at the second level):

$$\hat{c}_t = \psi(\hat{h}_t, h_t, c_t)$$

where $\psi$ is the computational function of the adaptive attention mechanism. Computing $\hat{c}_t$ with the adaptive attention mechanism ensures that the decoder hardly uses the visual features of the video frames when generating non-visual words, and uses the most relevant video frame features when generating visual words. In this two-layer LSTM structure, the output of the first LSTM layer is the latent representation of the information that the decoder already knows. By using $h_t$, the attention model of the first level is extended into a computational model that can determine whether to combine visual information to predict the next word:

$$\hat{c}_t = \theta_t c_t + (1 - \theta_t) h_t$$

$$\theta_t = \sigma(W_s h_t)$$

where $W_s$ is a learnable parameter, $\theta_t$ is the adjustment gate at time t, and $\sigma$ is the sigmoid activation function. In the adaptive attention model, the value range of $\theta_t$ is [0, 1]. When $\theta_t = 1$, the model uses visual information to predict the word, whereas when $\theta_t = 0$, the model uses the context information to generate the next word.
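The two attention stages described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the bilinear scoring matrix `W_a`, the use of a weight vector `w_s` that yields a scalar gate, and the requirement that visual features and hidden state share one dimension are all hypothetical choices that the text does not specify.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def temporal_context(V, h_t, W_a):
    """Attention weights over n frames and the 1/n-scaled weighted sum c_t."""
    scores = V @ (W_a @ h_t)          # one relevance score per frame (assumed form)
    alpha = softmax(scores)           # weights sum to 1 over the frames
    c_t = (alpha @ V) / V.shape[0]    # c_t = (1/n) * sum_i alpha_i * v_i
    return c_t, alpha

def adaptive_context(c_t, h_t, theta_t):
    """Blend visual context and language context with the gate theta_t."""
    return theta_t * c_t + (1.0 - theta_t) * h_t

rng = np.random.default_rng(0)
n, d = 6, 4                           # six frames, shared feature dimension
V = rng.normal(size=(n, d))           # frame features v_i
h_t = rng.normal(size=d)              # first-level hidden state
W_a = rng.normal(scale=0.5, size=(d, d))
w_s = rng.normal(scale=0.5, size=d)

c_t, alpha = temporal_context(V, h_t, W_a)
theta_t = sigmoid(w_s @ h_t)          # scalar gate in (0, 1)
c_hat = adaptive_context(c_t, h_t, theta_t)
```

At the extremes the gate reproduces the behaviour described in the text: a value of 1 returns the visual context unchanged, and a value of 0 returns the hidden state, so non-visual words can be generated from the sentence context alone.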





In the MLP layer, we use the hidden state $h_t$ and the adjusted context vector $\hat{c}_t$ to obtain a probability distribution over the set of possible words, and then the output $z_t$. The definition is as follows:

$$P_t = \mathrm{softmax}(U_p\,\phi(W_p[h_t; \hat{c}_t] + b_p) + d)$$


where $U_p$, $W_p$, $b_p$ and $d$ are learnable parameters. Next, we use the softmax function to calculate the probability distribution $p_t$ of words, the specific process is as follows: P = (zt |z 0 g(y) = 0, otherwise.



Moving the threshold value $b_0$ out of the activation function and into the neuron summation as a so-called bias gives the neuron a learnable term that does not depend on the input coming from the previous layer. Therefore, the basic feed-forward workflow of the ANN can be described as follows: when neurons of a shallower layer communicate with a deeper neuron, the latter computes the sum of the inputs × weights

Makeup Challenges in Face Recognition


plus the bias; that weighted sum is passed to the non-linear activation function, which shapes the result and passes it to the next layer, and so on [12]. Another commonly used family of activation functions are the sigmoid functions, which have an 'S' shape: logistic, hyperbolic tangent, and arc-tangent. The logistic activation function is

$$z = g(y) = \frac{e^{y}}{1 + e^{y}} = \frac{1}{e^{-y} + 1}, \quad y = w^T x.$$


The output of this function can be interpreted as the conditional probability of observations over classes, which is related to a Gaussian probability function with an equal covariance [13]. Another problem that the sigmoid functions address is the inconvenient zero or undefined derivative of the threshold activation. If gradient descent algorithms are used to fit an ANN to data, as discussed later, more tangible and easy-to-follow derivatives are desired. The Rectified Linear Unit (ReLU) is yet another popular family of activation functions. They are meant to address the problem of "vanishing gradients" of sigmoid activation functions in DNNs, which occurs when the input y is far enough from zero and the already weak activation function derivative at the output layer gets weaker and weaker at each previous layer [14]:

$$z = g(y) = y^{+} = \max(0, y).$$

2.2 Learning Algorithms and Back-Propagation

The way to 'fit' an ANN model to a real-world process, that is, to make sure that the composite transformation of the perceptron layers and activation functions falls into the neighbourhood of the expected results, is to adjust the learnable parameters of the ANN model, namely its weights $W_{ij}$ for each layer k. In theory, brute-force or random search algorithms can achieve that; however, because they require a prohibitively long time, algorithms that are better informed about the model and the real-life training data are predominantly used. The most popular family of such algorithms is Gradient Descent (GD). The simplest GD algorithm may be presented as follows:

$$W_{t+1} = W_t - \eta \nabla_{W_t} l. \tag{10}$$

Here t is the sequence member or iteration number, $0 < \eta < 1$ is the learning rate, and $\nabla_{W_t} l$ is the gradient of the 'cost' or 'objective' function $l : Z \subset \mathbb{R}^k \to L \subset \mathbb{R}$ with respect to the weight vector W. To find out how closely the ANN transformations fall into the expected neighbourhood of the training data, a metric or distance function is needed; cost functions play this role in the search for possible solutions. Generally, a cost function has multiple extrema, hence multiple local minima. Therefore, a cost function calculated on the whole training data is more prone to producing a sub-optimal solution than a sequential training procedure on subsets, or 'mini-batches', although a minimum found on a mini-batch may still be only a local one. Such a modification of the GD algorithm is called Stochastic Gradient Descent (SGD). In its extreme case, when a mini-batch is a single observation, the procedure is usually called online (incremental) learning. Another variation of


N. Selitskaya et al.

SGD uses gradients from more than one previous sequence member and is called SGD with 'momentum' (SGDM). An efficient algorithm from the SGDM family that has recently become popular, Adaptive Moment Estimation (Adam), is used in the following experiments. Hand in hand with learning algorithms comes the back-propagation algorithm. Having defined a cost function $l(z, \hat{z})$, where $\hat{z}$ is a training observation vector, an activation function $z = g(y)$, and a perceptron layer summation function $y = Wx$, the partial derivatives of the cost function with respect to the activation function results, $\partial l / \partial z_j$, are readily available, where j is the index of a neuron in a layer. Using the chain rule for partial derivatives, it is easy to find the cost function derivative with respect to the summation function of the given j-th neuron:

$$\frac{\partial l}{\partial y_j} = \frac{\partial l}{\partial z_j}\,\frac{\partial z_j}{\partial y_j}. \tag{11}$$

Similarly, the cost function derivative with respect to the input parameters $x_i$, where i is the index of the previous layer output, is:

$$\frac{\partial l}{\partial x_i} = \sum_{j} \frac{\partial l}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}. \tag{12}$$

In vector form:

$$\nabla_{x} l = J\!\left(\frac{\partial y}{\partial x}\right)^{T} \nabla_{y} l, \tag{13}$$

where $J(\partial y / \partial x)^{T}$ is the transposed Jacobian matrix of the partial derivatives of the vector of neuron summation function results $y_j$ with respect to the vector of inputs x, and $\nabla_{y} l = (\partial l/\partial y_1, \ldots, \partial l/\partial y_j, \ldots, \partial l/\partial y_k)^{T}$ is the gradient of the cost function l with respect to the vector y. Similarly, the cost function derivative with respect to the matrix of learnable parameters W, which the learning algorithm needs, can be expressed, if W is flattened to a vector, as:

$$\nabla_{W} l = J\!\left(\frac{\partial y}{\partial W}\right)^{T} \nabla_{y} l. \tag{14}$$

Then the back-propagation process is repeated for the previous layers, considering:

$$\frac{\partial l}{\partial z_i^{k-1}} = \frac{\partial l}{\partial x_i^{k}}, \tag{15}$$

where k is the neuron layer index.
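The update rule (10) and the chain-rule steps of back-propagation can be tried out on a single layer. The sketch below is an illustration only: the logistic activation, the sum-of-squared-errors cost, the layer sizes and the learning rate are assumptions chosen to keep the example small, and the function names are hypothetical.

```python
import numpy as np

def forward(W, x):
    """Summation y = Wx followed by a logistic activation z = g(y)."""
    y = W @ x
    z = 1.0 / (1.0 + np.exp(-y))
    return y, z

def sse(z, z_hat):
    """Sum of squared errors between outputs z and targets z_hat."""
    return float(np.sum((z_hat - z) ** 2))

def backward(W, x, z, z_hat):
    """Chain rule: dl/dz -> dl/dy -> gradients w.r.t. W and the inputs x."""
    dl_dz = -2.0 * (z_hat - z)
    dl_dy = dl_dz * z * (1.0 - z)      # logistic derivative is z(1 - z)
    grad_W = np.outer(dl_dy, x)        # dy_j/dW_ji = x_i
    grad_x = W.T @ dl_dy               # transposed Jacobian of y = Wx
    return grad_W, grad_x

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3, 4))
x = rng.normal(size=4)
z_hat = np.array([1.0, 0.0, 0.0])
eta = 0.1                              # learning rate, 0 < eta < 1

losses = []
for t in range(500):
    _, z = forward(W, x)
    losses.append(sse(z, z_hat))
    grad_W, grad_x = backward(W, x, z, z_hat)
    W = W - eta * grad_W               # plain gradient-descent step
```

The vector `grad_x` is what would be handed to the previous layer as its $\partial l / \partial z$, which is the recursion that back-propagation repeats layer by layer.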

2.3 Cost Functions

The simple and natural cost function based on the Euclidean distance, the Sum of Squared Errors (SSE), is convenient to use with linear transformations because it allows the optimisation problem to be solved analytically and, even if a GD algorithm is used, produces an easy-to-follow gradient with one-root partial derivatives. However, if the logistic sigmoid activation function is used, SSE causes problems. The same holds for the 'Softmax' generalisation of the logistic activation function applied to the multi-class problem:

$$z = g(y_j) = \frac{e^{y_j}}{\sum_{j} e^{y_j}}, \quad y = Wx. \tag{16}$$


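The partial derivatives of the SSE cost function with respect to $y_j$, when the logistic sigmoid activation function is applied to it, are polynomials of the third degree in $z_j$, which have three roots:

$$\frac{\partial l}{\partial y_j} = \frac{\partial l}{\partial z_j}\,\frac{\partial z_j}{\partial y_j} = -2(\hat{z}_j - z_j)\,z_j(1 - z_j). \tag{17}$$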


Such a gradient $\nabla_{y} l$ vanishes at several points, which is inconvenient even for GD algorithms. To make the partial derivatives have a single root, one could require:

$$\frac{\partial l}{\partial z_j} = -\frac{(\hat{z}_j - z_j)}{z_j(1 - z_j)}. \tag{18}$$


Integration then yields the 'Cross-entropy' function which, being positive and becoming zero when $z_j = \hat{z}_j$ (for binary targets $\hat{z}_j \in \{0, 1\}$), is suitable for the cost function role with logistic-type activation functions:

$$l(z) = \int \frac{\partial l}{\partial z_j}\,dz_j = -\big(\hat{z}_j \ln z_j + (1 - \hat{z}_j)\ln(1 - z_j)\big). \tag{19}$$
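The practical difference between the two cost functions can be seen by evaluating their gradients with respect to the summation output y for a logistic neuron. The sketch below follows the sign convention of the cross-entropy in Eq. (19), for which the gradient reduces to $z - \hat{z}$; the function names and the sample inputs are chosen only for illustration.

```python
import numpy as np

def logistic(y):
    return 1.0 / (1.0 + np.exp(-y))

def cross_entropy(y, z_hat):
    """Cross-entropy cost of a logistic neuron with summation output y."""
    z = logistic(y)
    return -(z_hat * np.log(z) + (1.0 - z_hat) * np.log(1.0 - z))

def grad_sse(y, z_hat):
    """dl/dy for the SSE cost: cubic in z, vanishes at z = z_hat, 0 and 1."""
    z = logistic(y)
    return -2.0 * (z_hat - z) * z * (1.0 - z)

def grad_ce(y, z_hat):
    """dl/dy for the cross-entropy cost: a single root at z = z_hat."""
    return logistic(y) - z_hat

y = np.array([-8.0, 0.0, 8.0])   # strongly wrong, undecided, strongly right
z_hat = 1.0
g_sse = grad_sse(y, z_hat)
g_ce = grad_ce(y, z_hat)
```

For the strongly wrong input (y = -8 with target 1) the SSE gradient is almost zero because the sigmoid is saturated, so learning stalls exactly where the error is largest, whereas the cross-entropy gradient stays close to -1.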

2.4 Convolutional Neural Networks

When general-purpose DNNs are used for image recognition, the necessity to address such problems as the high dimensionality of the input and the weak control over the feature selection and processing algorithms led to the development of more specific DNN architectures. One of the popular DNN architectures for image and signal recognition is the Convolutional Neural Network (CNN). CNN-specific techniques include the use of local receptive fields: patches of neighbouring pixels that are connected to a few neurons of the next layer. Such an architecture encourages the CNN to extract locally concentrated features and could be implemented using the Hadamard product of the weight matrix and the receptive field masks, $y = (M \circ W)x$. As in feature extraction techniques, signals from the local receptive field can be convolved with kernel functions. The convolution operator $\mathrm{conv} : \mathbb{Z}^2 \to \mathbb{R}$ of the pixel intensity u of image I at the (i, j)-th pixel location with the filter kernel v can be defined as follows:

$$x = \mathrm{conv}(i, j) = (u * v)(i, j) = \sum_{(k,l) \in I} u(i - k, j - l)\,v(k, l), \tag{20}$$

where $u : I \subset \mathbb{Z}^2 \to \mathbb{R}$, $v : I \subset \mathbb{Z}^2 \to \mathbb{R}$, $\forall (i, j) \in \mathbb{Z}^2$, $\forall (k, l) \in \mathbb{Z}^2$, $\forall x \in \mathbb{R}$.
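The convolution sum of Eq. (20) can be written out directly as nested loops. The sketch below is deliberately naive and makes two assumptions that the text leaves open: pixels outside the image are treated as zero, and the kernel indices start at (0, 0).

```python
import numpy as np

def conv2d(u, v):
    """Discrete 2D convolution x(i, j) = sum_{k,l} u(i-k, j-l) * v(k, l)."""
    H, W = u.shape
    kh, kw = v.shape
    x = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for k in range(kh):
                for l in range(kw):
                    # Skip terms that fall outside the image (zero padding).
                    if 0 <= i - k < H and 0 <= j - l < W:
                        x[i, j] += u[i - k, j - l] * v[k, l]
    return x

u = np.arange(12, dtype=float).reshape(3, 4)   # toy 3 x 4 "image"
shift = np.array([[0.0, 1.0]])                 # kernel with v(0, 1) = 1
x = conv2d(u, shift)                           # each row shifted right by one
```

A kernel whose only non-zero entry is v(0, 1) = 1 shifts the image one column to the right, which makes the index convention u(i - k, j - l) easy to verify by eye.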



To process local features, shared learnable parameters can be used instead of the full weight matrix, exploiting a Hadamard product with a weight vector, $y = (w \circ U)^{T} x$. To reduce dimensionality, the deep layers average the neighbouring nodes or apply a pooling function, dropping or down-sampling the number of outputs, according to [15,16].


Engineered Features Algorithms

Feature extraction techniques aim to improve the prediction ability of Machine Learning and pattern recognition methods, and are widely discussed in the literature related to applications in engineering [17], medical decision making [18–22], and signal analysis [23–27].

3.1 Bag of Features

Bag of Features (BOF) is an extension of the technique known as Bag of Words. It uses a feature dictionary in order to represent patterns of interest in a given image as a multi-set (or bag) of ordered pairs of features and the numbers of their occurrences. The BOF model generates a compact representation of the large-scale image features in which the spatial relations between features are usually discarded [28]. A BOF dictionary of a manageable size can be constructed from large-scale features extracted from the training set images by clustering similar features, as described in [28]. Features extracted from images are associated with the closest dictionary term using Nearest Neighbours or similar techniques in the feature parameter space. Matches between the image and the dictionary features are normalised and evaluated. The result is a vector representation of the given j-th image $I_j$ in the dictionary feature space as follows:

$$I_j = \{a_{1j} v_1, a_{2j} v_2, \ldots, a_{nj} v_n\}, \tag{21}$$

where $v_i$ are the feature dictionary basis vectors, and $a_{ij}$ are the numbers of occurrences in the image $I_j$ normalised by the total count of features in the image. The BOF technique offers freedom in the selection and representation of the image features. However, the choice of BOF parameters is crucial for achieving high accuracy, robustness, and speed of recognition, as described in [29,30].
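The encoding step of the BOF representation can be illustrated as follows. The sketch assumes that a dictionary has already been built (for instance by clustering) and only shows nearest-neighbour assignment and normalisation; the two-dimensional toy features and the function name are hypothetical.

```python
import numpy as np

def bof_histogram(features, dictionary):
    """Normalised occurrence counts a_ij of dictionary terms in one image."""
    # Squared Euclidean distance from every feature to every dictionary term.
    d2 = ((features[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                      # closest term per feature
    counts = np.bincount(nearest, minlength=len(dictionary))
    return counts / counts.sum()                     # normalise by feature count

dictionary = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # basis vectors v_i
features = np.array([[0.1, 0.0], [0.9, 0.1], [1.1, -0.1], [0.0, 0.8]])
a = bof_histogram(features, dictionary)              # -> [0.25, 0.5, 0.25]
```

The resulting vector has one entry per dictionary term and sums to one, so images with different numbers of detected features remain comparable.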

3.2 Speeded-Up Robust Feature Detector and Descriptor

The Speeded-Up Robust Features (SURF) method is a robust and fast gradient-based sparse algorithm for local detection, comparison and representation of invariant image features [31,32]. SURF, described in [33], is a further development of the exhaustive feature detection scale-invariant feature transform (SIFT) Engineered Features (EF) algorithm proposed in [34]. SIFT provides a computationally efficient reduction of feature space dimensionality as well as aggregation of neighbouring pixels, keeping or even improving the recognition accuracy [35]. The SURF-SIFT family of algorithms combines two phases: (i) image feature detection and (ii) feature description. The feature detection phase convolves the Gaussian filter, which smooths the feature noise, with the second-order gradient operator to extract edges, corners, blobs, and other candidate structures; among those, the most promising ones are selected as the maximal extrema that also pass the optional noise filters. The process is repeated for multiple image scales. The intensity gradient maximum and its location from an adjacent plane subspace of the scale space are interpolated to extract robust, persistent points of interest. For the rotation-invariant variations of the algorithms, the orientations of the features producing the best gradient extrema are also detected [33]. To avoid sub-sampling recalculations at various scales, the SURF algorithm uses the integral image, which is calculated by summing up the intensities of all pixels in the box at the current point, so that the intensity of any rectangular area can be retrieved in constant time. Instead of the Gaussian filters, approximating box filters are used. Following [33,35], the scale-normalised determinant of the Hessian matrix is used to estimate the second-order gradient:

$$\det(H_{approx}) = D_{xx} D_{yy} - (0.9\,D_{xy})^{2}, \tag{22}$$


where $D_{xx}$ is the box filter approximation of the second-order Gaussian derivative at the i-th pixel location $\mathbf{x}_i = (x_i, y_i)$ with the scale $\sigma$ of the image I. The feature description phase builds feature descriptors that characterise the neighbourhood of each interest point at all scales, together with their dominant directions. The convolution of the Gaussian filter and the first-order gradient measures in various directions for the neighbouring grid components comprise the feature space basis [33]. In particular, the SURF algorithm estimates the Gaussian and Haar wavelet convolution using box filters in a neighbourhood radius defined by the scale, over a number of rotational steps. The orientation that yields the maximum linear sum of the different Haar components is chosen as dominant. Finally, in the chosen direction a 64-dimensional descriptor is built using the Haar components and their absolute values, calculated for each cell of the 4 × 4 grid of the t-th region of the image I:

$$v_t = \Big\{\ldots, \sum dx_i, \sum dy_i, \sum |dx_i|, \sum |dy_i|, \ldots\Big\}, \tag{23}$$

where $\sum dx_i$ is the sum of the Haar wavelet detail responses in the horizontal direction over the pixels in the i-th cell of the region, $i \in \{1, \ldots, 16\}$, $\sum |dx_i|$ is the sum of the absolute values of the wavelet responses, and y denotes the vertical direction of the Haar wavelet responses, as described in [33].
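The constant-time box sums that SURF relies on come from the integral image mentioned above. The following NumPy sketch shows the idea only; the zero-padded layout and the half-open index convention are implementation choices made here, not details taken from [33].

```python
import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns, padded with a zero row and column."""
    return np.pad(img.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from four look-ups, independent of box size."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(20, dtype=float).reshape(4, 5)   # toy 4 x 5 intensity image
ii = integral_image(img)
s = box_sum(ii, 1, 1, 3, 4)                      # rows 1..2, columns 1..3
```

Because every box filter response reduces to a handful of such look-ups, the filter size can grow with the scale without increasing the cost per pixel.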


Data Set

The BookClub artistic makeup data set contains images of 21 subjects. Each subject's data contain a series of photos with no makeup, various makeups, and images with other obstacles for facial recognition, such as wigs, glasses, jewellery, or various types of head-dress. Each photo-session series contains circa 168 RAW images of up to 4288 × 2848 pixels (available by request) of six basic emotion expressions (sadness, happiness, surprise, fear, anger, disgust), a neutral expression, and closed-eyes shots, taken with seven head rotations (en face, left, right, up and down tilted, and 3-axis up-left-counter-clockwise and up-right-clockwise rotations) at three exposure times against an off-white background. The default downloadable format is JPEG at 1072 × 712 resolution. The subjects' ages vary from their twenties to their sixties. The subjects are predominantly Caucasian, with some Asian. Gender is approximately evenly split between subcategories. The photos were taken over the course of two months, with some subjects posing at multiple sessions at intervals of weeks, in various clothing and with changing hairstyles [10].

Subj.1, Sess.MK2; Subj.13, Sess.NM1; Subj.10, Sess.NM1

Subj.1, Sess.MK3; Subj.5, Sess.NM1; Subj.10, Sess.NM1

Subj.5, Sess.MK2; Subj.1, Sess.NM1; Subj.3, Sess.NM1

Fig. 1. Sample images from the BookClub data set sessions (left) that were misidentified by both AlexNet (center) and BoF/SURF (right) models. Part a.



The images used in this research are kept in colour, downsized, and compressed into JPEG format with dimensions of 48 × 48 pixels. The downsizing was done because of computational restrictions, to keep processing times reasonable. However, observations made on the small-size images are extendable to larger sizes. For the computational experiments, the Keras library with the TensorFlow back-end was used. The ANN consists of four sequential groups of layers: Gaussian noise, convolution with ReLU activation functions, normalization, pooling and dropout layers. It is topped with fully connected layers, the softmax activation function of the last layer and the cross-entropy loss function. The "Adam" learning algorithm with a 0.001 coefficient, mini-batch size 32 and 100 epochs is used, see Listing 1.1.

Listing 1.1. ANN model described in the Keras library format

# Imports assumed for the listing (not shown in the original).
import tensorflow as tf
from tensorflow.keras import losses
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import (GaussianNoise, Conv2D, BatchNormalization,
    MaxPool2D, Dropout, GlobalAveragePooling2D, Dense)

# n_classes (the number of subjects) must be defined before this point.
model = Sequential()
# Group 1: 48 x 48 colour input, 32 filters
model.add(GaussianNoise(0.1, input_shape=(48, 48, 3)))
model.add(Conv2D(filters=32, kernel_size=3, activation=tf.nn.relu))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(GaussianNoise(0.1))
model.add(BatchNormalization())
model.add(Dropout(rate=0.1))
# Group 2: 64 filters
model.add(Conv2D(filters=64, kernel_size=3, activation=tf.nn.relu))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(rate=0.1))
model.add(GaussianNoise(0.1))
# Group 3: 128 filters
model.add(Conv2D(filters=128, kernel_size=3, activation=tf.nn.relu))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Dropout(rate=0.1))
model.add(GaussianNoise(0.1))
# Group 4: 128 filters
model.add(Conv2D(filters=128, kernel_size=3, activation=tf.nn.relu))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(2, 2)))
# Classifier head
model.add(GlobalAveragePooling2D())
model.add(Dense(1000, activation=tf.nn.relu))
model.add(Dense(n_classes, activation=tf.nn.softmax))
# Recent Keras releases name the optimizer argument learning_rate instead of lr.
model.compile(optimizer=Adam(lr=0.001),
              loss=losses.sparse_categorical_crossentropy,
              metrics=['accuracy'])



To calculate mean accuracy and standard deviation, k-fold cross-validation was applied: a) to the pool of all available images of the data set, b) to the pool of photo-sessions of the subjects that had enough sessions to form the session pools, c) to the pool of all images of the subjects from the previous item. For the minimal, 3-fold validation on the session level, subjects had to have at least three sessions, which is true only for 12 out of 21 subjects. The image-level cross-validation does not have such a limitation on the number of folds; nevertheless, the number of folds was kept not significantly different from the session-level number, k = 5. To verify the low or zero accuracy for some makeup types obtained by the Listing 1.1 model, the well-known AlexNet implementation in MATLAB [36] was retrained on all non-makeup images of the BookClub data set in the higher resolution version, 1072 × 712, converted to the expected 256 × 256. Also, the same input data (1072 × 712, however grey-scaled) were run through MATLAB's Bag of Features (BoF) model [37] with SURF descriptors [38].
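The difference between image-level and session-level folds is that, in the latter, all images of a photo-session must land in the same fold, so the test fold contains makeup types never seen in training. A minimal sketch of such a split is given below; the function name, the session labels and the fold count are hypothetical, and the authors' actual splitting code is not shown in the paper.

```python
import numpy as np

def session_folds(session_ids, k, seed=0):
    """Yield (train_idx, test_idx) pairs that keep every session in one fold."""
    sessions = np.unique(session_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(sessions)                         # random assignment of sessions
    for held_out in np.array_split(sessions, k):
        test_mask = np.isin(session_ids, held_out)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

# Toy subject with three sessions of four images each (labels are made up).
session_ids = np.array(["NM1"] * 4 + ["MK1"] * 4 + ["MK2"] * 4)
folds = list(session_folds(session_ids, k=3))
```

With three sessions and k = 3 each fold tests on exactly one unseen session, which mirrors the minimal 3-fold session-level setting described above.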



As expected, despite the small size to which the images were scaled and the not very deep ANN, the mean face recognition accuracy of the model trained on samples from all photo-sessions of all subjects is quite high at 92%, and higher (up to 99.9%) for more "lucky" distributions of the images between the training and test sets. However, for a more realistic photo-session-level training set composition, when the type of makeup or other visual obstacles in the test data is not known ahead of time, accuracy was a sobering 46%, see Table 1. Examples of some types of makeup that scored low recognition rates with both the Listing 1.1 model and AlexNet are shown in Tables 2 and 3 and Fig. 1. In addition to the expected difficulty of recognising types of makeup with contrasting black-white areas with sharp edges, camouflage patterns, and complicated paintings, relatively moderate makeups aimed at changing the shape of nostrils, eye, brow, and lip cuts and angles, and other characteristic facial lines, folds and ridges, especially when accompanied by a significant change of hairstyle or the use of wigs, produced quite impressive recognition confusion. In addition, the high wrong-guess ratios of some makeup types demonstrate that un-targeted or even targeted spoofing of a wrong identity is possible. As expected, the BoF/SURF algorithms scored worse accuracy than the ANN ones, although, overall, the high- or low-accuracy makeup types were common to both types of algorithms for most subjects and sessions. However, some makeup types that caused significant difficulty for the BoF algorithms were easily recognised by the ANN algorithms, see Table 5. ANN structures were able to successfully separate the sub-spaces of the artificial high-contrast features, which caught the attention of the SURF descriptors, from those of the more subtle natural features. The BoF/SURF model was easily distracted by small-scale and high-contrast details on the clothing. On the other hand, there were a few sessions, see Fig. 2, that caused difficulty for the AlexNet model, while the BoF/SURF model successfully identified the subjects, sometimes even with quite high accuracy, see Table 4. In particular, AlexNet was struggling with wigs and sunglasses, which are not shown here and which dramatically reduced recognition accuracy.

Table 1. Test median accuracy and standard deviation for k-fold cross validation of the model-1.

Folds of    Subjects    K-folds    Accuracy
            21                     0.9221 ± 0.0463
Sessions    12                     0.4922 ± 0.1093
                                   0.8847 ± 0.0516

Subj.1, Sess.MK7; Subj.6, Sess.NM1

Subj.7, Sess.MK1; Subj.5, Sess.NM1

Fig. 2. Sample images from the BookClub data set sessions (left) that were misidentified by AlexNet (right) model.

Table 2. Test accuracy, best (wrong) guess and the ratio of the images identified as a wrong guess for the retrained AlexNet model on the non-makeup BookClub images. Subjects and sessions misidentified by both AlexNet and BoF/SURF models.

Session    CNN accuracy    CNN guess    CNN guess ratio
S10MK2     0.0000
S10MK3     0.0000
S10MK4     0.0000
S12MK1     0.3448
S12MK2     0.0000
S20MK1     0.0060





Table 3. Test accuracy, best (wrong) guess and the ratio of the images identified as a wrong guess for the BoF/SURF model trained on the non-makeup BookClub images. Subjects and sessions misidentified by both AlexNet and BoF/SURF models.

Session    SURF accuracy    SURF guess    SURF guess ratio
S10MK2     0.0000
S10MK3     0.0000
S10MK4     0.0000
S12MK1     0.0057
S12MK2     0.01807
S20MK1     0.04819



Table 4. Test accuracy, best (wrong) guess and the ratio of the images identified as a wrong guess for the AlexNet model and correct accuracy for the BoF/SURF model, trained on the non-makeup BookClub images. Subjects and sessions misidentified by the AlexNet model, but correctly identified by the BoF/SURF model.

Session    CNN accuracy    CNN guess    CNN guess ratio    SURF accuracy
S1MK7      0.0000
S7MK1      0.0000




Table 5. Test accuracy, best (wrong) guess and the ratio of the images identified as a wrong guess for the BoF/SURF model and correct accuracy for the AlexNet model, trained on the non-makeup BookClub images. Subjects and sessions misidentified by the BoF/SURF model, but correctly identified by the AlexNet model.

Session    SURF accuracy    SURF guess    SURF guess ratio    CNN accuracy
S11MK2     0.0000

Among those makeup types that were easily recognised by both AlexNet and BoF/SURF models with high accuracy, see Table 6, there were realistic-looking makeups, and makeups in “kabuki” style and other makeups that used white and light pigments. Makeup types that produced expected result with nothing outstanding in any way are not shown here.



Table 6. Test accuracy for the BoF/SURF and AlexNet models, trained on the non-makeup BookClub images. Subjects and sessions correctly identified by both BoF/SURF and AlexNet models.

Session    SURF accuracy    CNN accuracy
S13MK1     0.8855
S15MK1     0.9515
S16MK1     0.9880
S17MK1     0.8182


Discussion and Future Work

Computational experiments conducted on the novel BookClub data set, which features advanced artistic makeup types and a large variety of training images, have confirmed the expectations of previous investigations conducted on less sophisticated data sets: under real-life conditions, unlike in ideal laboratory setups, advanced makeup and other occlusions are capable not only of successfully disguising a person's identity from machine learning algorithms, but also of spoofing a wrong identification. One cannot reasonably use a brute-force approach of cataloguing all possible makeup types in the training sets: even for state-of-the-art ANN architectures, artistic creativity may produce an unrecognisable pattern. This is a practically useful consideration for those who legitimately seek to protect privacy from the invasive practices of commercial and government entities, as well as for public safety agencies entrusted with crime protection. Despite the rapid progress of ANN algorithms in face recognition, which allows them to identify faces with above-human accuracy in ideal laboratory conditions, in real-life edge-case conditions, in particular with makeup and other occlusions, combining ANN models in ensembles with the older, less effective EF ones, which are prone to a different type of disguise attack, may provide the benefit of a safety net. If an ANN model gives less than perfect results, it may make sense to switch to EF models. Future work on the data set may consist of including more subjects with a more even distribution of race, age, gender, face, skin and hair type, and reproducing the same makeup type on different subjects. In terms of ANN architecture, future work may explore model structures that perform a canonical mapping of the original image pixel space into factor sub-spaces of the natural human features, which could be multiple and specific to race, age, gender, and so on, and into sub-spaces specific to makeup and occlusion types.

References 1. Chen, C., Dantcheva, A., Swearingen, T., Ross, A.: Spoofing faces using makeup: an investigative study. In: 2017 IEEE International Conference on Identity, Security and Behavior Analysis (ISBA), pp. 1–8, February 2017 2. Eckert, M., Kose, N., Dugelay, J.: Facial cosmetics database and impact analysis on automatic face recognition. In: 2013 IEEE 15th International Workshop on Multimedia Signal Processing (MMSP), pp. 434–439, September 2013 3. Feng, R., Prabhakaran, B.: Facilitating fashion camouflage art. In: Proceedings of the 21st ACM International Conference on Multimedia (MM 2013), New York, NY, USA, pp. 793–802. ACM (2013) 4. Face Matching Data Set|Biometric Data|CyberExtruder, December 2019. Accessed 8 Dec 2019 5. Kushwaha, V., Singh, M., Singh, R., Vatsa, M., Ratha, N., Chellappa, R.: Disguised faces in the wild. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1–18, June 2018 6. Chen, C., Dantcheva, A., Ross, A.: Automatic facial makeup detection with application in face recognition. In: 2013 International Conference on Biometrics (ICB), pp. 1–8 (2013) 7. Dantcheva, A., Chen, C., Ross, A.: Can facial cosmetics affect the matching accuracy of face recognition systems? In: 2012 IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems (BTAS), pp. 391–398, September 2012 8. Colombo, A., Cusano, C., Schettini, R.: UMB-DB: a database of partially occluded 3D faces. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 2113–2119, November 2011 9. Setty, S., Husain, M., Beham, P., Gudavalli, J., Kandasamy, M., Vaddi, R., Hemadri, V., Karure, J.C., Raju, R., Rajan, B., Kumar, V., Jawahar, C.V.: Indian movie face database: a benchmark for face recognition under wide variations. In: 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pp. 1–5, December 2013 10. 
Selitskaya, Koloeridi, Sielitsky: Bookclub data set, June 2019. Accessed 15 June 2019 11. Rosenblatt, F.: The perceptron: a probabilistic model for information storage and organization in the brain. Psychol. Rev. 65(6), 386 (1958) 12. Ghosn, J., Bengio, Y.: Bias learning, knowledge sharing. IEEE Trans. Neural Netw. 14(4), 748–765 (2003) 13. Alan Julian Izenman, Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer Publishing Company, Incorporated (2008) 14. Agarap, A.F.M.: Deep learning using rectified linear units (relu). CoRR, vol. abs/1803.08375 (2018) 15. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)



16. LeCun, Y., Bengio, Y., Hinton, G.E.: Deep learning. Nature 521(7553), 436–444 (2015) 17. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian learning of models for estimating uncertainty in alert systems: application to air traffic conflict avoidance. Integr. Comput.-Aided Eng. 26, 1–17 (2018) 18. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian averaging over decision tree models: an application for estimating uncertainty in trauma severity scoring. Int. J. Med. Inform. 112, 6–14 (2018) 19. Schetinin, V., Jakaite, L., Krzanowski, W.: Bayesian averaging over decision tree models for trauma severity scoring. Artif. Intell. Med. 84, 139–145 (2018) 20. Schetinin, V., Jakaite, L., Jakaitis, J., Krzanowski, W.: Bayesian decision trees for predicting survival of patients: a study on the US national trauma data bank. Comput. Methods Programs Biomed. 111(3), 602–612 (2013) 21. Schetinin, V., Jakaite, L., Krzanowski, W.J.: Prediction of survival probabilities with Bayesian decision trees. Expert Syst. Appl. 40(14), 5466–5476 (2013) 22. Jakaite, L., Schetinin, V.: Feature selection for Bayesian evaluation of trauma death risk. In: 14th Nordic-Baltic Conference on Biomedical Engineering and Medical Physics: NBC 2008 Riga, Latvia, pp. 123–126. Springer Berlin Heidelberg (2008) 23. Schetinin, V., Jakaite, L.: Classification of newborn EEG maturity with Bayesian averaging over decision trees. Expert Syst. Appl. 39(10), 9340–9347 (2012) 24. Jakaite, L., Schetinin, V., Maple, C.: Bayesian assessment of newborn brain maturity from two-channel sleep electroencephalograms. Comput. Math. Methods Med. 2012, 1–7 (2012) 25. Jakaite, L., Schetinin, V., Schult, J.: Feature extraction from electroencephalograms for Bayesian assessment of newborn brain maturity. In: 24th International Symposium on Computer-Based Medical Systems (CBMS), Bristol, pp. 1–6, June 2011 26. 
Jakaite, L., Schetinin, V., Maple, C., Schult, J.: Bayesian decision trees for EEG assessment of newborn brain maturity. In: The 10th Annual Workshop on Computational Intelligence (UKCI 2010) (2010) 27. Nyah, N., Jakaite, L., Schetinin, V., Sant, P., Aggoun, A.: Learning polynomial neural networks of a near-optimal connectivity for detecting abnormal patterns in biometric data. In: 2016 SAI Computing Conference (SAI), pp. 409–413, July 2016 28. O’Hara, S., Draper, B.A.: Introduction to the bag of features paradigm for image classification and retrieval. arXiv, vol. abs/1101.3354 (2011) 29. Kalsum, T., Anwar, S.M., Majid, M., Khan, B., Ali, S.M.: Emotion recognition from facial expressions using hybrid feature descriptors. IET Image Process. 12(6), 1004–1012 (2018) 30. Chen, B.C., Chen, Y.Y., Kuo, Y.H., Ngo, T.D., Le, D.D., Satoh, S., Hsu, W.H.: Scalable face track retrieval in video archives using bag-of-faces sparse representation. IEEE Trans. Circu. Syst. Video Technol. 27(7), 1595–1603 (2017) 31. Cai, A., Du, G., Su, F.: Face recognition using SURF features. In: Proceedings SPIE, vol. 7496 (2009) 32. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008) 33. Bay, H., Tuytelaars, T., Van Gool, L.: Surf: speeded up robust features. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) Computer Vision - ECCV 2006, pp. 404–417. Springer, Heidelberg (2006) 34. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)



35. Oyallon, E., Rabin, J.: An analysis of the SURF method. Image Process. On Line 5, 176–218 (2015) 36. Transfer Learning Using AlexNet Example, December 2019. Accessed 8 December 2019 37. Image Classification with Bag of Visual Words - MATLAB & Simulink, December 2019. Accessed 15 Dec 2019 38. Feature Extraction Using SURF, December 2019. Accessed 15 Dec 2019

The Effects of Social Issues and Human Factors on the Reliability of Biometric Systems: A Review

Mohammadreza Azimi1(B) and Andrzej Pacut2

1 Institute of Control and Computation Engineering, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
m r [email protected]
2 Biometrics Laboratory, Research and Academic Computer Network (NASK), Warsaw, Poland

Abstract. This study cautions against the widespread use of biometric modalities that perform well only under optimal conditions, and highlights the limitations of biometric technology. Biometrics is defined by what we are, as opposed to what we have (e.g. smart cards) or what we know (e.g. passwords). Today's smartphones are equipped with biometric technology. The main weakness of biometric solutions is their degraded performance outside ideal conditions. In real-life scenarios, the reliability of biometric recognition systems can be affected by various social factors covering user-related parameters, including physiological, behavioral and environmental factors. In this review, a bibliographical approach is used to describe the effects of human factors and the influence of social problems on the reliability of biometric systems.

Keywords: Social problems · Human factor of system · Biometrics system reliability · Biometrics · Reliability


Biometric methodology, in which the physical characteristics include the iris, fingerprints, retina, hand geometry and face, and the behavioral characteristics include voice, online signature, gait and keystroke pattern, is very convenient in comparison with PINs, passwords and other tokens, which are in most cases hard to remember and easily forgotten. Howard and Etter [1] hypothesized that factors such as ethnicity, gender and even eye color can play a significant role in the expected false rejection rate for individuals across the population. They concluded that Asian and African American individuals with brown eyes are the groups most likely to be mistakenly rejected as true users by biometric recognition systems. Daugman and Downing [2] generated 316,250 entire distributions of IrisCode impostor scores, each distribution obtained by comparing one iris against hundreds of thousands of others in a database including individuals spanning 152 nationalities. However, the limits of the various kinds of biometric technology must be highlighted. In this report, we briefly review the most important and recent works concerned with the effects of human factors on the performance of biometric systems. For this purpose we review the state of the art on the effects of the passing of time, mood variation and behavioral factors on the performance of biometric recognition systems. We also highlight the limits of biometric technologies under the influence of diseases.

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 103–110, 2021.


The Passing of Time and Aging Effects

Absolute age strongly influences how well a user can authenticate him or herself with a biometric recognition system. Facial aging is mainly attributed to bone movement and growth, and to skin-related deformations. Skin-related effects are associated with the introduction of wrinkles caused by reduced skin elasticity and a reduction in muscle strength [3]. The biological age of an individual can therefore change the biometric data. Pupil dilation cannot directly influence the unique iris features themselves. There is unconfirmed evidence that iris biometric features exhibit significant differences as a function of age. However, it can confidently be claimed that the eye's capability for pupil dilation decreases with age [4]. The performance of online signature and offline handwriting recognition systems is highly dependent on the age of the user, because physiological age affects handwriting. According to [5], pen dynamics (e.g., velocity, acceleration, pen lifts) decrease in magnitude in older users. In [6], the relationship between the equal error rate of a fingerprint recognition system and the physical aging of individuals was investigated. According to Madry-Pronobis [7], speech signals can provide reliable traces, especially for young adults, while less reliable traces are found in elderly users. In [8], the age dependency of EEG brain signals was investigated. For this purpose, electroencephalographic (EEG) resting-state activity was acquired in 40 healthy subjects (aged 16–85). The authors concluded that the complexity of neuronal electric activity changes across the life span of an individual. In the paper by Faundez-Zanuy et al. [9], it was found that the false acceptance rate increases with the user's age. Older users were found to be more likely to be mistakenly verified as a genuine user.
They reported that the performance of handwriting-based biometric systems degrades as age increases.

The passing of time is one of the factors that can change the biometric data and reduce the performance of a biometric system. The effect of the time lapse between the first session, when the user enrolled, and a later recognition event, when the user wants to authenticate, is known as "template aging". It has been shown to affect the decision process by changing the matching score between new data and the biometric reference template taken from the user some weeks, months or years before. In other words, the similarity between different samples of the same user decreases as a function of time. Short-term aging can also influence the reliability of biometric systems, and is discussed in this section as well. Erbilek and Fairhurst [10] investigated aging in the iris biometric. The results showed a significant difference in the systems' performance across time. In a study by Yoon and Jain [11], fingerprint match scores were analyzed. Longitudinal fingerprint records of 15,597 subjects were sampled from an operational fingerprint database in which each individual had at least five 10-print records over a minimum time span of five years. The study found that genuine match scores tend to decrease significantly as the time interval between two fingerprints increases. The authors also reported that fingerprint recognition accuracy at operational settings nevertheless tends to remain stable as the time interval increases up to 12 years (the maximum time span in the dataset). Best-Rowden and Jain [12] conducted a numerical experiment using a longitudinal database of 147,784 operational mug shots of 18,007 repeat criminal offenders, where each subject has at least five face images acquired over a minimum of five years. Longitudinal analysis of the scores showed that, despite decreasing genuine scores over time, the average subject can still be correctly verified at a false accept rate (FAR) of 0.01% across all 16 years of elapsed time in the database. Czajka [13] studied the iris aging effect on interclass matching scores using four different matchers. The database contained samples collected during three sessions: the first in 2003, the second in 2007 and the last in 2013. Using this database, the author was able to investigate aging in iris biometrics over both the mid-term (up to two years) and the long-term (from five to nine years) time span.
Czajka reported a degradation of around 14% in the average genuine scores because of template aging. A statistically significant difference with respect to the time between the examined samples was also shown. Template aging in three-dimensional facial biometrics was studied in [14]. The database contains three-dimensional and two-dimensional face pictures taken from 16 participants. The authors investigated the effects of short-term and long-term aging on the performance of a face recognition system. Komogortsev et al. [15] presented a template aging study of eye-movement biometrics, considering three distinct biometric techniques on multiple stimuli and eye-tracking systems. The short-term to mid-term aging effects were examined over two weeks and seven months. Galbally et al. [16] investigated the aging effect on dynamic signature systems using a consistent and reproducible database over time. Three different systems, representing the most popular approaches in signature recognition, were used in the experiments, which demonstrated the degradation suffered by this trait with the passing of time (over a 15-month time span). Maiorana and Campisi [17] reported and discussed the results of experimental tests conducted on a database comprising 45 subjects whose EEG signals were collected during five or six distinct sessions spanning a total period of three years, using four different elicitation protocols. Czajka et al. [18] investigated the diurnal change of the iris, concluding that daily fluctuations have an impact on iris sample similarity scores.
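To make the link between falling genuine scores and recognition errors concrete, the sketch below computes the false accept rate (FAR), the false reject rate (FRR) and an approximate equal error rate from two score lists. It is a minimal illustration, not the protocol of any study cited above: the integer similarity scores, the fixed threshold of 50 and the uniform 25-point drop used to mimic template aging are all hypothetical assumptions.

```python
def far_frr(genuine, impostor, threshold):
    """Return (FAR, FRR) for similarity scores; a score >= threshold is accepted."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as the threshold."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]  # (approximate EER, threshold where it occurs)


# Hypothetical scores on a 0-100 scale; not taken from any cited database.
impostor = [10, 15, 20, 25, 30, 35, 40, 45]
genuine_enrol = [60, 65, 70, 75, 80, 85, 90, 95]
# Assumption: template aging modelled as a uniform drop in genuine scores.
genuine_aged = [s - 25 for s in genuine_enrol]

threshold = 50
far_before, frr_before = far_frr(genuine_enrol, impostor, threshold)
far_after, frr_after = far_frr(genuine_aged, impostor, threshold)
eer_before, _ = equal_error_rate(genuine_enrol, impostor)
eer_after, _ = equal_error_rate(genuine_aged, impostor)

print(f"FAR {far_before:.3f} -> {far_after:.3f}")
print(f"FRR {frr_before:.3f} -> {frr_after:.3f}")
print(f"EER {eer_before:.4f} -> {eer_after:.4f}")
```

In this toy model the impostor scores are unchanged, so the FAR at the fixed threshold stays the same while the FRR rises: the system keeps rejecting impostors but starts to reject true users more often, which is the pattern the longitudinal studies above describe for decreasing genuine scores.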




Mood Variation

During the course of a day, people may experience considerable changes in mood. According to previous studies, speaker recognition systems perform well only under optimal conditions (in completely neutral talking environments), whereas parameters such as mood variation [19] can significantly challenge the reliability of both text-independent and text-dependent speaker recognition systems. An investigation into the effect of emotional state upon text-independent speaker identification was presented in [20]. The effects of variations in facial mood expressions on the reliability of facial recognition systems were investigated by Azimi and Pacut [21]. Ferdinando et al. [22] proposed a method that offers emotion-independent features for ECG-based biometric identification with a high accuracy of around 99.4%. Blanco-Gonzalo et al. at the University of Madrid [23] evaluated the usability of an online dynamic signature-based biometric identification system under mental fatigue or stress. Daily routine habits can also play a key role. For instance, lifestyle habits including addiction, smoking or the abuse of pharmaceutical drugs can cause a biometric failure, or at least degrade the matching accuracy of the biometric system.



Within a person, variability in facial recognition related to well-known social issues includes aging, facial expression, makeup and facial hair, as well as lighting conditions. Among these, facial expression (which depends on mood) and differences in makeup are the most common factors, and both can change the appearance of a face significantly in different ways. Dantcheva et al. [24] collected two databases of facial images of individuals before and after applying makeup. They concluded that makeup can influence the results of a system's performance evaluation. Blanco-Gonzalo et al. [25] addressed the question of whether a handwritten signature made with a stylus and an online signature made with a fingertip should be considered two different forms. They reported that, according to the results of feedback questionnaires, most participants preferred a stylus to a fingertip when providing a signature, as they are not used to fingertip-based devices (in real life, an offline signature cannot be provided with a fingertip). Blanco-Gonzalo et al. [26] presented performance evaluation results for a handwritten signature recognition system on mobile devices (iPad). Users were required to provide a signature on an iPad with various styluses in different scenarios, and the performance results were correlated. Smejkal et al. [27] reported that treating the first signature as 'practice', excluded from the results, can reduce the variability of signatures among all participants. They also demonstrated that shorter signatures (abbreviated signatures, initials) show a very high variability of conformity and non-conformity between individual signatures. It was also concluded that the quality of signature recognition increases with the length of the information written down. In the investigation conducted by Syed et al. [28], the performance of a keystroke dynamics system was tested for passwords built from familiar words, taken from the dictionary or from a user's memory, and for meaningless passwords built from arbitrary, random letters. They concluded that, in comparison with dictionary-based words, complex and random passwords with no meaning change the reliability of the biometric system and can increase its equal error rate as a result. Bours and Evensen [29] investigated one of the most probable sources of error in the reliability of a gait recognition system, using a database comprising data gathered from 40 users. The participants were asked to wear two different shoes in two different sessions. According to the results, the accuracy of the system differs between the same-shoe and the cross-shoe settings. Smoking is another social issue that can lead to voice deviations, vocal changes and acoustic complaints, as reported in [30], in which a subset of the NIST telephone recording database is used. Arora et al. [31] used the 'IIITD Iris Under Alcohol Influence' database, which contains images of the iris before and after alcohol consumption, to investigate the influence of alcohol on the reliability of iris recognition systems. In [32], the authors provided a methodology to measure the difference between handwritten signatures before and after alcohol consumption. Using a newly collected database of samples taken from 30 users, they reported a change in the stability of the samples for all participants. Osman Ali et al. [33] presented an investigation into the effect of facial plastic surgery on the reliability of facial recognition systems.



Disease is another social problem that can affect the reliability of a biometric system. Azimi et al. [34] investigated iris recognition under the influence of diabetes. A new database of 1318 images from 343 irises was collected. It comprises 546 images from 162 healthy irises (62% female and 38% male users; 21% under 20 years old, 61% between 20 and 40, 12% between 40 and 60, and 6% over 60) and 772 images from 181 diabetic eyes (80% female and 20% male users; 1% under 20 years old, 17.5% between 20 and 40, 46.5% between 40 and 60, and 35% over 60). All of the diabetes-affected eyes had clearly visible iris patterns without any visible impairments, and only type II diabetic patients who had been diabetic for at least two years were considered. Three different open-source iris recognition codes and one commercial software development kit were used to evaluate iris recognition performance under the influence of diabetes. For the statistical analysis, a t-test and a Kolmogorov–Smirnov test were used. The same authors presented [35] a performance analysis of an iris recognition system for healthy and diabetes-affected irises, separately for female and male users. The study demonstrated that, especially in light of the ever-growing diabetes epidemic, abnormalities in the iris pattern need to be taken into account in the development of new biometric identification methods. Tomeo-Reyes et al. [36], in their manuscript entitled "Investigating the impact of drug-induced pupil dilation on automated iris recognition", collected a database consisting of 2183 iris images acquired at a resolution of 640 × 480 pixels. They showed that drug-induced pupil dilation results in higher iris recognition error rates than light-induced pupil dilation.
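The two statistical tests named above for the diabetes study can be illustrated with a short sketch that compares comparison scores from two groups. This is only an assumption-laden illustration: the text says "a t-test" without naming the variant, so Welch's unequal-variance form is chosen here, the score lists are invented, and only the test statistics are computed (turning them into p-values needs the corresponding reference distributions, for example from a statistics library).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


# Hypothetical genuine comparison scores (arbitrary units), NOT data from [34].
healthy_scores = [30, 32, 34, 36, 38]
diabetic_scores = [34, 36, 38, 40, 42]

t_stat = welch_t(healthy_scores, diabetic_scores)
d_stat = ks_statistic(healthy_scores, diabetic_scores)
print(f"Welch t = {t_stat:.3f}, KS D = {d_stat:.3f}")
```

The t statistic responds to a shift in the mean score between the two groups, whereas the Kolmogorov–Smirnov statistic responds to any difference in the shape of the two score distributions, which is one plausible reason for reporting both.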



Social issues such as daily routines, the emotional state of the users and template aging can lead to a reduction in the performance of biometric systems. This paper presented a bibliographical review of the most important and recent work concerned with the effect of human factors on the performance of biometric systems in various modalities.

Acknowledgment. This work was funded by AMBER, with sponsorship from the Marie Sklodowska-Curie EU Framework for Research and Innovation Horizon 2020, under Grant Agreement No. 675087.

References

1. Howard, J.J., Etter, D.: The effect of ethnicity, gender, eye color and wavelength on the biometric menagerie. In: IEEE International Conference on Technologies for Homeland Security (HST), Waltham, MA, pp. 627–632. IEEE Press (2013).
2. Daugman, J., Downing, C.: Searching for doppelgängers: assessing the universality of the IrisCode impostors distribution. IET Biometrics 5(2), 65–75 (2016). https://
3. Panis, G., Lanitis, A., Tsapatsoulis, N., Cootes, T.F.: An overview of research on facial aging using the FG-NET aging database. IET Biometrics 5(2), 37–46 (2016).
4. Fairhurst, M., Erbilek, M., Da Costa-Abreu, M.: Selective review and analysis of aging effects in biometric system implementation. IEEE Trans. Hum.-Mach. Syst. 45(3), 294–303 (2015).
5. Tolosana, R., Vera-Rodriguez, R., Fierrez, J., Ortega-Garcia, J.: Reducing the template ageing effect in on-line signature biometrics. IET Biometrics 8(6), 422–430 (2019).
6. Beslay, L., Galbally, J., Haraksim, R.: Automatic fingerprint recognition: from children to elderly. Ageing and age effects. Report number: JRC110173. Affiliation: European Commission (2018).
7. Madry-Pronobis, M.: Automatic gender recognition based on audiovisual cues. Master Thesis (2009)
8. Zappasodi, F., Marzetti, L., Olejarczyk, E., Tecchio, F., Pizzella, V.: Age-related changes in electroencephalographic signal complexity. PLoS One 10(11) (2015).
9. Faundez-Zanuy, M., Sesa-Nogueras, E., Roure-Alcobé, J.: On the relevance of age in handwritten biometric recognition. In: IEEE International Carnahan Conference on Security Technology (ICCST), Boston, MA, pp. 105–109. IEEE Press (2012).
10. Erbilek, M., Fairhurst, M.: Analysis of ageing effects in biometric systems: difficulties and limitations. In: Age Factors in Biometric Processing. IET (2013). https:// ch15
11. Yoon, S., Jain, A.K.: Longitudinal study of fingerprint recognition. Proc. National Acad. Sci. U.S. Am. (PNAS) 112(28), 8555–8560 (2015). pnas.1410272112
12. Best-Rowden, L., Jain, A.K.: A longitudinal study of automatic face recognition. In: 2015 International Conference on Biometrics (ICB), Phuket, pp. 214–221 (2015).
13. Czajka, A.: Influence of iris template aging on recognition reliability, November 2014.
14. Manjani, I., Sumerkan, H., Flynn, P.J., Bowyer, K.W.: Template aging in 3D and 2D face recognition. In: IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS) (2016). 7791202
15. Komogortsev, O.V., Holland, C.D., Karpov, A.: Template aging in eye movement-driven biometrics. In: Proceedings Biometric and Surveillance Technology for Human and Activity Identification XI, vol. 9075, p. 90750A (2014). https://doi.org/10.1117/12.2050594
16. Galbally, J., Martinez-Diaz, M., Fierrez, J.: Aging in biometrics: an experimental analysis on on-line signature. PLoS One 8(7) (2013). journal.pone.0069897
17. Maiorana, E., Campisi, P.: Longitudinal evaluation of EEG-based biometric recognition. IEEE Trans. Inf. Forensics Secur. 13(5), 1123–1138 (2018). 10.1109/TIFS.2017.2778010
18. Czajka, A., Bowyer, K., Ortiz, E.: Analysis of diurnal changes in pupil dilation and eyelid aperture. IET Biometrics 7(2), 136–144 (2018).
19. Mansour, A., Lachiri, Z.: Emotional speaker recognition in simulated and spontaneous context. In: 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Monastir, pp. 776–781 (2016). https://doi.org/10.1109/ATSIP.2016.7523187
20. Ghiurcau, M.V., Rusu, C., Astola, J.: A study of the effect of emotional state upon text-independent speaker identification. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, pp. 4944–4947 (2011).
21. Azimi, M., Pacut, A.: The effect of gender-specific facial expressions on face recognition system's reliability. In: IEEE International Conference on Automation, Quality and Testing, Robotics (AQTR), Cluj-Napoca, pp. 1–4 (2018). 10.1109/AQTR.2018.8402705
22. Ferdinando, H., Seppänen, T., Alasaarela, E.: Bivariate empirical mode decomposition for ECG-based biometric identification with emotional data. In: Conference Proceedings of the IEEE Engineering in Medicine and Biology Society, pp. 450–453 (2017).
23. Blanco-Gonzalo, R., Sanchez-Reillo, R., Miguel-Hurtado, O., Bella-Pulgarin, E.: Automatic usability and stress analysis in mobile biometrics. Image Vis. Comput. 32(12), 1173–1180 (2014)
24. Dantcheva, A., Chen, C., Ross, A.: Can facial cosmetics affect the matching accuracy of face recognition systems? In: 2012 IEEE Fifth International Conference on Biometrics: Theory, Applications and Systems (BTAS), Arlington, VA, pp. 391–398 (2012).
25. Blanco-Gonzalo, R., Sanchez-Reillo, R., Miguel-Hurtado, O., Liu-Jimenez, J.: Performance evaluation of handwritten signature recognition in mobile environments. IET Biometrics 3(3), 139–146 (2014).
26. Blanco-Gonzalo, R., Diaz-Fernandez, L., Miguel-Hurtado, O., Sanchez-Reillo, R.: Usability evaluation of biometrics in mobile environments. In: The 6th International Conference on Human System Interaction (HSI) (2013). 1109/HSI.2013.6577812
27. Smejkal, V., Kodl, J., Sieger, L.: The influence of stress on biometric signature stability. In: IEEE International Carnahan Conference on Security Technology (ICCST), Orlando, FL, pp. 1–5 (2016). 7815680
28. Syed, Z., Banerjee, S., Cheng, Q., Cukic, B.: Effects of user habituation in keystroke dynamics on password security policy. In: IEEE 13th International Symposium on High-Assurance Systems Engineering (HASE) (2011). HASE.2011.16
29. Bours, P., Evensen, A.: The Shakespeare experiment: preliminary results for the recognition of a person based on the sound of walking. In: International Carnahan Conference on Security Technology (2017). 8167839
30. Tafiadis, D., Chronopoulos, S.K., Kosma, E.I., Voniati, L., Raptis, V., Siafaka, V., Ziavra, N.: Using receiver operating characteristic curve to define the cutoff points of voice handicap index applied to young adult male smokers. J. Voice 32(4), 443–448 (2018).
31. Arora, S.S., Vatsa, M., Singh, R., Jain, A.: Iris recognition under alcohol influence: a preliminary study. In: 2012 5th IAPR International Conference on Biometrics (ICB), New Delhi, pp. 336–341 (2012).
32. Shin, J., Kuyama, T.: Detection of alcohol intoxication via online handwritten signature verification. Pattern Recogn. Lett. 35, 101–104 (2014)
33. Osman Ali, A.S., Sagayan, V., Malik, A., Aziz, A.: Proposed face recognition system after plastic surgery. IET Comput. Vis. 10(5), 342–348 (2016). https://doi.org/10.1049/iet-cvi.2014.0263
34. Azimi, A., Rasoulinejad, S.A., Pacut, A.: Iris recognition under the influence of diabetes. Biomed. Eng./Biomedizinische Technik 64(6), 683–689 (2019). https://
35. Azimi, M., Rasoulinejad, S.A., Pacut, A.: The effects of gender factor and diabetes mellitus on the iris recognition system's accuracy and reliability. In: Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), Poznan, Poland, pp. 273–278 (2019).
36. Tomeo-Reyes, I., Ross, A., Chandran, V.: Investigating the impact of drug induced pupil dilation on automated iris recognition. In: 2016 IEEE 8th International Conference on Biometrics Theory, Applications and Systems (BTAS), Niagara Falls, NY, pp. 1–8 (2016).

Towards Semantic Segmentation Using Ratio Unpooling

Duncan Boland(B) and Hossein Malekmohamadi

Institute of Artificial Intelligence, De Montfort University, Leicester LE1 9BH, UK
[email protected], [email protected]

Abstract. This paper presents the concept of Ratio Unpooling as a means of improving the performance of an Encoder-Decoder Convolutional Neural Network (CNN) when applied to Semantic Segmentation. Ratio Unpooling allows 4 times the amount of positional information to be carried through the network, resulting in more precise border definition and more resilient handling of unusual conditions such as heavy shadows when compared to Switch Unpooling. Applied here as a proof of concept to a simple implementation of SegNet, retrained on a cropped and resized version of the CityScapes dataset, Ratio Unpooling increases the mean Intersection over Union (IoU) by around 5–6% on both the KITTI and modified CityScapes datasets, a greater gain than that obtained by applying Monte Carlo Dropout, at a fraction of the cost.

Keywords: Semantic Segmentation · Ratio Unpooling · KITTI · CityScapes · Fully Convolutional Networks · Encoder-decoder




Semantic Segmentation is the attempt to understand the contents of an image at the pixel level by labelling each pixel according to the class which that pixel represents. The result typically takes the form of an image mask of the same dimensions as the original image. Semantic Segmentation is used in a number of applications [1,2,14,16,21,26,29], including the field of Autonomous Vision, in which sections of street-level images are labelled to aid the understanding of the scene and allow more effective and safer navigation of an autonomous vehicle. Deep learning has proved successful in a number of applications, and the majority of these projects use deep convolutional neural networks (CNNs), which apply learned convolution kernels to produce feature maps. These types of networks have seen great success in image classification tasks, and a number of variants have been developed, such as AlexNet [18], VGG16 [27] and GoogLeNet [28]. Building on these successes, CNNs have been applied to the field of semantic segmentation, and successive projects have evolved the architecture of these networks, resulting in enhanced performance.

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 111–123, 2021.



Inspired by these successes, the authors of [19] applied these CNNs to the field of semantic segmentation, removing the final fully-connected classifier layer and replacing it with an upscaling function to create Fully Convolutional Networks (FCN). Similar approaches were taken in [10,13]. While very successful in fields that consider the image as a whole, the feature maps learned by these systems are, by nature of their use of downsampling layers, only a fraction of the size of the input images. As a result, these initial attempts produced coarse, blocky masks. Despite this, the networks produced encouraging results, and FCNs form the basis of most subsequent Semantic Segmentation projects.

U-Net [24] applies a symmetrical encoder-decoder architecture in which the FCN-style architecture of convolutional layers followed by pooling layers is paired with a mirror image of upscaling deconvolutional layers followed by convolutional layers. This architecture produced masks with finer borders, achieving impressive performance in the 2015 ISBI Cell Tracking challenge, a segmentation task performed over microscopic images. U-Net recorded a mean Intersection over Union (IoU) of 92%, a large improvement over the second-place entry's 46%.

Deconvolutional layers, also known as Transposed Convolutional Layers, perform both a set of convolutions and the upscaling process. They were introduced in Deconvnet [30] as an alternative to traditional upscaling methods such as bilinear interpolation or nearest-neighbour. The learnable parameters of the deconvolutional filters allow for better performance than these standard upscaling functions; however, the resulting output can suffer from checkerboard artefacting [22]. A common alternative replaces the single deconvolutional layer with a dedicated upscaling layer followed by a standard convolutional layer, which reduces the checkerboarding effect.
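The checkerboard effect mentioned above can be illustrated with a small counting exercise. The sketch assumes one commonly given explanation, which the text itself does not spell out: in a transposed convolution each input value is spread over `kernel` neighbouring outputs spaced `stride` apart, so when the kernel size is not divisible by the stride some output positions receive more contributions than others. The function name and the 1D setting are illustrative choices only.

```python
def overlap_counts(n_inputs, kernel, stride):
    """Count how many kernel taps land on each output of a 1D transposed convolution."""
    length = (n_inputs - 1) * stride + kernel
    counts = [0] * length
    for i in range(n_inputs):
        for k in range(kernel):
            # Input i writes to outputs i*stride ... i*stride + kernel - 1.
            counts[i * stride + k] += 1
    return counts


uneven = overlap_counts(n_inputs=4, kernel=3, stride=2)
even = overlap_counts(n_inputs=4, kernel=4, stride=2)
print("kernel 3, stride 2:", uneven)
print("kernel 4, stride 2:", even)
```

With a kernel of 3 and a stride of 2 the interior counts alternate between 1 and 2, which is the 1D analogue of a checkerboard, whereas a kernel of 4 gives a uniform interior. An upscaling layer followed by an ordinary convolution sidesteps the issue because every output position is then covered by the same number of taps.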

Fig. 1. Progression of network architectures from a single upscale layer to symmetrical Encoder-Decoder to Encoder-Decoder with Max Pooling Switch connections, adapted from figures used in the SegNet and Bayesian SegNet papers [4, 15]

Ratio Unpooling


A similar encoder-decoder architecture is used in an image classifier network in [23], which adds a connection between the Max Pooling layers in the encoder and the equivalent upscaling layers in the decoder. This process records the position of the pixel within the 2 × 2 pool which contained the maximum value that is passed to the next layer of the encoder. This positional data, referred to as the Max Pooling Switches, is used by the corresponding upscaling layer to return the updated single value to the original source position of the maximum value within the pool, in a process referred to as Switch Unpooling. A summary of this progression is shown in Fig. 1. This technique was adopted into a U-Net style symmetrical FCN for segmentation in SegNet [3,4], and later Bayesian SegNet [15], an architecturally identical variation of SegNet which uses Monte Carlo Dropout [11] to improve the performance of the network by around 2–3 percentage points.

The rest of this paper is organised as follows. Section 2 presents a brief background on Ratio Unpooling. The experiments on two popular datasets with the idea of Ratio Unpooling are covered in Sect. 3, whereas Sect. 4 is dedicated to results and discussion of our experiments. Finally, Sect. 5 concludes this paper.



As described above, the use of Max Pooling Switches improves the upscaling process by allowing relevant spatial and positional information to be carried forward into deeper layers of the network. The authors of [25] demonstrate this by creating an encoder of only pooling layers and a decoder of only unpooling layers using the Max Pooling Switches. The decoder successfully recreates a close approximation of the original image using the pooled values and their positions. This aids the semantic segmentation process by allowing the Max Pooling layers to negate any variance in position or rotation, thus allowing the convolutional layers to identify objects within the image, as is the case in image classification networks. The positional information retained in the switches then allows these identified objects to be returned to an approximation of their original locations within the label mask, which in turn allows for more accurate boundaries in the output.

A variation of Switch Unpooling is used in [17], which applies the Max Pooling Switches during unpooling while also concatenating nearest-neighbour upscaled values from earlier levels onto the output. This populates the remaining 3 values in each unpooled area, helps to mitigate the “bed-of-nails” style appearance of the upscaled layer, and allows more high-level information to be carried forward through the network. Similarly, DeepLabv3+ [8] adds a simple decoder onto the successful series of DeepLab networks, which concatenates the result of a simple bilinear interpolation upscaling with the values of the equivalent resolution layer in the encoder. Adding this step and allowing more values to be carried through the network improved the IoU performance from the 81.3% of DeepLabv3 [7] to 82.1% on the CityScapes [9] dataset, a new state of the art at the time.



It seems logical, then, that there is scope for further improving the results if more of the positional information from the earlier layers could be carried forward. The authors of [20] highlight the “bed-of-nails” effect that occurs when unpooling an image using Max Pooling Switches. In a 2 × 2 pool, the input is a single pixel value which is positioned in one corner of a 2 × 2 pixel square, and the remaining 3 values in each unpooled area are left as 0. Each unpooled pixel is therefore isolated in the output image, which appears more dilated than upscaled. In this paper, we propose applying the ratios of the relative values of the original pre-pooled layer, rather than the position of only the maximum value, to fill in the remaining pixel values. This should allow for more detailed positional information to be carried forward and subsequently more accurate output.
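The “bed-of-nails” effect described above can be illustrated with a minimal sketch in plain Python. The function names (`max_pool_with_switches`, `switch_unpool`) and the toy 4 × 4 feature map are illustrative assumptions, not taken from the SegNet code; the sketch only shows that Switch Unpooling leaves 3 of every 4 output values at 0.

```python
def max_pool_with_switches(x):
    """2x2 max pooling over a 2D list; returns pooled values and switch positions."""
    pooled, switches = [], []
    for i in range(0, len(x), 2):
        pooled_row, switch_row = [], []
        for j in range(0, len(x[0]), 2):
            # Each candidate is (value, offset inside the 2x2 window).
            window = [(x[i + di][j + dj], (di, dj)) for di in (0, 1) for dj in (0, 1)]
            value, offset = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            switch_row.append(offset)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def switch_unpool(values, switches):
    """Place each value at its recorded switch position; all other outputs stay 0."""
    out = [[0.0] * (2 * len(values[0])) for _ in range(2 * len(values))]
    for i, row in enumerate(values):
        for j, value in enumerate(row):
            di, dj = switches[i][j]
            out[2 * i + di][2 * j + dj] = value
    return out


# Hypothetical toy feature map, chosen only for illustration.
feature_map = [
    [1.0, 2.0, 5.0, 6.0],
    [3.0, 4.0, 7.0, 8.0],
    [9.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 4.0, 2.0],
]
pooled, switches = max_pool_with_switches(feature_map)
unpooled = switch_unpool(pooled, switches)
zero_count = sum(value == 0.0 for row in unpooled for value in row)
```

Here 12 of the 16 unpooled values remain 0, which is the sparsity that the proposed method aims to fill in.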

Fig. 2. Comparison of Switch Unpooling (top) and Ratio Unpooling

By taking the ratio of each pixel value in the pool to the maximum value, a set of 4 ratios can be calculated by simple division. These ratios can then be applied to the equivalent pixels in the layer to be unpooled, giving a value for each of the 4 output pixels and retaining the relevant levels of intensity. This also removes the need to calculate and retain the positional switch values entirely, which reduces the memory overhead for the process, since the only values used in the calculation are the original input and output of the Max Pooling layer and the input into the unpooling layer, all of which are already held in memory. Since the calculation of the Max Pooling Switches is a known performance bottleneck, also identified in [20], a sufficiently efficient implementation of this Ratio Unpooling process could also allow for a reduction in the inference time of the network. Figure 2 shows a comparison between Switch Unpooling and Ratio Unpooling.

3 Experiments

3.1 Baseline Network

A pre-existing implementation of a fully convolutional semantic segmentation network, which utilises the Max Pooling Switches process, has been selected to provide a baseline level of performance. The model was then modified to use the new Ratio Unpooling method to allow for a comparison of performance. The chosen baseline model is SegNet-Tensorflow [5], an implementation of Bayesian SegNet in TensorFlow created by Łukasz Bienias. This implementation uses the VGG network weights as the encoder and was originally trained using a subset of the Cambridge-Driving Labelled Video Database (CamVid) [6]. Although now outperformed by several newer designs, the SegNet architecture was chosen due to its relative simplicity when compared to later enhancements. As many of these employ a similar encoder-decoder structure, any improvements found by applying Ratio Unpooling to SegNet could potentially be replicated in the more complex architectures of later models such as [17] or [8].

3.2


The baseline network has been retrained on a modified version of the CityScapes Dataset that has been resized and cropped to match the dimensions of the KITTI Dataset [12], with which it shares labelling conventions. To simplify the training and testing process, the experiments use only the 19 evaluation categories rather than the full 34.

3.3 Ratio Implementation

Having developed a test-bed network and established a baseline level of performance, the proposed Ratio Unpooling method has been added to allow a comparison. The Ratio Unpooling concept as described above applies the ratio of each of the 4 values in the original pool against the maximum value in that pool to the value being upsampled in the current layer. The ratios can be calculated by dividing each of the input values of the original pooling layer by the output value of their pool. However, in order to perform this elementwise division, the original output would first have to be upscaled using no interpolation, simply repeating each value along both axes. These calculated ratios can then be applied to the new input layer using elementwise multiplication; however, this too would have to be upscaled in the same way. A more efficient method for achieving the same result is to divide the single value of the new input layer by the single value of the original output before upscaling, which involves dividing 75% fewer elements. The result can then be upscaled to match the dimensions of the original input in a single step for the elementwise multiplication with the input values of the original pooling layer. Since the new layer no longer utilises the Max Pooling Switches, the Max Pooling layer definition can now also be changed to avoid calculating these values. This reduces the overhead and allows for more efficient network performance.
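The two formulations described above can be compared in a small plain-Python sketch. This is not the authors' TensorFlow implementation: the function names, the toy values and the zero-division guard in `safe_div` are assumptions made for illustration. The sketch checks that dividing before upscaling gives the same result as the direct form while dividing only one value per pool.

```python
def max_pool(x):
    """2x2 max pooling with stride 2 over a 2D list of floats."""
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]), 2)]
        for i in range(0, len(x), 2)
    ]


def repeat_upscale(x):
    """Upscale by 2 with no interpolation, repeating each value along both axes."""
    out = []
    for row in x:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def safe_div(a, b):
    # Assumption (not from the paper): a zero pool maximum yields a zero ratio.
    return a / b if b != 0 else 0.0


def ratio_unpool_naive(pool_input, pool_output, new_input):
    """Direct form: upscale both small maps, then divide and multiply elementwise."""
    up_out = repeat_upscale(pool_output)
    up_new = repeat_upscale(new_input)
    return [
        [safe_div(p, o) * n for p, o, n in zip(p_row, o_row, n_row)]
        for p_row, o_row, n_row in zip(pool_input, up_out, up_new)
    ]


def ratio_unpool(pool_input, pool_output, new_input):
    """Efficient form: divide at the pooled resolution, upscale once, then multiply."""
    scale = [
        [safe_div(n, o) for n, o in zip(n_row, o_row)]
        for n_row, o_row in zip(new_input, pool_output)
    ]
    up_scale = repeat_upscale(scale)
    return [
        [p * s for p, s in zip(p_row, s_row)]
        for p_row, s_row in zip(pool_input, up_scale)
    ]


# Hypothetical toy values: the encoder-side pooling input and the decoder-side input.
pool_input = [
    [1.0, 2.0, 5.0, 6.0],
    [3.0, 4.0, 7.0, 8.0],
    [8.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 4.0, 2.0],
]
pool_output = max_pool(pool_input)
new_input = [[2.0, 4.0], [2.0, 8.0]]
unpooled = ratio_unpool(pool_input, pool_output, new_input)
```

In this sketch the position that held each pool maximum receives the decoder value unchanged, and the other three positions receive that value scaled by their original ratio to the maximum.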


4 Results and Discussion

Table 1. Per class mean IoU results for baseline and Ratio Unpooling networks

[Table body lost in extraction. Column groups: CityScapes, Bayesian, KITTI, each with Baseline and Ratio Unpooling columns. Surviving row fragments: Traffic light; Traffic sign 30.41; Vegetation 74.83; Motorcycle 6.02.]

After a similar number of iterations, the Ratio Unpooling network produces an increase in overall IoU of just under 8 percentage points when compared to the original Max Pooling Switches method. As shown in Table 1, improvement is seen across all 20 classes, with some seeing an increase of over 10 or even 15 percentage points.
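For reference, the per-class and mean IoU figures discussed here follow the usual intersection-over-union definition. The sketch below is a minimal plain-Python version over flattened label lists. The function names and toy labels are illustrative, and skipping classes absent from both prediction and ground truth is an assumption, since the exact evaluation protocol is not stated in the text.

```python
def per_class_iou(pred, target, num_classes):
    """IoU per class: |pred AND target| / |pred OR target| over flattened labels."""
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        # None marks a class that appears in neither prediction nor ground truth.
        ious.append(intersection / union if union else None)
    return ious


def mean_iou(ious):
    """Average over the classes that are present (assumed convention)."""
    present = [value for value in ious if value is not None]
    return sum(present) / len(present)


# Hypothetical flattened label masks for a 6-pixel image with 3 classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
ious = per_class_iou(pred, target, num_classes=3)
miou = mean_iou(ious)
```

A gain quoted in percentage points is then the difference between two such mean IoU values multiplied by 100.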

Fig. 3. CityScapes validation set accuracy (mean IoU) during training process

In addition, the new network converges in fewer iterations and, as shown in Fig. 3, the performance gains are much smoother. Bayesian inference, simulated through Monte Carlo Dropout, improves the performance of both networks slightly. The performance gain offered by the change in the unpooling method is reduced slightly, to 6 percentage points, when applied to the Bayesian network. Applying the same network to the KITTI dataset gives similar results, with the key difference being that the Sky class loses performance, the only instance of performance loss observed across the experiment. When comparing the outputs of the baseline and Ratio Unpooling networks, the improvements are visually noticeable, with classes occupying large regions such as road, building and sky all featuring fewer splotches of incorrect classification within the regions.
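The Bayesian variant mentioned above relies on Monte Carlo Dropout [11], in which dropout stays active at inference and the class probabilities from several stochastic forward passes are averaged. The following plain-Python sketch illustrates that idea for a single pixel. The one-layer classifier, its weights and the feature values are hypothetical stand-ins for the real network; 25 samples are used because that is the figure quoted in this paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def stochastic_forward(features, weights, rng, drop_rate=0.5):
    """Hypothetical per-pixel linear classifier with dropout left on at test time."""
    keep = 1.0 - drop_rate
    # Inverted dropout: zero a feature with probability drop_rate, rescale the rest.
    dropped = [f / keep if rng.random() < keep else 0.0 for f in features]
    logits = [sum(w * f for w, f in zip(row, dropped)) for row in weights]
    return softmax(logits)


def mc_dropout_predict(features, weights, n_samples=25, seed=0):
    """Average class probabilities over n_samples stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(features, weights, rng) for _ in range(n_samples)]
    n_classes = len(weights)
    mean = [sum(sample[c] for sample in samples) / n_samples for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: mean[c])
    return mean, label


# Hypothetical features for one pixel and an identity-like weight matrix.
features = [1.0, 2.0, 0.5]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mean_probs, label = mc_dropout_predict(features, weights)
```

Each prediction costs `n_samples` forward passes, which is why the Bayesian networks are much more expensive at inference than the single-pass standard networks.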

Fig. 4. Improved car definition. Image taken from CityScapes [9]



The use of Ratio Unpooling improves the accuracy of object borders, particularly for the car class. Figure 4 shows the output of both networks on the same image, along with the original image. This result is also visible in Fig. 5, which also shows a reduction in the number of falsely identified pole sections in the Ratio Unpooling network, where the baseline model has mistakenly interpreted the striping pattern on the building as poles.

Fig. 5. Improved car and pole definition. Image taken from CityScapes [9]

The large gain in bike and rider IoU is demonstrated in Fig. 6, which shows a significant improvement in definition and border clarity. It also further demonstrates an improvement in road sign definition. The Ratio Unpooling network also copes better with environmental variations, such as heavy shadowing, as shown in Fig. 7, as well as unusual road elements, such as a tramline, which the baseline network confused for a pole (Fig. 8), or a patch of leaves (Fig. 9).

Fig. 6. Improved rider and bike definition. Image taken from CityScapes [9]



Fig. 7. Improved ability to cope with shadow. Image taken from CityScapes [9]

Fig. 8. Improved handling of tramline. Image taken from CityScapes [9]

These performance gains can likely be attributed to the increase in the volume of data being passed forward to the decoder network. With a pool size of 2 × 2, Ratio Unpooling allows for 4 times the amount of data to pass through from the encoder to the decoder when compared to Switch Unpooling, populating all 4 values in the new pool rather than just a single value. In particular, it is spatial data, which Max Pooling traditionally removes to aid in classification, that the Ratio Unpooling layer is able to recover. The results show that this does, as anticipated, improve the class edge definitions. In addition, Ratio Unpooling seems to deal with instances of small unclassified objects, such as the leaves in Fig. 9, much more effectively. This can likely be attributed to the additional information about the surrounding region being carried forward, which seems to reduce the uncertainty of the classification. A similar result is seen in the more accurate classification of reflective surfaces. This also offers an explanation for the less erratic trend of the IoU during training, as the network appears less likely to suddenly overfit to a single training image as a result of these improved capabilities.

Fig. 9. Improved handling of leaves. Image taken from CityScapes [9]

Examining the output of the Bayesian versions of the two networks gives extremely similar results. Ratio Unpooling improves accuracy across all classes, and the label images are much clearer and cleaner, as in the standard networks. Fig. 10 shows a similar level of improvement in the definition of cars against the road as seen in the non-Bayesian networks.

Fig. 10. Improvements to car definition in Bayesian Network. Image taken from CityScapes [9]

Interestingly, the improvement gained by changing to Ratio Unpooling from the base network is much greater than that given by employing Bayesian inference. It also comes at a greatly reduced cost, since the image is still only processed once rather than 25 times. It is likely that gains would be produced in a similar manner when Ratio Unpooling is applied to a more advanced baseline network, albeit possibly not as significant in size. Evidence of this can be seen in the comparative performance of the standard and Bayesian versions of the network, and in that of the standard network on CityScapes and KITTI. In both comparisons the gains are, for the most part, consistent between the Switch and Ratio Unpooling techniques. It seems reasonable then to suggest that similar gains could be expected when substituting the Ratio Unpooling function into other networks using Switch Unpooling. At worst, there is little suggestion that the performance is degraded, and removal of the need to retain the pool switch values during inference would reduce the memory required in addition to improving the training time.

There is likely scope to further improve the implementation of Ratio Unpooling, possibly by taking advantage of developments in the TensorFlow library that were released near the end of the project. Using the Ratio Unpooling technique to enhance newer, more complex encoder-decoder networks is an obvious next step.



This paper shows that the proposed Ratio Unpooling technique improves accuracy, as measured by mean IoU, over the use of Max Pooling Switches by around 5–6 percentage points. This overall performance gain can be seen across both the CityScapes and KITTI datasets, and in both a standard inference model and a Bayesian inference model, simulated through Monte Carlo Dropout. These performance gains are attributed to the increased quantity of spatial and positional data that is passed from the encoder half of the network to the decoder. Ratio Unpooling reduces the memory requirement of the network by eliminating all Max Pooling Switch parameters and requiring no additional parameters or variables, although the overall inference time was found to have increased slightly. An additional, unanticipated benefit of smoother and faster convergence during training has also been observed. Opportunities for future work include refining the implementation to reduce inference time and integrating Ratio Unpooling into other networks as an alternative to Switch Unpooling, across any number of domains and applications.

References

1. Abbasi, M.H., Majidi, B., Manzuri, M.T.: Glimpse-gaze deep vision for modular rapidly deployable decision support agent in smart jungle. In: 2018 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), pp. 75–78. IEEE (2018)
2. Azimi, S.M., Fischer, P., Körner, M., Reinartz, P.: Aerial lanenet: lane-marking semantic segmentation in aerial imagery using wavelet-enhanced cost-sensitive symmetric fully convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 57(5), 2920–2938 (2018)
3. Badrinarayanan, V., Handa, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293 (2015)
4. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(12), 2481–2495 (2017)
5. Bienias, L.: Segnet-tensorflow (2018)
6. Brostow, G.J., Fauqueur, J., Cipolla, R.: Semantic object classes in video: a high-definition ground truth database. Pattern Recogn. Lett. 30(2), 88–97 (2009)
7. Chen, L.C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 (2017)
8. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 801–818 (2018)
9. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)
10. Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2012)
11. Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: International Conference on Machine Learning, pp. 1050–1059 (2016)
12. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: the KITTI dataset. Int. J. Robot. Res. 32(11), 1231–1237 (2013)
13. Grangier, D., Bottou, L., Collobert, R.: Deep convolutional networks for scene parsing. In: ICML 2009 Deep Learning Workshop, vol. 3, p. 109. Citeseer (2009)
14. Guo, Z., Shengoku, H., Wu, G., Chen, Q., Yuan, W., Shi, X., Shao, X., Xu, Y., Shibasaki, R.: Semantic segmentation for urban planning maps based on U-Net. In: 2018 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2018), pp. 6187–6190. IEEE (2018)
15. Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680 (2015)
16. Khan, K., Mauro, M., Leonardi, R.: Multi-class semantic segmentation of faces. In: 2015 IEEE International Conference on Image Processing (ICIP), pp. 827–831. IEEE (2015)
17. Krešo, I., Čaušević, D., Krapac, J., Šegvić, S.: Convolutional scale invariance for semantic segmentation. In: German Conference on Pattern Recognition, pp. 64–75. Springer (2016)
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
19. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
20. Mahendran, A., Vedaldi, A.: Salient deconvolutional networks. In: European Conference on Computer Vision, pp. 120–135. Springer (2016)
21. Moon, N., Bullitt, E., Van Leemput, K., Gerig, G.: Automatic brain and tumor segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 372–379. Springer (2002)
22. Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Distill 1(10), e3 (2016)
23. Ranzato, M., Huang, F.J., Boureau, Y.L., LeCun, Y.: Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007)
24. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Springer (2015)
25. Saint Andre, M.D.L.R., Rieger, L., Hannemose, M., Kim, J.: Tunnel effect in CNNs: image reconstruction from max switch locations. IEEE Signal Process. Lett. 24(3), 254–258 (2016)
26. Saito, S., Li, T., Li, H.: Real-time facial segmentation and performance capture from RGB input. In: European Conference on Computer Vision, pp. 244–261. Springer (2016)
27. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
28. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
29. Wei, G.Q., Arbter, K., Hirzinger, G.: Automatic tracking of laparoscopic instruments by color coding. In: CVRMed-MRCAS’97, pp. 357–366. Springer (1997)
30. Zeiler, M.D., Taylor, G.W., Fergus, R., et al.: Adaptive deconvolutional networks for mid and high level feature learning. In: ICCV, vol. 1, p. 6 (2011)

Adaptive Retraining of Visual Recognition-Model in Human Activity Recognition by Collaborative Humanoid Robots

Vineet Nagrath1,2(B), Mossaab Hariz1, and Mounim A. El Yacoubi1

1 Institute Mines-Telecom, Telecom SudParis, Institut Polytechnique de Paris, Paris, France {mounim.el yacoubi,mossaab.hariz}
2 Service Robotics Research Center, Technische Hochschule Ulm, Ulm, Germany [email protected]

Abstract. We present a vision-based activity recognition system for centrally connected humanoid robots. The robots interact with several human participants who have varying behavioral styles and interactivity-variability. A cloud server provides and updates the recognition model in all robots. The server continuously fetches the new activity videos recorded by the robots. It also fetches corresponding results and ground-truths provided by the human interacting with the robot. A decision on when to retrain the recognition model is made by an evolving performance-based logic. In the current article, we present the aforementioned adaptive recognition system with special emphasis on the partitioning logic employed for the division of new videos in training, cross-validation, and test groups of the next retraining instance. The distinct operating logic is based on class-wise recognition inaccuracies of the existing model. We compare this approach to a probabilistic partitioning approach in which the videos are partitioned with no performance considerations.

Keywords: Learning and adaptive systems · Human activity recognition · Online learning · Distributed robot systems · Computer vision · Intersection-kernel SVM model · Dense interest point trajectories

Acronyms

ADLs: Activities of daily living
BOW: Bag of words
CV/cv Group: Cross-validation group
EADLs: Enhanced ADLs
GMM: Generalized method of moments
HAR: Human activity recognition
HOF: Histogram of optical flows

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 124–143, 2021.

Adaptive Activity Recognition by Humanoid Robots




HOG: Histogram of gradients
HSV: Hue, saturation and value
IADLs: Instrumental ADLs
IKSVM: Intersection kernel based SVM
IP: Interest point
LSTM: Long short-term memory
MBH: Motion boundary histogram
MBHx: MBH in x orientation
MBHy: MBH in y orientation
NLP: Natural language processing
Probabilistic contribution Split
Probabilistic ratio Split
RNN: Recurrent neural network
Space-time interest points
ST-LSTM: Spatio-temporal LSTM
SVM: Support vector machine
Training group
Test group


Cognitive robotics has seen increased interest in the last few years. Robots in such systems can be used as educational assistants (imitation learning [5]) or purely for entertainment purposes [28]. Humanoid robotic assistance to an elderly person may stimulate a healthier and more enjoyable life. For a frail person, a robotic assistant can improve his/her performance of activities of daily living (ADLs). This is equally true for the independence and health of elderly people living on their own [22]. Self-maintenance tasks (ADLs), tasks required for living independently (IADLs: instrumental ADLs) and recreational/educational activities (EADLs: enhanced ADLs) are all essential for maintaining an independent, healthy life. Most current robots supporting ADLs (mainly ambulation), IADLs (housekeeping) and EADLs (social interaction) for elderly people lack natural human-robot interaction [2]. This limitation of current assistive robotic systems can be reduced by developing robots with natural interaction capabilities (vision, speech etc.).

Fig. 1. Anatomy of multi-robot activity recognition system.


V. Nagrath et al.

Natural speech-based human-robot interaction using natural language processing (NLP) is reaching maturity, with mobile phone applications like Siri [1] and Cortana [23] making voice-based personal assistance feel natural to most users. Vision-based interaction between robots and humans is necessary for meaningful real-world cognitive robotic applications. Present processing and storage limitations on humanoid robotic platforms prevent us from realizing such applications. The greater complexity of image- and video-based recognition systems compared to their speech-based counterparts is also a major limiting factor. For an effective cognitive robot, both speech and vision (image and video) are essential for natural interaction and cognitive awareness. Video and speech can efficiently complement each other for most cognitive applications. Vision-based robotics can extend the capabilities of cognitive robotics to scenarios demanding gesture-based communication for people with speech-language disorders. Physical activity recognition by robots also leads to applications for stimulating physical exercise, imitation learning, and entertainment.

An earlier article [10] presented a standalone robotic system for JULIETTE, a project that implemented machine vision algorithms on the Nao humanoid robot for recognition of ADLs in a domestic environment. The work presented in [10] was a proof of concept of the recognition algorithms on a limited set of activities and human actors. The activities included some with low motion content, such as writing. Variability in the system came from different human actors enacting the activities in their own performance styles. Even when an actor repeats an activity, the visual signature is not exactly the same. Different lighting situations, viewing angles and distances ensured that the capabilities of the activity recognition algorithm could be tested on a challenging set of activity videos.
The visual recognition part of the work presented in the current article is a reproduction and extension of the work presented in [10] on an entirely new set of activity classes and actors. The current activity database poses a greater challenge to the activity recognition algorithm. The activities chosen in the current work have higher inter-class overlap and intra-class variability. More than half of the activities in the new activity set have similar low motion video signatures, such as the “come-closer”, “go-away”, “thumbs-up” and “Italian” [26] gestures (Fig. 2). In addition to lighting, viewing angles and distances, the physical locations where robots are deployed are also different (Fig. 3).

Fig. 2. Similar looking, low movement gestures.

Fig. 3. Some of lighting and location variations.

The visual recognition system presented in [10] was trained once in the beginning (on a computer) and then deployed on the Nao robot. Once deployed, the model remained identical for the whole run and the newly recorded videos were not used to retrain the recognition model. The current work takes into account the present and historical performance of the current recognition model to decide an appropriate time for its retraining. Several Nao robots are connected to a centralized server and frequently update their recognition model using the activity videos newly recorded by all the robots. The key challenge in implementing a video-based activity recognition system on a robot is to manage the high demand for computation and memory resources. Even when the system is trained externally on a computer, recognition of a video recorded by a robot using a trained model is very demanding on its limited resources. The main contribution on the recognition side of the current work is to show how the vision algorithm and robot settings can be optimized for good recognition. Most current gesture recognition systems on robots use a Kinect camera [24] as the sensor and an external interface computer to provide processed skeleton and raw RGBD data to the robot [6,25,33,34]. Using only the onboard camera for recording activity videos makes our robot autonomous: once the recognition model is generated (externally), the robot’s own resources (camera, processor, and memory) are enough to classify the human activities. The preceding work [10] was an attempt to implement a vision-based activity recognition system fully inside a robot without high-level human skeleton data as input.
This work used a set of 11 activity classes recorded from 8 persons interacting with one robot. The current work is an improvement on the earlier work with even greater complexity (22 classes, 14 persons, multiple robots at multiple locations, online learning, run-time retraining and redeployment of recognition model). We had to enforce drastic constraints on video resolution (160 × 120), duration (2 s) and frame-rate (12 fps) to implement the recognition system on the Nao robot’s limited resources. Better recording settings (640 × 480 resolution, 30 fps and constrained human-robot distance/orientation) and processing on a more powerful computer/robot significantly improve the performance of our recognition algorithm. For the current work, however, the focus is on the Nao robotic platform.

In the past few years, many video-based activity recognition systems on robots have been proposed [14,15,31]. In [35], a simulated (Webot platform) vision-based humanoid robot (HOAP-2) was presented which can identify three simple human gestures. This was a simulation of the “Chalk, Blackboard, Duster” game and used an HSV histogram to locate the human hand. 10 geometric features were then extracted from the detected blob of hand pixels and used as neural network input to learn the three gestures. In [13], a GMM method is used to identify the human hand which enacts the gesture of pointing to a calendar in an outdoor environment. For prototyping, the authors adopted a portable computer as a server for processing images. In [16], human action primitives are learned in an unsupervised way along with object state changes. Parametric hidden Markov models were then used to map human arm action trajectories to the object’s changes of state. In [3], the gestures were restricted to hand movements and a Gaussian model was used to identify hand and face (skin-color). Gesture trajectories were generated by continuous Kalman filter tracking of hands and face. A histogram is then generated from the points involved in the gesture trajectory, which is then used for classification. Although the system was implemented on a Nao robot, the speed performance and classification accuracy were not mentioned in the article. The system was restricted to a small variation in lighting and required several manual adjustments on color variations and hand positioning. A more recent work [29] is a resource-heavy yet oversimplified HAR system that does not suit a resource-light humanoid like the Nao robot. Another interesting attempt [19], performed on a single Nao robot, is based on rigid body motions for gesture recognition. This work, however, focuses on alphabet renderings in different spatial orientations. An interesting approach operating in the frequency domain was presented in [11]. Aside from these, there are several skeleton-based approaches for 3D action recognition for usual applications utilizing deep neural networks [7], hierarchical recurrent neural networks (Hierarchical RNNs) [9] and co-occurrence feature learning using deep long short-term memory (LSTM) networks [40]. Sequential skeletal-joint [27] and spatio-temporal LSTM (ST-LSTM) with trust gates [20] approaches are also worth mentioning. The aforementioned studies are restricted to a few simple gestures, which are mostly large hand and arm movements. The gestures are made very close to the robot, making a very convenient scenario for the robot. In our work, we have intentionally kept the conditions complex to better simulate real-life scenarios.
The gesture signatures are not restricted to arm and hand movements at a predefined speed. The relative positioning of the human with respect to the robot, the lighting conditions, and other humans and objects in the scene are not controlled while conducting the experiments. The human actors are not told how to perform an activity. Different actors thus enact the activity in light of their own understanding of it. For example, in the activity “Wave-hand”, the actor is told by the operator to simply wave his/her hand. It is up to the actor to decide which hand or hands, speed, height, arc-length, and arc-angles he/she identifies with the “Wave-hand” activity. It is also up to the actor to decide what variations he/she wants to make when told to repeat the activity in the future. In most cases, the actors are not told to repeat an action immediately, so that they have limited memory of the last time they performed an activity. This approach ensures that the actors are influenced neither by the operators nor by their own memory of the way a particular activity should be performed. Our system complexity is very high since it works on a high number of activity classes enacted by a high number of actors, who are not influenced by the operators to enact an activity in a particular way. Furthermore, more than half of the activities have a very low optical signature and are classified by a single recognition model along with other activities with very large optical signatures. In the current context, the optical signature of an activity/gesture is its uniqueness in the descriptor space; a small value signifies that the activity is hard to distinguish from the others.

Vision-based human activity recognition (HAR) methodologies are of two main kinds. The first approach involves the extraction of a two- or three-dimensional signature of human movement based on angles between various body parts. Such approaches require controlled environments for robust functioning. The more generalized approach uses local interest points (IPs) that record a sparse representation of motion events in the videos. This approach doesn’t require foreground and background separation or tracking. In this approach, the IPs cannot be structured in a spatiotemporal order and are thus represented in a bag-of-words (BOW) container. Most recent approaches to HAR go for a BOW-based feature descriptor with encoded IP trajectories, pyramidal representations and neighborhood characterization, followed by an SVM-based classifier [4,18,38,39]. Another enhancement of the BOW+SVM methodology for HAR is the use of dense interest point trajectories [36] for a higher resolution in detectable visual activity and robustness to the enactment speed of an activity. Descriptors such as HOG [8], HOF [30] and MBH [37] are used to encode IP trajectories. In our approach, we make use of a hybrid descriptor formed by joining the HOG, HOF, MBHx and MBHy vectors, which gives a descriptor vector space of 396 dimensions. To enhance our model with respect to intra-class variability, we use an intersection kernel-based SVM (IKSVM) for classification. We first proposed this approach in [10], where a 396-dimension feature space BOW and an IKSVM-based classifier worked as an efficient scheme. The 396-dimension BOW representation captures system variability in terms of scale, orientation, occlusion and actor habits, while the SVM cuts away unimportant support vectors during classification. We found the storage requirements of our BOW-IKSVM manageable for Nao humanoid robots.

The rest of the article is structured as follows. Section 2 explains the activity recognition system based on BOW+IKSVM that runs on each of the Nao robots. In this section, we also explain the structure of cloud-connected Nao robots and the methodologies for system learning and retraining, along with the two partitioning approaches that we examine in our experiments. Section 3 explains the experiments and the results obtained when comparing the probabilistic and performance-based partitioning approaches. This is followed by the conclusion and future perspectives.


Activity Recognition System

In the current work, we present a vision-based activity recognition system for multiple humanoid robots which are connected to a centralized cloud server. Figure 1 and Fig. 4 show the anatomy of the system of cloud-connected Nao robots. The system we present has several mobile humanoid robots running vision-based activity recognition models. These robots interact with several human participants one at a time. Since the robots in this system have limited computing and memory resources, they depend on an external mechanism



Fig. 4. Anatomy of multi-robot activity recognition server.

for training and retraining of their vision-based activity recognition models. A cloud server provides the initial recognition model to all member robots. This initial recognition model is trained using an existing sample of training and cross-validation videos. We will explain the cloud-based training/re-training mechanism later in this section. Our system has $N$ robots and an initial set of $E$ activity classes with class labels $1, 2, \ldots, E$. At time $t = 0$ an initial set of videos is split in the ratio $n_{TR} : n_{CV} : n_{TS}$ to form the training, cross-validation and test video sets $[TR^0, CV^0, TS^0]$. This initial set of videos is used to generate a recognition model M10 and, optionally, robot-specific recognition models¹ $m_1^0, m_2^0, \ldots, m_N^0$ (Fig. 1 and Fig. 4). For the current article, however, all the robots use an identical recognition model.

2.1 Machine Learning Methodology

¹ A mechanism exists to generate a robot-specific recognition model that is tailor-made with a stronger emphasis on videos recorded by that particular robot in its environment.

Space-time interest points (STIPs) [17] are generated by applying the Harris corner detector to the time-stack of gray video frames. Here the Harris corner detector is applied in three dimensions (two dimensions for the gray image and one for time). Each IP is then the seed for generating surrounding gradient descriptors such as HOG and HOF. These descriptors are used to characterize the IPs, which are represented in a BOW container. The BOWs are then classified using a static classifier such as an SVM [18]. Most current state-of-the-art feature extraction algorithms for videos are based on the above scheme. In these schemes, sparse IPs for motions with a low optical signature may lead to poor classification. The use of a local neighborhood in IP characterization doesn’t capture long-range motion dependencies. The problem was solved by the introduction of trajectory-based activity recognition [36], where dense sampling is combined with IP tracking. IPs are described by the volume surrounding trajectories and not by areas surrounding IPs.

We used the dense trajectories [36] to accommodate the different speeds at which actors perform the activities. Within each of the eight scales, feature points are tracked to generate trajectories using a median filtering kernel on a dense optical flow [12]. We describe the trajectories using HOG, HOF and motion boundary histograms (MBH) along two axes (MBHx, MBHy), computed on the volume surrounding each trajectory in eight spatial scales (Fig. 5). HOG, MBHx and MBHy are each encoded using a 96-dimensional vector while a 108-dimensional vector is used to encode HOF. We combine the four vectors to form a 396-dimensional (96 + 96 + 96 + 108) IP feature descriptor. The descriptors are then converted into a D-dimensional BOW histogram. Our system is observed to show no trajectories for a static background or in video sequences with no moving person. This makes the system ideal for ADL applications. For classification, an SVM kernel-based machine learning model is most appealing for video-based human activity detection. We have used an intersection kernel approach to SVM (IKSVM) [21]. The intersection kernel improves the recognition model as it preserves inter-class similarities in videos with diverse environmental variability.
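The descriptor pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the hard nearest-code-word assignment with Euclidean distance, the L1 normalisation and the pre-computed code-book are all assumptions made for illustration. Only the descriptor sizes (96, 108, 96, 96) and the use of an intersection kernel come from the text.

```python
import math

# Sizes stated in the text: HOG, MBHx and MBHy have 96 values each, HOF has 108.
HOG_DIM, HOF_DIM, MBH_DIM = 96, 108, 96


def joint_descriptor(hog, hof, mbhx, mbhy):
    """Concatenate the four per-trajectory descriptors into one 396-d vector."""
    assert len(hog) == HOG_DIM and len(hof) == HOF_DIM
    assert len(mbhx) == MBH_DIM and len(mbhy) == MBH_DIM
    return list(hog) + list(hof) + list(mbhx) + list(mbhy)


def bow_histogram(descriptors, codebook):
    """Assumed quantisation step: assign each descriptor to its nearest
    code-word (Euclidean distance) and return an L1-normalised histogram
    with one bin per code-word (D bins in the paper's notation)."""
    counts = [0.0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda k: math.dist(d, codebook[k]))
        counts[nearest] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def intersection_kernel(x, y):
    """Histogram intersection kernel: the sum of element-wise minima."""
    return sum(min(a, b) for a, b in zip(x, y))
```

In this sketch, two videos whose BOW histograms are identical receive a kernel value of 1, and the value decreases as the histograms share less mass.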

2.2 Machine Retraining Methodology

At any time $t$, the server fetches all new videos and actor-labels from the $N$ robots. The set of all new videos $S^t$ is organized into the set of correctly classified videos $O^t$ and the set of incorrectly classified videos $X^t$. The sets $LS^t$, $LO^t$ and $LX^t$ are natural number sets containing the actor-labels for the videos in the sets $S^t$, $O^t$ and $X^t$ respectively. The server computes two $1 \times E$ vectors, $h^t$ (count histogram) and $i^t$ (inaccuracy histogram), which store the class-wise frequencies of the $E$ activity classes in the sets $LS^t$ and $LX^t$, respectively. The inaccuracy histogram is then normalized to generate the inaccuracy contribution vector $q^t$.

$h^t(l) \equiv$ number of new videos of the $l$th activity at time $t$

$i^t(l) \equiv$ number of inaccurate recognitions of the $l$th activity at time $t$

$$q^t = \frac{i^t}{i^t \cdot I^T} \qquad (1)$$

$$a^t = \frac{q^t \cdot h^t}{E} \qquad (2)$$

$$[TR^t, CV^t, TS^t] \equiv Database^t \qquad (3)$$

$$[TR^t, CV^t, TS^t] = [TR^{t-1}, CV^{t-1}, TS^{t-1}] + [tr^t, cv^t, ts^t] \qquad (4)$$

$$Database^t = Database^{t-1} + [tr^t, cv^t, ts^t] \qquad (5)$$
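As a hedged illustration of the quantities $h^t$, $i^t$, $q^t$ and $a^t$, the sketch below computes them from plain label lists. It assumes that (1) normalises the inaccuracy histogram by its sum and that (2) averages the class-wise products $q^t(l) \cdot h^t(l)$ over the $E$ classes. The function name and the handling of the zero-error case are hypothetical choices, not taken from the paper.

```python
def inaccuracy_statistics(new_labels, misclassified_labels, num_classes):
    """Return (h, i, q, a) for one retraining step.

    new_labels:            true class labels (1..E) of all new videos (LS^t)
    misclassified_labels:  true class labels of the wrongly recognised
                           videos (LX^t)
    num_classes:           E
    """
    h = [0] * num_classes  # count histogram h^t
    i = [0] * num_classes  # inaccuracy histogram i^t
    for label in new_labels:
        h[label - 1] += 1
    for label in misclassified_labels:
        i[label - 1] += 1

    total_errors = sum(i)  # plays the role of i^t . I^T
    # Assumption: with no errors at all, every contribution is set to zero.
    q = [c / total_errors for c in i] if total_errors else [0.0] * num_classes

    # Average inaccuracy expectation a^t (a scalar).
    a = sum(qk * hk for qk, hk in zip(q, h)) / num_classes
    return h, i, q, a
```

For example, with four classes, new labels `[1, 1, 2, 2, 2, 3, 4, 4]` and misclassified labels `[2, 2, 4, 4]`, classes 2 and 4 each contribute half of the errors.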



In (1), $I$ is the identity vector of size $E$ and $q^t(l)$ represents the fractional contribution that the $l$th activity makes to the inaccurate recognitions at time $t$. Following this, the average inaccuracy expectation, a scalar quantity $a^t$ that applies to any video from the set $S^t$, is calculated (2). All videos in the set $S^t$ are split into three disjoint sets $[tr^t, cv^t, ts^t]$, which are incorporated with the existing videos to form the new video database (3, 4, 5). The latest video database at time $t$ is used for retraining of the recognition model when a retraining decision² is made at time $t$ (Fig. 1 and Fig. 4). In the probabilistic ratio split (P-RS) we use a probabilistic splitting mechanism for assigning individual videos in the set $S^t$ to one of the three sets $[tr^t, cv^t, ts^t]$, such that in the long run the videos are assigned to the three groups in the given ratio. In addition to P-RS, the current article examines an improvement on P-RS termed the probabilistic contribution split (P-CS), which is based on the performance³ of the recognition model at any time. In P-CS, videos which contribute no less than the average inaccuracy expectation ($a^t$) are forced away from the test set $ts^t$. The guiding principle behind P-CS is that classes which have a higher contribution to recognition inaccuracy should be adequately represented in the training and cross-validation sets. Over time, classes with adequate representation will improve their recognition accuracy and will be replaced by other classes which presently have below-average inaccuracy.

ProbabilisticSplitting(video, ratio $A:B:C$)

$v$ = video, $r$ = randomly selected $\in \{1, 2, \ldots, A + B + C\}$

$$\begin{cases} \text{Assign } v \to tr^t & 1 \le r \le A \\ \text{Assign } v \to cv^t & A < r \le A + B \\ \text{Assign } v \to ts^t & A + B < r \le A + B + C \end{cases}$$

Probabilistic Ratio Split (P-RS)

$\forall v \in S^t \mid \text{ProbabilisticSplitting}(v, n_{TR} : n_{CV} : n_{TS})$

Probabilistic Contribution Split (P-CS)

$\forall v \in O^t \mid \text{ProbabilisticSplitting}(v, n_{TR} : n_{CV} : n_{TS})$


² The server records meta-data on model performance across history (current/past/cumulative), class (specific/cumulative), group (Database/TR/CV/TS) and many other parameters such as time since last retraining, run-time addition or deletion of an activity class and more. Depending on the objective of an experiment, one or a combination of these is used as a trigger for retraining. For the experiment presented in the current article, retraining was triggered whenever the current F score (TS-group, class-cumulative) dropped below the cumulative F score (TS-group, class-cumulative), producing a simplistic class-neutral mechanism.

³ Class-wise recognition inaccuracies of the existing model are used for performance-based partitioning. This is not to be confused with the performance measure used for triggering the retraining mechanism. Partitioning is part of the retraining mechanism and is not the mechanism that triggers retraining.
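The class-neutral trigger described in the footnote can be sketched as a single comparison per time step. The helper below is hypothetical: it assumes the current and cumulative test-group F scores have already been computed elsewhere and are supplied as two equally long sequences, since the paper does not specify how the cumulative score is aggregated.

```python
def retraining_times(current_f_scores, cumulative_f_scores):
    """Return the time steps at which retraining would be triggered,
    i.e. where the current F score is strictly below the cumulative one."""
    return [
        t
        for t, (current, cumulative) in enumerate(
            zip(current_f_scores, cumulative_f_scores)
        )
        if current < cumulative
    ]
```

A tie between the two scores does not trigger retraining in this sketch, which matches the wording "drops below".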


Fig. 5. Sampled IPs and trajectories in description by dense trajectories.

Fig. 6. Nao Robot.

Fig. 7. Italian Gesture.

Fig. 8. An actor performing activity “Pick up the phone”.




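Fig. 9. 14 actors performing the activity “Robot move my right” at different locations, lighting conditions, orientation, speeds and scale.

P-CS rule for the incorrectly classified videos:

$\forall v \in X^t, \exists v_l \in LX^t, m = q^t(v_l) \cdot h^t(v_l) \mid$

$$\begin{cases} \text{ProbabilisticSplitting}(v, n_{TR} : n_{CV} : n_{TS}) & m < a^t \\ \text{ProbabilisticSplitting}(v, n_{TR} : n_{CV} : 0) & m \ge a^t \end{cases}$$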
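The two partitioning rules can be summarised in a short sketch. This is an illustrative reading of the pseudocode rather than the authors' code: the function names, the string group names `"tr"`, `"cv"` and `"ts"`, the dictionary-based bookkeeping and the injectable random generator are all assumptions. The vectors `q` and `h` and the scalar `a` correspond to $q^t$, $h^t$ and $a^t$.

```python
import random


def probabilistic_splitting(ratio, rng=random):
    """Draw r uniformly from {1, ..., A+B+C} and map it to a group."""
    a, b, c = ratio
    r = rng.randint(1, a + b + c)  # both end points are included
    if r <= a:
        return "tr"
    if r <= a + b:
        return "cv"
    return "ts"


def p_rs(videos, ratio, rng=random):
    """Probabilistic ratio split: every new video uses the same ratio."""
    return {v: probabilistic_splitting(ratio, rng) for v in videos}


def p_cs(correct, incorrect, labels, q, h, a, ratio, rng=random):
    """Probabilistic contribution split.

    correct / incorrect: video identifiers in O^t and X^t
    labels:              true class label (1..E) for each incorrect video
    """
    n_tr, n_cv, _ = ratio
    assignment = {v: probabilistic_splitting(ratio, rng) for v in correct}
    for v in incorrect:
        label = labels[v]
        m = q[label - 1] * h[label - 1]
        # Classes contributing at least the average are kept out of the test set.
        chosen = ratio if m < a else (n_tr, n_cv, 0)
        assignment[v] = probabilistic_splitting(chosen, rng)
    return assignment
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is convenient when the same video sequence has to be fed to both splitting schemes.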




Experiment Settings

We used Nao⁴ (pronounced “now”), a humanoid robot (Fig. 6) developed by SoftBank Robotics Corp (formerly known as Aldebaran Robotics), Paris. We have considered 22 initial⁵ classes of ADLs (Figs. 7, 8, 9). Out of the 22 activities⁶, 10 hold a large optical signature and the other 12 activities are gestures with a low optical signature (e.g. Italian Gesture, Fig. 7). All activities are conducted within the robot’s view but only 6 of them are enacted looking directly towards the robot. The performances have a varying perspective, scale, pace, and occlusion from the robot’s viewpoint. The actors are only given simple texts to indicate the activity (e.g. “Wave hand”) and no other description or scheme that may influence their distinct performance style. For example, in the case of the activity “Wave Hand”, each of the 14 actors can pick their hand (right/left/both) and palm-shape to wave at any height, sweeping any angle, at any speed, for any length of time and as many times as he/she wants. The videos were collected in different locations with varying illumination and objects (props used by actors). There were 14 actors, with each actor recording a minimum of 5 videos for each of the 22 activities. For the current experiments, the BOW histogram dimension (D) was taken as 330, giving us approximately 15 code-words for each of the 22 classes. There are 1540 (14 × 22 × 5) videos in our database. Since the Nao robot has limited computational and memory capabilities, the video used for recognition⁷ is a down-sampled version of the recorded video⁸. From each recorded video, densely sampled IPs were tracked over a window of 15 (out of 24) frames to focus on the most significant elements of the activity and to match the available computational resources. The recognition by the Nao robot is performed in around 14 s (around 0.2 s under the simulated envelope). For the current article, we have raised the difficulty by using a single recognition model for the entire set of 22 activities. An easier approach that we may try in future trials would be to use separate recognition models for activities and gestures (using videos cropped on the actual action). Given the challenge, the results presented ahead are very satisfactory and highlight the robustness of the presented methodology.

⁴ Nao robot (specifications): 25 DoF, 2 face-mounted cameras (960p HD, maximum 1280 × 960 resolution at 30 fps) pointing front and floor, animated LEDs, speakers, 4 microphones, voice recognition capabilities on a predefined dictionary, capability to identify human faces, infrared, pressure and tactile sensors, wireless LAN, battery-operated, Intel Atom 1.6 GHz processor, Linux kernel.

⁵ In experiments other than the ones presented in this article, the “initial” number of classes may be a subset of the total 22 classes, with the remaining classes introduced as previously unknown activities for the recognition model.

⁶ 22 ADLs: Walk [×4] (Right ⇒)(Left ⇐)(Towards)(Away); Open door [×2] ( )( ); Close door [×2] ( )( ); Sit and Stand; Human enacts gestures [×6] (Clap hands)(Pick up the phone)(Pick up the glass to drink)(Thumbs up)(Wave hands)(Italian gesture); Human enacts gestures looking towards robot [×6] (Come closer)(Go away)(Stand up)(Sit down)(Move towards my left)(Move towards my right).
⁷ Video used by the robot for recognition (Signal): Camera: one (front facing); Resolution: 160 × 120 × 1 (gray); Time: 2 s; 24 frames (2 s); Frame-rate: 12; Scales: 4-scale dense sampling. Outside the scope of this article, we use other signals as well.

⁸ Video recorded by the robot (Record): Camera: one (front facing); Resolution: 1280 × 960 × 3 (RGB); Time: 5 s (2 s of activity, 1.5 s of pre-activity and 1.5 s of post-activity recording); Frame-rate: 12; Scales: 8-scale dense sampling. We have stored all original videos in a secondary database for future use when robots with better computational and memory capabilities become accessible.

Table 1. Classification results (F scores, using P-RS)
Activities Gestures Joined
Closing train group
Closing cross-validation group 0.70
Closing test group

For experiments, robots at times were simulated by a software envelope (using pre-recorded videos as camera input) to repeat the experiments on existing videos and to conduct several order-independent test runs on modified sequences of videos. We evaluate and record the performance of the recognition model at every stage of the experiment run. We have used the F score⁹ [32] as a measure of model performance. To get a more regular performance evolution trend, we consider the average F score from several order-independent test runs (Fig. 10).
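For reference, the F score is the harmonic mean of precision and recall. The sketch below computes it from raw counts for a single class; the function name and the convention of returning 0 when the score is undefined are assumptions, and the paper does not state how per-class scores are aggregated over the 22 classes.

```python
def f_score(true_positives, false_positives, false_negatives):
    """F1 score from raw counts; returns 0.0 when it is undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
```

With 8 true positives, 2 false positives and 2 false negatives, precision and recall are both 0.8, so the F score is also 0.8 up to floating-point rounding.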

Fig. 10. Several F Score evolution plots (blue) and averaged F Score evolution (yellow) for Test group using P-RS.

Fig. 11. Averaged F Score evolution plots from several order-independent test runs using P-RS.


⁹ F score, also known as F1 score or F measure, is a measure of accuracy for classification results that considers both precision and recall. It is the harmonic mean of precision and recall, i.e. $F = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.



Fig. 12. Averaged confusion matrices from several order-independent test runs using P-RS.




Results: Probabilistic Ratio Split (P-RS)

Figure 10 presents the test group ($TS^t$) F score evolution plots for several order-independent test runs. The splitting mechanism in use for the results shown in Figs. 10 to 12 is the probabilistic ratio split (P-RS), explained earlier in Section 2 of this article. The yellow plot in Fig. 10 is the F score evolution averaged over all order-independent test runs. Figure 11 compares the averaged training, cross-validation and test group F score evolution plots. We can see that both the cross-validation and test group plots converge to around the same value, 0.44–0.45. Table 1 contains a compilation of classification results and F scores measured for the activity and gesture groups individually. These results suggest a clear separation of the 22 recognition classes into activities (Class 1–10) and gestures (Class 11–22) as described in Section 3.1. This distinction is even more evident in the confusion matrix representation of the results (Fig. 12).

Table 2. Comparison results (F scores)

                          P-RS   P-CS   Improvement
Closing new test group    0.44   0.56   27.98%
Closing history group                   6.43%





Improvements: Probabilistic Contribution Split (P-CS)

In Section 2.2 we described the working mechanisms of the two partitioning approaches that we examine in the current article. To examine their execution, we conducted several rounds of order-independent test runs where at every step the same sequence of videos was fed to two groups of simulated Nao robots. The two groups of robots implemented the P-RS and P-CS partitioning approaches. For their comparison, we recorded the F score evolution for the new test group ($ts^t$) and the history group ($TR^t \cup CV^t \cup TS^t$). A video can influence the recognition model at time $t$ in two distinct ways. The video can directly influence the recognition model by being part of the training or cross-validation set at time $t - 1$. Alternatively, the video can indirectly influence the recognition model by contributing to the reservoir of metadata $meta^{t-1}$ that triggered retraining of the model at time $t$ (see Fig. 4). By comparing the new test group ($ts^t$), we assess the performance on the latest videos, which have no direct or indirect influence on the recognition model (see Fig. 13). The history group ($TR^t \cup CV^t \cup TS^t$) comparison serves as a comprehensive evaluation of the recognition model at time $t$ (see Fig. 14). Table 2 contains a compilation of comparison results and closing F scores measured for the P-RS and P-CS splitting mechanisms. These results suggest a clear improvement of the P-CS mechanism over the P-RS mechanism. There is an improvement of 6.43% in F score for the history group, which implies a general improvement in the recognition model. A remarkable 27.98% improvement in F score for the new test group implies robustness of P-CS in recognizing videos with no direct or indirect influence over the recognition model. It is also interesting to note that P-CS triggers retraining of the recognition model more often than P-RS. Line markers in the plots (Figs. 13, 14) signify the number of times retraining occurred at a particular time $t$ over several long-term tests. Since there is an overall improvement in the recognition model, it is reasonable to assume that P-CS improves the effectiveness of the triggering mechanism as well.

Fig. 13. Averaged Test group F score evolution plot.

Fig. 14. Averaged History group F score evolution plot.




Conclusion and Perspectives

Current video-based activity recognition models face severe sensory and computational constraints when implemented on a simplistic robot like Nao. The experiment environment, with different locations, lighting conditions, performers and props, brings a great degree of variability to the recognition problem. The difficulty is further amplified as relative orientation, scale and speeds are left unchecked. The performers are in no way instructed to perform the activities in a predefined manner. Given all these constraints and the liberties given to the performers, the performance of our recognition model is very satisfactory. Improvements could be made by building two separate recognition models for low and high movement activities. Further enhancements could be made by using a better signal video for recognition, as the robot is capable of recording a much higher resolution video (1280 × 960 × 3 × 30 fps), which was down-sampled in time and space to suit the Nao robot’s computational capabilities and available working memory. The number of scales in the dense sampling algorithm can also be increased to 8. As the computational and memory capabilities of Nao and Nao-like robots increase, the methodology proposed in the current article can easily be up-scaled to give a better performance. It is important to remember that the objective of our work was to implement the video-based human activity recognition system on a simplistic humanoid robot. We believe that this objective has been achieved satisfactorily. For the second part of our work, the comparison of retraining methodologies shows that the proposed performance-based partitioning approach yields significant improvements over a simplistic probabilistic-ratio-based partitioning approach. Further analysis is required to understand why P-CS triggers retraining more often than P-RS under the current triggering mechanism. Since it takes 14 s for the robot to recognize the action in the current settings, an obvious approach to try would be to perform the recognition on the powerful cloud server using the full recorded video. By using the robot-specific recognition models explained in Section 2, we can better cater to the environmental and cognitive needs specific to the human subjects the robot interacts with. In the current article, we have presented only a small set of experiments for identifying the best-performing techniques for performance-based retraining for small robots. In future articles, we plan to touch upon other aspects that we have investigated.

Acknowledgment. The authors would like to thank CARNOT MINES-TSN for funding this work through the ‘Robot apprenant’ project. We are thankful to the Service Robotics Research Center at Technische Hochschule Ulm (SeRoNet project) for supporting the consolidation period of this article.



References

1. Apple.
2. Begum, M., et al.: Performance of daily activities by older adults with dementia: the role of an assistive robot. In: 2013 IEEE 13th International Conference on Rehabilitation Robotics (ICORR), pp. 1–8 (2013). 6650405
3. Bertsch, F.A., Hafner, V.V.: Real-time dynamic visual gesture recognition in human-robot interaction. In: 2009 9th IEEE-RAS International Conference on Humanoid Robots, pp. 447–453 (2009). 5379541
4. Bilinski, P., Bremond, F.: Contextual statistics of space-time ordered features for human action recognition. In: 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, pp. 228–233 (2012)
5. Boucenna, S., et al.: Learning of social signatures through imitation game between a robot and a human partner. IEEE Trans. Auton. Mental Dev. 6(3), 213–225 (2014). ISSN 1943-0604
6. Chen, T.L., et al.: Robots for humanity: using assistive robotics to empower people with disabilities. IEEE Robot. Autom. Mag. 20(1), 30–39 (2013). 10.1109/MRA.2012.2229950. ISSN 1070-9932
7. Cho, K., Chen, X.: Classifying and visualizing motion capture sequences using deep neural networks. In: VISAPP 2014 - Proceedings of the 9th International Conference on Computer Vision Theory and Applications, vol. 2, June 2013
8. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 886–893 (2005)
9. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. pp. 1110–1118, June 2015. CVPR.2015.7298714
10. El-Yacoubi, M.A., et al.: Vision-based recognition of activities by a humanoid robot. Int. J. Adv. Robot. Syst. 12(12), 179 (2015).
11. Falco, P., et al.: Representing human motion with FADE and U-FADE: an efficient frequency-domain approach. In: Autonomous Robots, March 2018
12. Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Bigun, J., Gustavsson, T. (eds.) Image Analysis: 13th Scandinavian Conference, SCIA 2003 Halmstad, Sweden, 29 June–2 July 2003 Proceedings, pp. 363–370. Springer, Heidelberg (2003). ISBN: 978-3-540-45103-7
13. Ho, Y., et al.: A hand gesture recognition system based on GMM method for human-robot interface. In: 2013 Second International Conference on Robot, Vision and Signal Processing, pp. 291–294 (2013).
14. Kotseruba, I., Tsotsos, J.K.: 40 years of cognitive architectures: core cognitive abilities and practical applications. In: Artificial Intelligence Review (2018). https:// ISSN 1573-7462
15. Kragic, D., et al.: Interactive, collaborative robots: challenges and opportunities. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), pp. 18–25. AAAI Press, Stockholm (2018). http://dl.acm.org/citation.cfm?id=3304415.3304419. ISBN 978-0-9992411-2-7
16. Kruger, V., et al.: Learning actions from observations. IEEE Robot. Autom. Mag. 17(2), 30–43 (2010). ISSN 1070-9932
17. Laptev, I.: On space-time interest points. Int. J. Comput. Vis. 64(2), 107–123 (2005). ISSN 1573-1405



18. Laptev, I., et al.: Learning realistic human actions from movies, June 2008. https://
19. Lee, D., Soloperto, R., Saveriano, M.: Bidirectional invariant representation of rigid body motions and its application to gesture recognition and reproduction. Auton. Robots 42, 1–21 (2017).
20. Liu, J., et al.: Spatio-temporal LSTM with trust gates for 3D human action recognition. vol. 9907, October 2016. 50
21. Maji, S., Berg, A.C., Malik, J.: Classification using intersection kernel support vector machines is efficient. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008).
22. Margheri, L.: Dialogs on robotics horizons [student’s corner]. IEEE Robot. Autom. Mag. 21(1), 74–76 (2014). ISSN 1070-9932
23. Microsoft.
24. Microsoft.
25. Myagmarbayar, N., et al.: Human body contour data based activity recognition. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 5634–5637 (2013). EMBC.2013.6610828
26. NYTimes. rt-Lexicon-of-Italian-Gestures.html? r=0
27. Ofli, F., et al.: Sequence of the most informative joints (SMIJ): a new representation for human skeletal action recognition. vol. 25, pp. 8–13, June 2012. https://doi.org/10.1109/CVPRW.2012.6239231
28. Okamoto, T., et al.: Toward a dancing robot with listening capability: keypose-based integration of lower-, middle-, and upper-body motions for varying music tempos. IEEE Trans. Robot. 30(3), 771–778 (2014). 2014.2300212. ISSN 1552-3098
29. Olatunji, I.E.: Human activity recognition for mobile robot. In: CoRR abs/1801.07633 arXiv: 1801.07633 (2018).
30. Pers, J., et al.: Histograms of optical flow for efficient representation of body motion. Pattern Recog. Lett. 31, 1369–1376 (2010). 03.024
31. Santos, L., Khoshhal, K., Dias, J.: Trajectory-based human action segmentation. Pattern Recogn. 48(2), 568–579 (2015). 015. ISSN 0031-3203
32. Sasaki, Y.: The truth of the F-measure. In: Teach Tutor Mater, January 2007
33. Saveriano, M., Lee, D.: Invariant representation for user independent motion recognition. In: 2013 IEEE RO-MAN, pp. 650–655 (2013). ROMAN.2013.6628422
34. Schenck, C., et al.: Which object fits best? solving matrix completion tasks with a humanoid robot. IEEE Trans. Auton. Mental Dev. 6(3), 226–240 (2014). https:// ISSN 1943-0604
35. Nandi, G.C., Siddharth, S., Akash, A.: Human-robot communication through visual game and gesture learning. In: International Advance Computing Conference (IACC), vol. 2, pp. 1395–1402 (2013). 2005.28
36. Wang, H., et al.: Action recognition by dense trajectories. In: CVPR 2011, pp. 3169–3176 (2011).



37. Wang, H., et al.: Dense trajectories and motion boundary descriptors for action recognition. Int. J. Comput. Vis. 103(1), 60–79 (2013). s11263-012-0594-8.
38. Yuan, F., et al.: Mid-level features and spatio-temporal context for activity recognition. Pattern Recogn. 45(12), 4182–4191 (2012). j.patcog.2012.05.001. 312002129. ISSN 0031-3203
39. Zhen, X., Shao, L.: Spatio-temporal steerable pyramid for human action recognition. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6 (2013). 2013.6553732
40. Zhu, W., et al.: Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In: CoRR abs/1603.07772. arXiv: 1603.07772 (2016).

A Reasoning Based Model for Anomaly Detection in the Smart City Domain

Patrick Hammer1(B), Tony Lofthouse2, Enzo Fenoglio3, Hugo Latapie3, and Pei Wang1

1 Department of Computer and Information Sciences, College of Science and Technology, Temple University, Philadelphia, PA 19122, USA
{patrick.hammer,}
2 Reasoning Systems Ltd., London, UK
[email protected]
3 Cisco Systems Inc., San Jose, USA
{efenogli,hlatapie}

Abstract. Using a visual scene object tracker and a Non-Axiomatic Reasoning System we demonstrate how to predict and detect various anomaly classes. The approach combines an object tracker with a base ontology and the OpenNARS reasoner to learn to classify scene regions based on accumulating evidence from typical entity class (tracked object) behaviours. The system can autonomously satisfy goals related to anomaly detection and respond to user Q&A in real time. The system learns directly from experience with no initial training required (one-shot). The solution is a fusion of neural techniques (object tracker) and a reasoning system (with ontology).

Keywords: Smart City · Non-Axiomatic Logic · Object Tracking · Visual reasoning · Anomaly detection

Introduction and Similar Work

Supervised Deep Learning has been highly successful in image classification, object detection and tracking. Integrating principles from neural networks and logical reasoning opens up new possibilities for anomaly prediction and detection. Our approach uses a Non-Axiomatic Reasoning System (NARS) which consumes tracklets provided by a Multi-Class Multi-Object Tracking (MC-MOT) system in real time. The system infers spatial and temporal relations between events, describing their relative position and predicted path. The multi-class multi-object tracker was developed by Cisco Systems Inc., and the reasoning-learning component is the OpenNARS implementation [3] of a Non-Axiomatic Reasoning System [10]. We tested our approach on a publicly available dataset, ‘Street Scene’, which shows a typical scene in the Smart City domain. The dataset is video data obtained by a street cam mounted on a building. We show the system learning to classify scene regions, such as street or sidewalk, from the typical ‘behaviour’ of tracked objects belonging to three classes: car, pedestrian, and bike. The system is shown to autonomously satisfy a range of goals to detect and inform the user of certain anomaly types. An anomaly ontology supports various anomaly classes: location based, relational based, velocity based, or vector based. From this anomaly ontology a range of specific anomalies can be detected: jaywalking, pedestrian in danger, cyclist in danger, traffic entity too fast, traffic stopped, pedestrian loitering, vehicle against flow of traffic, etc. The current implementation supports the detection of ‘street’ and ‘sidewalk’ regions along with pedestrian in danger and jaywalking anomalies. The system learns in real time and is capable of detecting anomalies after a cold start of a few seconds of operation with no prior training. In addition to autonomous goal satisfaction, the system can also respond to user questions in real time.

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 144–159, 2021.

The effectiveness of learning mappings for question answering has been demonstrated by [5], at least for simple domains. However, their approach requires (image, question, answer) triples, essentially requiring questions to be provided at training time. This is not feasible in cases where novel questions are input by an operator at any time and require a real-time response. Also, their model has no time-dependence and no way to make use of background knowledge, while the system we introduce has a notion of time and the ability to use background knowledge, allowing it to identify user-relevant situations and to make predictions about possible future situations. The prediction capability also potentially allows the system to identify anomalies before they occur rather than detecting them after the event, though our anomaly detection so far concentrates on recent input, that is, on the current situation.



Generally, our solution is closer to what is described as the first possibility in [1], namely that the output of a sub-symbolic system (an object tracker in our case) is used as input to the reasoning component (NARS in our case). This differs from the second case proposed in [1] and followed in [2], which integrates the reasoning ability into the sub-symbolic architecture, such as within the hidden layers of a Convolutional Neural Network. We found the former approach easier to realize, as it allowed us to decouple the different requirements of both components. Following this overall idea, the components are the following:

The Multi-class Multi-object Tracker is a main component of the distributed asynchronous real-time architecture, and allowed us to produce scalable applications. The overall system is made up of highly decoupled, single-purpose event processing components that asynchronously receive and process events. In particular, these are a video streamer based on GStreamer pipelines that receives video flows from multi-source streaming protocols (HLS, DASH, RTSP, etc.) and distributes video frames in compressed or raw formats, a multi-object Deep Neural Network (DNN) detector based on YOLOv3 [7], and a multi-class multi-object tracker (MC-MOT) [4]. The outputs of the MC-MOT together with the input video frames are merged and sent in sync to the OpenNARS-based visual reasoner for further processing. Interprocess communication and data exchange within the different components is done using Redis®, an in-memory data structure store configured as LRU cache for reliable online operations.

The tracking problem is about how to recognize individual objects in such a way that they keep their ID while moving continuously in time. We used a tracking-by-detection approach for multiple objects (MOT), which has become increasingly popular in recent years, driven by the progress in object detection and Convolutional Neural Networks (CNN). The system performs predictions with a Kalman filter and linear data association using the Hungarian algorithm, linking multiple object detections of the same class coming from state-of-the-art DNN detectors (e.g. YOLOv3) in a video sequence. The multi-class multi-object tracker extends this concept to objects belonging to multiple classes (MC-MOT) by forking a MOT for each class of interest. In this project we limit our solution to three classes: person, bike, and car. The CNN multi-object detector publishes the bounding boxes of the detected objects, given by the position of the bounding box center (x, y) and its width and height (w, h). Each of the three MOTs subscribes to a specific object class and receives the corresponding object detections. The output of each MOT is represented by a tracklet, i.e., the fragment of the trajectory followed by a moving object. Each tracklet has a class ID, an instance ID, and a sequence of the previous five bounding box detections (t1, x1, y1, w1, h1), ..., (t5, x5, y5, w5, h5), where ti is the timestamp of the detection, xi, yi the location in pixel units on the X-axis and Y-axis, and wi, hi the width and height of the bounding box.
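The tracklet format and the per-class routing described above can be illustrated with a minimal Python sketch. It is not the tracker used in this work: the names `Tracklet`, `MultiClassTracker` and `publish` are hypothetical, and data association (Kalman prediction plus Hungarian matching) is left out, so instance IDs are assumed to be already known for each detection.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Tuple

# One detection: (timestamp, x, y, w, h), with (x, y) the box center in pixels.
Detection = Tuple[float, float, float, float, float]


@dataclass
class Tracklet:
    class_id: str
    instance_id: int
    # Only the five most recent detections are kept, as described in the text.
    detections: Deque[Detection] = field(default_factory=lambda: deque(maxlen=5))

    def add(self, detection: Detection) -> None:
        self.detections.append(detection)


class MultiClassTracker:
    """Keeps one tracklet store per class, mimicking one MOT per class of interest."""

    def __init__(self, classes=("person", "bike", "car")):
        self.classes = set(classes)
        self.tracklets: Dict[str, Dict[int, Tracklet]] = defaultdict(dict)

    def publish(self, class_id, instance_id, detection):
        """Route a detection to the store subscribed to its class (assumed ID given)."""
        if class_id not in self.classes:
            return None  # no MOT subscribed to this class
        per_class = self.tracklets[class_id]
        tracklet = per_class.setdefault(instance_id, Tracklet(class_id, instance_id))
        tracklet.add(detection)
        return tracklet
```

In this sketch a tracklet silently drops its oldest detection once a sixth one arrives, which matches the fixed window of five detections mentioned above.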
The detection mAP is on average 80%, while it takes 30 ms for object detection on an NVIDIA® Tesla® P100 board and 15 ms for tracking. The object detection is based on YOLOv3 with network image size 768 × 768 for easy detection of small objects. It is worth noting that the model was re-trained on the Street Scene dataset (see [6]) but then also tested on similar scenarios, using available webcams (like the ones our solution is deployed on), to ensure wider generalization abilities. This performance is acceptable for typical real-time smart city applications, since we can reliably send tracklets and frames at 15 fps to the OpenNARS reasoner. In fact, detection and tracking performance can be controlled by the reasoner for a better street scene understanding, thanks to the richer symbolic description of object behavior and relationships given by the ontology. As an example, broken tracklets caused by occlusions or misdetections were merged by the reasoner and re-identified.

OpenNARS (see [3]) is a Java implementation of a Non-Axiomatic Reasoning System (NARS). NARS is a general-purpose reasoning system that works under the Assumption of Insufficient Knowledge and Resources (AIKR). As described in [9], this means the system works in real time, is always open to new input, and operates with a constant information processing ability and storage space. An important part is the Non-Axiomatic Logic (see [10] and [11]), which allows the system to deal with uncertainty. To our knowledge our solution is the first to apply NARS to a real-time visual reasoning task.

Memory and Control. To decide which premises should be selected for inference, a control mechanism is necessary. OpenNARS’s control mechanism realizes a form of Attentional Control using a Bag data structure. This is a data structure where the elements are sorted according to their priority, and the sampling operation chooses candidates with a selection chance proportional to their priority. This makes the control strategy similar to the Parallel Terraced Scan (see [8]), as it also allows many possible options to be explored in parallel, with more computation devoted to options which are identified as being more promising. Note that this is different from a priority queue, which just selects the highest priority option. After the selection, a candidate is returned to the bag, with a decrease in priority proportional to the durability value. Knowledge items entering the bag have their priority modulated by their truth value and occurrence time. This is the case for both input events and results generated by the inference process. All combinations of premises happen through sampling from a bag. Through this, recent items and items of high truth expectation (to be introduced in the next section) tend to be used more often, but those related to questions and goals also get a priority boost. In total, this can be seen as a form of Attentional Control, which chooses premises neither fully randomly nor exhaustively, but biased by the relevance of patterns and the current time. More details of the memory and control mechanism can be found in [3], but the following knowledge representation is especially important to understand the encodings used in the next sections.

Knowledge Representation¹. As a reasoning system, NARS uses a formal language called “Narsese” for knowledge representation, which is defined by a formal grammar given in [10].
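The Bag described under Memory and Control can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OpenNARS implementation (which is written in Java): the class name `Bag`, the eviction of the lowest-priority item on overflow, and the multiplicative decay `priority * durability` are choices made here for brevity.

```python
import random


class Bag:
    """Bounded store whose items are sampled with chance proportional to priority."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.items = {}  # key -> priority, assumed to lie in (0, 1]
        self.rng = rng or random.Random()

    def put(self, key, priority):
        self.items[key] = priority
        if len(self.items) > self.capacity:
            # Constant storage space: evict the lowest-priority item (assumption).
            lowest = min(self.items, key=self.items.get)
            del self.items[lowest]

    def sample(self):
        """Return a key, chosen with probability proportional to its priority."""
        keys = list(self.items)
        weights = [self.items[k] for k in keys]
        return self.rng.choices(keys, weights=weights, k=1)[0]

    def take_and_return(self, durability=0.8):
        """Sample an item and put it back with a decayed priority (assumed decay rule)."""
        key = self.sample()
        self.items[key] *= durability
        return key
```

Unlike a priority queue, `sample` can return a low-priority item, only less often, which is the property that lets many options be explored in parallel.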
To fully specify and explain this language is beyond the scope of this article, so in the following only the directly relevant part is introduced informally and described briefly. The logic used in NARS belongs to a tradition of logic called “term logic”, where the smallest component of the representation language is a “term”, and the simplest statement has a “subject-copula-predicate” format, where the subject and the predicate are both terms. The basic form of statement in Narsese is the inheritance statement, which has the format “S → P”, where S is the subject term, P is the predicate term, and “→” is the inheritance copula, which is defined as a reflexive and transitive relation from one term to another term. The intuitive meaning of “S → P” is “S is a special case of P” and “P is a general case of S”. For example, the statement “robin → bird” intuitively means “Robin is a type of bird”. We define the extension of a given term T to contain all of its known special cases, and its intension to contain all of its known general cases. Therefore, “S → P” is equivalent to “S is included in the extension of P”, and “P is included in the intension of S”.

¹ This subsection was adapted from [12] to make the paper self-contained.

The simplest, or “atomic”, form of a term is a word, that is, a string of characters from a finite alphabet. In this article, typical terms are common English nouns like bird and animal, or mixtures of English letters, digits 0 to 9, and a few special signs, such as hyphen (‘-’) and underscore (‘_’), but the system can also use other alphabets, or use terms that are meaningless to human beings, such as “drib” and “aminal”. Besides atomic terms, Narsese also includes compound terms of various types. A compound term (con, C1, C2, ..., Cn) is formed by a term connector, con, and one or more component terms (C1, C2, ..., Cn). The term connector is a logical constant with pre-defined meaning in the system. Major types of compound terms in Narsese include:
– Sets: Term {Tom, Jerry} is an extensional set specified by enumerating its instances; term [small, yellow] is an intensional set specified by enumerating its properties.
– Intersections and differences: Term (bird ∩ swimmer) represents “birds that can swim”; term (bird − swimmer) represents “birds that cannot swim”.
– Products and images: The relation “John is the uncle of Zack” is represented as “({John} × {Zack}) → uncle-of”, “{John} → (uncle-of / ⋄ {Zack})”, and “{Zack} → (uncle-of / {John} ⋄)”, equivalently.² Here, ⋄ is a placeholder which indicates the position in the uncle-of relation the subject term belongs to.
– Statement: “John knows soccer balls are round” can be represented as a higher-order statement “{John} → (know / ⋄ {soccer-ball → [round]})”, where the statement “soccer-ball → [round]” is used as a term.
– Compound statements: Statements can be combined using term connectors for disjunction (‘∨’), conjunction (‘∧’), and negation (‘¬’), which are intuitively similar to those in propositional logic, but not defined using truth-tables.³

Several term connectors can be extended to take more than two component terms, and the connector is often written before the components rather than between them, such as (× {John} {Zack}). Besides the inheritance copula (‘→’, “is a type of”), Narsese also has three other basic copulas: similarity (‘↔’, “is similar to”), implication (‘⇒’, “if-then”), and equivalence (‘⇔’, “if-and-only-if”), and the last two are used between statements.

² This treatment is similar to the set-theoretic definition of “relation” as set of tuples, where it is also possible to define what is related to a given element in the relation as a set. For detailed discussions, see [10].
³ The definitions of disjunction and conjunction in propositional logic do not require the components to be related in content, which leads to various issues under AIKR. In NARS, such a compound is formed only when the components are related semantically, temporally, or spatially. See [10] for details.
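As an illustration of the definitions of extension and intension given above, the following Python sketch stores inheritance statements and computes both sets as reflexive and transitive closures. It is a simplification made for this example: truth values are ignored, statements are treated as binary facts, and the name `InheritanceStore` is hypothetical.

```python
from collections import defaultdict


class InheritanceStore:
    """Stores statements "S → P" and derives extensions and intensions of terms."""

    def __init__(self):
        self.parents = defaultdict(set)   # S -> {P | "S → P" was stated}
        self.children = defaultdict(set)  # P -> {S | "S → P" was stated}

    def add(self, subject, predicate):
        self.parents[subject].add(predicate)
        self.children[predicate].add(subject)

    @staticmethod
    def _closure(term, edges):
        # Reflexive: a term is a special and a general case of itself.
        seen, stack = {term}, [term]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:  # transitive: follow chains of statements
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def intension(self, term):
        """All known general cases of the term."""
        return self._closure(term, self.parents)

    def extension(self, term):
        """All known special cases of the term."""
        return self._closure(term, self.children)
```

With "robin → bird" and "bird → animal" stored, robin is in the extension of animal and animal is in the intension of robin, mirroring the equivalence stated in the text.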



In NARS, an event is a statement with temporal attributes. Based on their occurrence order, two events E1 and E2 may have one of the following basic temporal relations:
– E1 happens before E2
– E1 happens after E2
– E1 happens when E2 happens
More complicated temporal relations can be expressed by expressing the subevents of the events. Temporal statements are formed by combining the above basic temporal relations with the logical relations indicated by the term connectors and copulas. For example, implication statement “E1 ⇒ E2” has three temporal versions, corresponding to the above three temporal orders, respectively:⁴
– E1 /⇒ E2
– E1 \⇒ E2
– E1 |⇒ E2
Conjunction statement “E1 ∧ E2” has two temporal versions, corresponding to two of the above three temporal orders, respectively:
– (E1, E2) (forward)
– (E1; E2) (parallel)

⁴ Here the direction of the arrowhead is the direction of the implication relation, while the direction of the slash is the direction of the temporal order. In principle, copulas like ‘/⇐’ can also be defined, but they will be redundant. For more discussion on this topic, see [10].

All the previous statements can be seen as Narsese describing things or events from a third-person view. Narsese can also describe the actions of the system itself with a special kind of event called operation. An operation is an event directly realizable by the system itself via executing the associated code or command. Formally, an operation is an application of an operator on a list of arguments, written as op(a1, ..., an), where op is the operator, and a1, ..., an is a list of arguments. Such an operation is interpreted logically as statement “(× {SELF} {a1} ... {an}) → op”, where SELF is a special term indicating the system itself, and op is an operator that has a procedural interpretation. For instance, if we want to describe an event “The system is holding key_001”, the statement can be expressed as “(× {SELF} {key_001}) → hold”.

Overall, there are three types of sentences defined in Narsese:
– A judgment is a statement with a truth-value, and represents a piece of new knowledge that the system needs to learn or consider. For example, “robin → bird ⟨f, c⟩”, where the truth-value ⟨f, c⟩ is defined using $(w^+, w^-)$, indicating the positive and negative evidence for/against a statement. Based on $(w^+, w^-)$, the frequency f, which is closely related to frequentist probability (see [10]), can be defined as $f = \frac{w^+}{w^+ + w^-}$ and the confidence c as $c = \frac{w^+ + w^-}{w^+ + w^- + 1}$. Roughly speaking, the frequency measures the probability, and the confidence the size of the sample space seen so far. To combine both into a single measure of “degree of belief” for decision making purposes, the truth expectation measure is used, which is defined as $e(f, c) = c \cdot (f - \frac{1}{2}) + \frac{1}{2}$.
– A question is a statement without a truth-value, and represents a question to be answered according to the system’s beliefs. For example, if the system has a belief “robin → bird” (with a truth-value), it can be used to answer the question “robin → bird?” by reporting the truth-value, as well as to answer the question “robin → ?” by reporting the truth-value together with the term bird, as it is in the intension of robin. Similarly, the same belief can also be used to answer the question “? → bird” by reporting the truth-value together with the term robin.
– A goal is a statement without a truth-value, and represents a statement to be realized by executing some operations, according to the system’s beliefs. For example, “(× {SELF} {door_001}) → open!” means the system has the goal to open door_001 or to make sure that door_001 is opened. Each goal has a “desire-value”, indicating the extent to which the system hopes for a situation where the statement is true.

Prediction by Inference. To understand the reasoner’s ability to predict future events, it is important to see how (E1 /⇒ E2) is formed based on the observation of event E1 with truth value $(f_1, c_1)$ and event E2 with truth value $(f_2, c_2)$, where E2 happened after E1. As described in [10], this is a case of Induction, and the resulting truth value is $(f_1, \frac{f_2 \cdot c_1 \cdot c_2}{f_2 \cdot c_1 \cdot c_2 + 1})$, which can then be revised on each occurrence by calculating the sum of the positive and negative evidence respectively, allowing the hypothesis to get stronger.
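The truth-value definitions and the induction and revision steps described so far can be written down as a small Python sketch. The function names are chosen here for illustration and are not taken from OpenNARS; the inverse mapping `evidence_from_truth` follows from solving the confidence definition for the total evidence and assumes c < 1.

```python
def truth_from_evidence(w_plus, w_minus):
    """Map positive/negative evidence to (frequency, confidence)."""
    w = w_plus + w_minus
    if w <= 0:
        raise ValueError("frequency is undefined without evidence")
    return w_plus / w, w / (w + 1.0)


def evidence_from_truth(f, c):
    """Invert the mapping above: c = w / (w + 1) gives w = c / (1 - c); assumes c < 1."""
    w = c / (1.0 - c)
    return f * w, (1.0 - f) * w


def expectation(f, c):
    """Truth expectation e(f, c) = c * (f - 1/2) + 1/2."""
    return c * (f - 0.5) + 0.5


def induction(t1, t2):
    """Truth value of (E1 /=> E2) from E1 with t1 = (f1, c1) and E2 with t2 = (f2, c2)."""
    (f1, c1), (f2, c2) = t1, t2
    w = f2 * c1 * c2
    return f1, w / (w + 1.0)


def revision(ta, tb):
    """Merge two truth values by summing positive and negative evidence separately."""
    wa_plus, wa_minus = evidence_from_truth(*ta)
    wb_plus, wb_minus = evidence_from_truth(*tb)
    return truth_from_evidence(wa_plus + wb_plus, wa_minus + wb_minus)
```

For instance, revising a hypothesis with a second, equally strong observation of the same pattern keeps the frequency and raises the confidence, which is the sense in which the hypothesis gets stronger.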
Then, based on the potentially revised truth value $(f^*, c^*)$ of (E1 /⇒ E2), and a new occurrence of event E1 with truth value $(f_1^*, c_1^*)$, E2 can be deduced with truth value $(f^* \cdot f_1^*, f^* \cdot f_1^* \cdot c^* \cdot c_1^*)$. If this anticipated event then does not appear in the input, negative evidence is added to (E1 /⇒ E2), in such a way that the frequency value corresponds to the success rate of the hypothesis, while the confidence corresponds to the total of seen cases (including both the positive cases where E1 preceded E2 and the negative cases where it didn’t).

Tracklets to Narsese. To map tracklets to NARS events, the numeric information encoded in each tracklet is discretized. This can happen in many ways. We used a fixed-sized grid that maps every detection tuple to the rectangle it belongs to, as shown in Fig. 1:



Fig. 1. Spatial discretization by a grid

Also for each detection, an Angle term is built based on a discretization of the overall tracklet direction, which can be
– 11 = left up
– 10 = left down
– 01 = right up
– 00 = right down
though a higher resolution discretization would be possible as well. This angle helps the reasoner to take the direction of entities into account when making predictions about their future position, leading to higher prediction accuracy.
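The grid and angle discretization can be sketched as follows. The cell size, the number of grid columns, and the function names are assumptions made for this example and are not values reported in the paper; the sketch also assumes image coordinates in which y grows downward, so that “up” means decreasing y, and it maps a tracklet without horizontal or vertical movement to the “right” and “down” codes.

```python
def rectangle_id(x, y, cell_size=100, columns=20):
    """Map a pixel position to the index of the grid rectangle containing it."""
    col = int(x // cell_size)
    row = int(y // cell_size)
    return row * columns + col


def angle_code(first, last):
    """Two-digit direction code from the first and last position of a tracklet.

    First digit: 1 = left, 0 = right. Second digit: 1 = up, 0 = down.
    """
    (x0, y0), (x1, y1) = first, last
    horizontal = "1" if x1 < x0 else "0"
    vertical = "1" if y1 < y0 else "0"  # y grows downward in image coordinates
    return horizontal + vertical
```

A finer angle resolution would only change `angle_code`, for example by binning the direction returned by `math.atan2` into more than four sectors.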

Additionally, the instance ID and the class are provided; the class can (currently) be Car, Bike or Pedestrian. Using the above, the following events can be built:
– Indicating the class of an instance: {InstanceID} → Class. Currently Class can be Car, Pedestrian or Bike for instances, and street, sidewalk, crosswalk or bikelane for regions.
– Indicating the direction of an instance: ({InstanceID} × {Angle}) → directed
– Indicating the position of an instance: ({InstanceID} × {RectangleID}) → positioned
– To reduce the amount of input events, combinations are also possible: ({InstanceID} × {RectangleID} × {Angle} × {Class}) → Tracklet
– The system learns to assign street, sidewalk, bikelane and crosswalk labels to the scene, based on the car and pedestrian activity. This is achieved through the use of implication statements:
(({#1} → Car); (({#1} × {$2}) → positioned)) /⇒ ({$2} → street).
(({#1} → Pedestrian); (({#1} × {street}) → parallel); (({#1} × {$2}) → positioned)) /⇒ ({$2} → sidewalk).
(({#1} → Pedestrian); (({#1} × {street}) → orthogonal); (({#1} × {$2}) → positioned)) /⇒ ({$2} → crosswalk).
where the additional orthogonal and parallel relations indicate whether the entity moves orthogonally or in parallel to the closest region labelled street, which the reasoner tracks by revising the directions of entities that appear at the related location, choosing the direction that has the highest truth expectation to decide the truth of the relationship. Whenever the consequence is derived, it is revised by summing the positive and negative evidence respectively, and the label of highest truth expectation (either street or sidewalk) is then chosen to answer the question {specificPosition} → ?X.
– In addition, relative location relations R are provided by the system, including leftOf, rightOf, topOf, belowOf and closeTo. These are encoded by ({InstanceID1} × {InstanceID2}) → R. Please note that the closeTo relation is only input when the distance is smaller than some threshold defined by the system operator.
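The event encodings listed above can be produced by simple string formatting, as in the following Python sketch. The output follows the notation of this paper rather than the ASCII input syntax of OpenNARS, and all function names as well as the use of Euclidean distance between box centers for closeTo are assumptions made for this example.

```python
import math


def class_event(instance_id, cls):
    return f"{{{instance_id}}} → {cls}"


def direction_event(instance_id, angle):
    return f"({{{instance_id}}} × {{{angle}}}) → directed"


def position_event(instance_id, rect_id):
    return f"({{{instance_id}}} × {{{rect_id}}}) → positioned"


def tracklet_event(instance_id, rect_id, angle, cls):
    # Combined form that reduces the number of input events.
    return f"({{{instance_id}}} × {{{rect_id}}} × {{{angle}}} × {{{cls}}}) → Tracklet"


def close_to_event(id_a, pos_a, id_b, pos_b, threshold):
    """Emit closeTo only if the distance is below the operator-defined threshold."""
    if math.dist(pos_a, pos_b) < threshold:
        return f"({{{id_a}}} × {{{id_b}}}) → closeTo"
    return None
```

For example, `position_event("car67", 22)` yields the statement “({car67} × {22}) → positioned”.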



Street Scene (see [6]) is a dataset for anomaly detection in video data. It is data collected from a camera mounted on a building, watching a street. The dataset includes unusual cases that should be detected, such as jaywalking pedestrians, cars driving on the sidewalk, or other atypical situations. The tracker, applied to the video dataset, outputs tracklets as introduced previously, which are then encoded into Narsese as described. NARS can then use the input information to make predictions, satisfy goals, or answer queries in real time. NARS can also detect anomalies and classify them with a background ontology. Initially we developed a street scene simulator, Crossing, simulating a real street. This allowed us to simulate the traffic and pedestrians on a street and to generate situations/anomalies that we did not have real data for. We then moved on to the Street Scene dataset and tested our system on the following tasks (to reproduce our results, please see our open source code repository, [13]):

Prediction. Here, we tested how well the system can predict future locations of objects, that is, their behavior over time. Figure 2 illustrates this, showing the system’s prediction of the movement of car67 with arrows in violet:



Fig. 2. The predicted movement of a car

To measure prediction accuracy, the following global metric was defined: the ratio $\Theta = \frac{\text{successes}}{\text{successes} + \text{failures}}$. Successes indicates the number of predictions that were matched by actual future input, failures indicates the number of predictions that failed to materialize. As described earlier, this value is also implicitly kept track of by NARS (using the NAL truth value) for each hypothesis, as the system strives to find hypotheses that predict future events reliably, whilst under a constant resource constraint. Since predictions in NARS are not of a yes/no nature, and have different degrees of confidence (and frequency), a truth expectation threshold of 0.6 was used to control whether a prediction is taken into account for Θ. Taking this approach ensures that only the predictions the reasoner is relatively certain about are counted, allowing the metric to be applied. As test data, Street Scene was chosen, and for plotting in particular the videos Test007 and Test002 of the dataset. While the other test videos in the dataset gave similar scores, these two show characteristic phenomena to be explained. Test007 produced the following Θ (Fig. 3) and prediction amount (Fig. 4):



Fig. 3. The prediction performance measured by Θ over time

Fig. 4. The amount of predictions over time



The last data point of Test007 has Θ = 0.574. Test002 led to the following Θ (Fig. 5) and total prediction amount (Fig. 6):

Fig. 5. The prediction performance measured by Θ over time

Fig. 6. The amount of predictions over time

The last data point of Test002 has Θ = 0.518. For comparison, our random prediction “baseline agent” did not exceed Θ = 0.05 in any of the examples.
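The metric Θ with its truth expectation threshold can be computed as in the following Python sketch. The input format (a truth value paired with a flag telling whether the prediction was matched by later input) and the strict comparison against the threshold are assumptions of this example; the evaluation code in [13] may differ in both respects.

```python
def expectation(f, c):
    """Truth expectation e(f, c) = c * (f - 1/2) + 1/2."""
    return c * (f - 0.5) + 0.5


def success_ratio(predictions, threshold=0.6):
    """Compute Theta = successes / (successes + failures).

    predictions: iterable of ((f, c), matched) pairs, where matched is True
    if the predicted event later appeared in the input.
    Predictions whose truth expectation is not above the threshold are skipped.
    Returns None if no prediction was counted.
    """
    successes = failures = 0
    for (f, c), matched in predictions:
        if expectation(f, c) <= threshold:
            continue
        if matched:
            successes += 1
        else:
            failures += 1
    total = successes + failures
    return successes / total if total else None
```

A prediction with truth value (0.5, 0.9) has expectation 0.5 and is therefore ignored by this function, whether or not it happened to come true.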



Please also notice the quick learning which our solution supports: the system reaches reasonable prediction performance from just a few minutes of watching the street from raw video input. To our knowledge this has not been achieved to a comparable extent with Artificial Neural Networks alone, though the trained tracker is of course crucial. This combination is a way to allow autonomous systems, such as NARS is designed to be, to adapt quickly. For these systems, learning speed in terms of sample efficiency clearly matters too, not just the end result. While the monotonic improvements in prediction accuracy over time can mostly be explained by the incremental evidence collection, the non-trivial jumps, which lead to a staircase-like shape, are more difficult to explain. The steps appear partly because of zero tracklet activity at some times, and also because NARS, as a reasoning system with different requirements than the tracker it utilizes, adapts by spontaneous formation of new concepts. Some of them correspond to new predictive hypotheses which can explain the observations better than the previously existing hypotheses, leading to a sudden stepwise improvement in predictive capability. Of course our solution doesn’t guarantee to find a global optimum, which is true for both the reasoning component and the trained tracker, and thus also true for the overall solution. Additionally, concept drift (due to accidents, rush hour, weather conditions, road maintenance, etc.), which radically changes the scene dynamics, is largely present in real world datasets like Street Scene. This makes convergence more difficult to achieve, and when concept drifts remain it is not even desired. Most importantly though, we were able to show the empirical results of improvement in prediction accuracy over time.
This was sufficient for our purposes and is of practical relevance, though it is not the final story, as there are theoretical issues to be explored further in the future. As a related note, concept drifts in the real application we aim for would have rendered convergence proofs with too strong assumptions about the underlying data distribution inapplicable. However, identifying assumptions which are not too strong, are compatible with AIKR [9], and would nevertheless allow for such a proof is of interest to us, and will be part of our future research.

Reasoning-Based Annotation. In the Street Scene dataset, our system is able to label a location (discretized according to the grid) as street based on the tracklet activity of pedestrians and cars. This is done by the reasoner, which utilizes the related background knowledge expressed by the implication statements we have seen. The labelling usually happens in a one-shot way, but will be overridden or strengthened by further evidence, using Revision (see [10]). Revision adds together the positive and the negative evidence of the premises, respectively, to form a stronger conclusion (in terms of confidence) than either of the premises.

Question Answering. Also in Street Scene, we tested the system’s ability to answer questions about the current situation in real time, demonstrating situational awareness about the current state of the street. Questions included
(({?1} → Class); ({?1} × {LocationLabel}) → located)?
where ?1 is a variable queried for, essentially asking for an instance of a specific class at a specific location such as lane1, which the system returns immediately when perceived, allowing a user, for instance, to ask for jaywalking pedestrians.

Anomaly Detection. Often a system should not operate passively (only answering queries), but automatically inform the user when specific cases occur. Our approach allows the usage of background knowledge to classify unusual situations, and to send the user a message, if desired. For instance, consider the case of a moving car getting too close to a pedestrian, putting the person in danger. This can easily be expressed in Narsese, using the previously mentioned relative spatial relations the system receives. Furthermore, it can be linked to a goal, such that the system will inform the user whenever it happens:
((({#1} → Car); ({#1} → [fast]); ({#2} → Pedestrian); (({#1} × {#2}) → closeTo)), say(#2, is in danger)) /⇒ ({SELF} → [informative]).
which will let the system inform the user, assuming the goal ({SELF} → [informative])! was given to the system. An example can be seen in Fig. 7:

Fig. 7. A pedestrian in danger due to close proximity to a quickly moving car

Also jaywalking pedestrians can be specified similarly, using
((({#1} → Pedestrian); (({#1} × {street}) → at)), say(#2, is jaywalking)) /⇒ ({SELF} → [informative]).
An example can be seen in Fig. 8. As before, the anomaly is reported by the system through the invocation of the say operator with the instance name given as argument:



Fig. 8. An example of jaywalking
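To make the conditions of the two anomaly rules concrete, the following Python sketch checks them directly on a set of facts. It is a plain stand-in for illustration only: it uses no truth values, no goal-driven inference and no NARS machinery, and the fact formats (a class dictionary, a set of fast instances, ordered closeTo pairs with the car first, and a location dictionary) are assumptions of this example.

```python
def pedestrian_in_danger(classes, fast, close_to):
    """Report pedestrians that a fast car is close to.

    classes: dict mapping instance name -> class name
    fast: set of instance names currently having the [fast] property
    close_to: set of (a, b) pairs for which the closeTo relation was input
    """
    messages = []
    for car, ped in sorted(close_to):
        if (classes.get(car) == "Car" and car in fast
                and classes.get(ped) == "Pedestrian"):
            messages.append(f"{ped} is in danger")
    return messages


def jaywalking(classes, located_at):
    """Report pedestrians located at a region labelled street.

    located_at: dict mapping instance name -> region label
    """
    return [f"{name} is jaywalking"
            for name, cls in sorted(classes.items())
            if cls == "Pedestrian" and located_at.get(name) == "street"]
```

In the actual system the same conditions are premises of temporal implications, so a report is only issued while the goal of being informative is active and the premises hold with sufficient truth expectation.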

In our tests, the system reported all of these cases in the Street Scene dataset after an initial runtime of 5 min to map the streets based on car tracklet behavior. Future additions will include richer relationships, especially the vector-based relations, which would allow relative angles to be taken into account, including an [approaching] property to refine the danger detection. Improvements like this allow for additional future investigation.



We demonstrated our system’s ability to quickly acquire a reasonable prediction performance from raw video feeds, using Non-Axiomatic Reasoning on tracklet representations of objects provided by the multi-class multi-object tracker (MC-MOT), which are directly converted to Narsese, as we described. Also, the relative and absolute location information given to the system, together with object instances, object categories, relations and other attributes, allows for a rich set of questions to be asked and answered by the system. Here, the system has proven to be able to label streets, sidewalks, bikelanes and crosswalks automatically, and has been shown to be capable of answering conjunctive queries with variables in real time, while the scene is changing. The system has also proven to be effective in detecting anomalies with the help of a simple seed ontology, which contains anomaly classes that are of interest to the operator. From a software perspective, due to the proposed streaming approach and the use of Redis® queues, the architecture is highly scalable. It allows the components to work in a distributed way and to be easily deployed on traffic cameras like the one the Street Scene dataset was created with. It has also reached our performance goal of running at 15 fps with the proposed setup, which makes it suitable for deployment on real traffic cams.



Most importantly here though, the system has proven to be able to operate autonomously, without a human in the loop: the system automatically maps locations based on tracklet activity and informs the user about situations of interest, guided by background knowledge and its goals, but does not require the user to give any information in return. The scene independence allows for large-scale deployment, as no scene-specific adjustments need to be made. Future work on the theoretical side will include potential convergence proofs, architectural improvements, as well as exploring control improvements for NARS. On the practical side, we will extend the system to support a more comprehensive ontology, along with a wider range of tracked entity classes. Also, together with additional measures and comparisons with related techniques, we will try to enhance the prediction accuracy further. Utilizing the prediction ability of the system will potentially allow anomalies not only to be detected, but to be predicted before they actually occur.

References

1. Alirezaie, M., Laengkvist, M., Sioutis, M., Loutfi, A.: Reasoning upon learning: a generic neural-symbolic approach. In: Thirteenth International Workshop on Neural-Symbolic Learning and Reasoning (2018)
2. Garcez, A.D.A., Gori, M., Lamb, L.C., Serafini, L., Spranger, M., Tran, S.N.: Neural-symbolic computing: an effective methodology for principled integration of machine learning and reasoning. arXiv preprint arXiv:1905.06088 (2019)
3. Hammer, P., Lofthouse, T., Wang, P.: The OpenNARS implementation of the non-axiomatic reasoning system. In: International Conference on Artificial General Intelligence, pp. 160–170. Springer, Cham (2016)
4. Jo, K., Im, J., Kim, J., Kim, D.S.: A real-time multi-class multi-object tracker using YOLOv2. In: 2017 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), pp. 507–511. IEEE (2017)
5. Mao, J., Gan, C., Kohli, P., Tenenbaum, J.B., Wu, J.: The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. arXiv preprint arXiv:1904.12584 (2019)
6. Ramachandra, B., Jones, M.: Street Scene: a new dataset and evaluation protocol for video anomaly detection. arXiv preprint arXiv:1902.05872 (2019)
7. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
8. Rehling, J., Hofstadter, D.: The parallel terraced scan: an optimization for an agent-oriented architecture. In: 1997 IEEE International Conference on Intelligent Processing Systems (Cat. No. 97TH8335), vol. 1, pp. 900–904. IEEE (1997)
9. Wang, P.: Insufficient knowledge and resources – a biological constraint and its functional implications. In: 2009 AAAI Fall Symposium Series (2009)
10. Wang, P.: Non-Axiomatic Logic: A Model of Intelligent Reasoning. World Scientific, Singapore (2013)
11. Wang, P.: Rigid Flexibility: The Logic of Intelligence. Springer, Heidelberg (2006)
12. Wang, P., Li, X., Hammer, P.: Self in NARS, an AGI system. Front. Robot. AI 5, 20 (2018)
13. OpenNARS applications. Accessed 25 June 2019

Document Similarity from Vector Space Densities

Ilia Rushkin
Harvard University, Cambridge, USA

Abstract. We propose a computationally light method for estimating similarities between text documents, which we call the density similarity (DS) method. The method is based on a word embedding in a high-dimensional Euclidean space and on kernel regression, and takes into account semantic relations among words. We find that the accuracy of this method is virtually the same as that of a state-of-the-art method, while the gain in speed is very substantial. Additionally, we introduce generalized versions of the top-k accuracy metric and of the Jaccard metric of agreement between similarity models.

Keywords: Document retrieval · Text similarity · Kernel regression · Word embedding

© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 160–171, 2021.

1 Introduction

Estimating similarities among texts is essential for document retrieval, clustering and recommendation systems. Taking into account semantic connections is an interesting way of enhancing the similarity estimation, since it alleviates two interconnected shortcomings of the simpler word-matching methods. One is the difficulty with polysemy and semantics in natural languages, and the other is the near-orthogonality, meaning that the number of matching features is low for most pairs of documents, causing a large uncertainty in the match-based similarities [3].

To incorporate semantic connections, the algorithm should have some kind of a “meaning-distance map” among words. A popular method of doing this is word embedding, also known as vectorization [5, 7, 9], where such a map is trained on the frequency of words being near each other in the document corpus. This method maps words (or, more generally, features such as n-grams) onto points in a Euclidean space in such a way that the Euclidean distance between points represents semantic distance. Word embeddings can now be trained relatively easily, and pre-trained ones are freely available (e.g. [6]).

However, even when the word embedding exists, there remains the question of getting from similarities between features (words) to similarities between documents. At present, it appears that the existing methods for this task are either computationally very heavy (e.g. the relaxed word mover's distance, or RWMD, introduced by [8] following [10]), or light but with a significantly lower accuracy (e.g. representing each document by a single centroid point in the embedding space). In this work we propose a method that appears to be both accurate and fast. We call it the density similarity method, or DS for short, since it represents documents as probability densities in the embedding space. On our example data, the DS method performed very similarly to RWMD, while being orders of magnitude faster. Furthermore, we think that the idea of the DS method is very natural and straightforward mathematically, giving it interpretability.

The rest of this paper is organized as follows. In Sect. 2 we introduce and discuss the DS method, with Subsect. 2.1 and 2.2 devoted to the questions of bandwidth and sampling the space, respectively. In Sect. 3 we propose new generalizations of accuracy metrics. In particular, they are used in Sect. 4, where we apply the DS and the benchmark method to several datasets. Section 5 contains the discussion and further thoughts.

2 Description of the Method

Word embedding, or vectorization, is a mapping of document features (informally referred to as “words”) onto points in a $d$-dimensional Euclidean space $\mathbb{R}^d$, where $d$ is usually in the hundreds. Each document becomes a sequence of points, containing a point multiple times if a word is used multiple times in the document. If we ignore the word order, then the document is a set of distinct points with multiplicity weights assigned to them, i.e. it is a distribution. More formally, the weights form a document-feature matrix $w_{t,i}$, where the index $t = 1, \dots, N_d$ enumerates documents in the corpus and $i = 1, \dots, N_f$ enumerates the features. In the simplest case, the elements of this matrix can be the counts of features in documents, but it is common to subject it to various transformations, such as normalizing rows to unit sum and applying the TF-IDF transformation, with the goal of lowering the weights of those features that are present in too many documents and are therefore not believed to be as informative as the rare features. The word embedding maps $i \to x_i$, where each $x_i$ is a $d$-dimensional vector (the “feature-point”), and therefore a document $t$ is represented by a distribution defined on the set of all feature-points in the corpus: $f_t(x_i) \equiv w_{t,i}$.

The density similarity method that we propose here is to view the document distributions as statistical sample densities and to estimate their distribution density values on a small set of sample points $\{z_j\}$. Transition from the full set of feature-points to the set $\{z_j\}$ involves distances in the embedding space, thus incorporating semantic connections into the method. We call it the “density similarity method”.

To estimate the densities, we can use kernel (or “non-parametric”) regression, which is a well-developed field (see [4] or [12] for an introduction to it). Namely, let us define the density of a document at an arbitrary point $z \in \mathbb{R}^d$ as a kernel regression:

$$\rho_t(z) = \frac{\sum_{i=1}^{N_f} k\!\left(\frac{z - x_i}{h}\right) w_{t,i}}{\sum_{i=1}^{N_f} k\!\left(\frac{z - x_i}{h}\right)} \tag{1}$$

Here $k(y)$ is the kernel and $h > 0$ is the kernel bandwidth. The most commonly used kernel is Gaussian: $k(y) = (2\pi)^{-d/2} \exp\left(-|y|^2/2\right)$. Another notable choice is the radially-symmetric Epanechnikov kernel $k(y) = \max\left(0, d - |y|^2\right)$, based on [2], where the $d = 1$ version of this kernel was first studied and demonstrated to be asymptotically optimal.
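As an illustration of Eq. 1, the following is a minimal Python/NumPy sketch of the density computation with a Gaussian kernel. It is not the author's implementation (the calculations reported later were done in R); the function name `density_matrix` and the argument names `W`, `X`, `Z` are hypothetical. Because Eq. 1 is a ratio, the constant $(2\pi)^{-d/2}$ cancels, and the sketch additionally shifts the exponents per sample point to reduce underflow in high dimensions, which also cancels in the ratio.

```python
import numpy as np

def density_matrix(W, X, Z, h):
    """Kernel-regression densities rho[t, j] of Eq. 1 with a Gaussian kernel.

    W: (N_d, N_f) document-feature weights w[t, i]
    X: (N_f, d)   feature-points x_i
    Z: (n, d)     sample points z_j
    h: scalar bandwidth (assumed > 0)
    """
    # Squared Euclidean distances |z_j - x_i|^2, shape (n, N_f)
    sq = (Z**2).sum(axis=1)[:, None] - 2.0 * (Z @ X.T) + (X**2).sum(axis=1)[None, :]
    sq = np.maximum(sq, 0.0)               # guard against tiny negative round-off
    log_k = -sq / (2.0 * h**2)             # log of the Gaussian kernel, up to a constant
    # Shift each row by its maximum: the shift cancels between numerator and denominator
    log_k -= log_k.max(axis=1, keepdims=True)
    K = np.exp(log_k)                      # (n, N_f)
    numerator = W @ K.T                    # (N_d, n): sum_i k(...) * w[t, i]
    denominator = K.sum(axis=1)            # (n,):     sum_i k(...)
    return numerator / denominator
```

For example, `density_matrix(W, X, Z, h=1.0)` returns one row per document and one column per sample point; with all weights equal to 1 every entry is 1, as expected from the normalized form of Eq. 1.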



We note that the denominator in Eq. 1 is just a normalization constant. For our purposes normalization plays no role, so we could use only the numerator of Eq. 1. Applying Eq. 1 at the $n$ sample points $z_j$, we obtain the density matrix of the corpus: $\rho_{t,j} \equiv \rho_t(z_j)$. Presuming that $n < N_f$, this matrix can be seen as a condensed version of the document-feature matrix $w_{t,i}$, transformed from the $N_f$ original features to $n$ new features. This transformation plays a role similar to that of latent semantic indexing (LSI) or a topic model such as latent Dirichlet allocation (LDA). Like these examples, the construction of the density matrix incorporates semantic relations among features, since it uses distances in the Euclidean space. Now the document-document similarity can be extracted as similarity among the rows of the density matrix. Given the interpretation of each row as a probability distribution, it is conceptually appealing to use the Jensen-Shannon divergence [1], but a simple cosine-similarity measure is a faster alternative.

In the above we were dealing with one corpus of documents. In document retrieval and recommendation problems there are usually two corpora, queries and items, and it is the cross-corpus (query-item) similarity that needs to be found. The generalization of the entire discussion is straightforward: the two matrices $w^{(\text{queries})}_{t,i}$ and $w^{(\text{items})}_{t,i}$ will be independently transformed into two densities $\rho^{(\text{queries})}_{t,j}$ and $\rho^{(\text{items})}_{t,j}$, and similarity will be computed between them.

Equation 1 is readily generalizable in a variety of ways. For instance, we used a spherically-symmetric kernel, but it is not a requirement. Generally speaking, instead of a single bandwidth parameter $h$ one has a $d \times d$ bandwidth matrix $H$, in which case the kernel arguments $(z - x_i)/h$ in Eq. 1 are replaced by matrix products $H^{-1}(z - x_i)$. However, this level of generality is often too complex, and a common intermediate approximation is to use a diagonal bandwidth matrix $H = \mathrm{diag}\left(h^{(1)}, \dots, h^{(d)}\right)$, i.e. each Cartesian axis is served by its own bandwidth. Note, however, that this makes $\rho$ dependent on the choice of the coordinate system in $\mathbb{R}^d$.

Let us briefly discuss the speed of the proposed DS method. If a document contains, on average, $F$ unique features, creation of the density matrix has time complexity of at most $O(F N_d n)$, assuming straightforward calculation of sums in Eq. 1 (although faster methods may exist). The cosine similarity among rows of this matrix (size $N_d \times n$) has time complexity $O(N_d^2 n)$. Hence, the time complexity of the entire calculation is $O(F N_d n + N_d^2 n)$. By contrast, for the RWMD method the best-case time complexity is $O(N_f^2)$ for each document-document pair [8], therefore $O(N_f^2 N_d^2)$ for the entire corpus. The dependence on the number of features (even ignoring the fact that $F < N_f$) is quadratic in RWMD and linear in DS, indicating that on large corpora DS must be significantly faster. In order to make the comparison more direct, let us estimate that $F = N_f N_d^{-b}$. Evidently, $b$ may lie in the interval between two extreme values: 0 (each document contains all the features of the corpus) and 1 (no two documents have any features in common). With this, the time complexities are $O(N_f N_d^{1-b} n + N_d^2 n)$ for DS, versus $O(N_f^2 N_d^2)$ for RWMD. Below, when we compare the RWMD and DS methods on example datasets, we will see that this indeed translates into a very sizeable difference in speed. We chose RWMD as a benchmark because its authors, when introducing it, found it to compare favorably with many other popular algorithms.

The DS method involves a number of meta-parameters:

1. The bandwidth $h$ (or, more generally, the bandwidth matrix $H$).
2. The number $n$ of the sample points.
3. The kernel shape $k(y)$.
4. The type of similarity measure to be applied to the density matrix rows.
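For the similarity measure applied to the density-matrix rows, the cosine option can be sketched as follows in Python/NumPy. This is only an illustrative sketch under stated assumptions: the function name `cosine_similarity`, the argument names and the small `eps` guard against zero rows are hypothetical and not taken from the paper.

```python
import numpy as np

def cosine_similarity(rho_queries, rho_items, eps=1e-12):
    """Cosine similarity between rows of two density matrices.

    rho_queries: (N_queries, n) density matrix of the query corpus
    rho_items:   (N_items, n)   density matrix of the item corpus
    Returns an (N_queries, N_items) similarity matrix.
    """
    # Normalize every row to unit Euclidean length (eps avoids division by zero)
    q = rho_queries / np.maximum(np.linalg.norm(rho_queries, axis=1, keepdims=True), eps)
    it = rho_items / np.maximum(np.linalg.norm(rho_items, axis=1, keepdims=True), eps)
    # Dot products of unit vectors are the cosines of the angles between rows
    return q @ it.T
```

When queries and items are the same corpus, the same matrix can be passed twice, and the diagonal (self-similarity) would typically be discarded before ranking.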

For the last two meta-parameters, the kernel shape and the similarity measure, it is easy to settle on a Gaussian kernel and cosine similarity. These are simple and standard, and the results are not expected to be too sensitive to them. The bandwidth and the sample points, on the other hand, affect the outcome strongly, as is well known for any kernel regression. In principle, the optimal bandwidth and sample points should be found by training on a labeled training set. Such training, however, is costly, which makes it essential to look for quick estimates of these parameters instead. In the next two sections we discuss our strategy for such estimates.

2.1 Choice of Bandwidth

Selection of the bandwidth $h$ in kernel regression like Eq. 1 lies at the heart of kernel regression theory [4, 12]. The methods fall into two broad categories: computationally heavy methods that perform the optimization of bandwidth (i.e. training), and computationally light “rules-of-thumb”. The classic example of the first category is the least-squares cross-validation, which determines $h$ as the minimizer of the following cost function:

$$C(h) = \frac{1}{N^2 h^d} \sum_{i,j=1}^{N} \left[ K\!\left(\frac{x_i - x_j}{h}\right) - \frac{2N}{N-1}\, k\!\left(\frac{x_i - x_j}{h}\right) \right] + \frac{2k(0)}{(N-1)\, h^d} \tag{2}$$

Here $K$ is the convolution kernel: $K(y) = \int k(u)\, k(y - u)\, d^d u$. For the Gaussian kernel $k(y) = (2\pi)^{-d/2} \exp\left(-|y|^2/2\right)$, it is a broader Gaussian: $K(y) = (4\pi)^{-d/2} \exp\left(-|y|^2/4\right)$. In the more general case of a bandwidth matrix $H$, the $h^d$ in the denominators should be replaced with $\det H$, and the kernel arguments $(x_i - x_j)/h$ with $H^{-1}(x_i - x_j)$.

The Silverman rule [11] is an example from the “rule-of-thumb” category. It estimates the bandwidth separately for each Cartesian axis in $\mathbb{R}^d$, i.e. gives a diagonal bandwidth matrix, and a single bandwidth can be formed as the geometric mean of the diagonal. Namely, if $\sigma^{(a)}$ is the standard deviation of the $a$-th components ($a = 1, \dots, d$) of $x_i$,

$$h_S^{(a)} = \left(\frac{4}{N(d+2)}\right)^{\frac{1}{d+4}} \sigma^{(a)}, \qquad h_S = \left(\prod_{a=1}^{d} h_S^{(a)}\right)^{1/d} \tag{3}$$
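The Silverman rule of Eq. 3 is short enough to sketch directly. The Python/NumPy function below is a hypothetical illustration, not the author's code; it assumes that $N$ is the number of feature-points, that the sample standard deviation (`ddof=1`) is intended, and that every coordinate has a non-zero spread so that the geometric mean is well defined.

```python
import numpy as np

def silverman_bandwidth(X):
    """Per-axis Silverman bandwidths and their geometric mean (Eq. 3).

    X: (N, d) array of feature-points x_i.
    Returns (h_axes, h_single): an array of d per-axis bandwidths and one scalar.
    """
    N, d = X.shape
    sigma = X.std(axis=0, ddof=1)                      # assumption: sample standard deviation
    factor = (4.0 / (N * (d + 2))) ** (1.0 / (d + 4))
    h_axes = factor * sigma                            # diagonal of the bandwidth matrix
    h_single = float(np.exp(np.mean(np.log(h_axes))))  # geometric mean of the diagonal
    return h_axes, h_single
```

Computing the geometric mean through logarithms avoids overflow or underflow of the product when $d$ is in the hundreds.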



Another option is to take the average spacing between the feature points. To estimate the volume populated by the $x_i$, we circumscribe a $d$-dimensional sphere of radius $R$ centered at the origin. The log-volume of a ball of radius $R$ is

$$\log v(R) = \log \frac{\pi^{d/2}}{\Gamma(1 + d/2)} + d \log R \tag{4}$$

We may also cut out a smaller unpopulated ball of radius $r < R$, so the remaining volume is a spherical layer in which the $x_i$ lie. The log-volume of the described spherical layer is

$$\log V(r, R) = \log v(R) + \log\left(1 - e^{\log v(r) - \log v(R)}\right) \tag{5}$$

In fact, the second logarithm is only a small correction unless $r$ is close to $R$: in a high-dimensional ball, the volume is highly concentrated near the surface. But we keep it here for generality. The bandwidth is then estimated as the typical spacing that the data points would have if they were distributed in that volume uniformly:

$$h_V = e^{\left(\log V(r, R) - \log N\right)/d} \tag{6}$$
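The volume-based estimate of Eqs. 4 to 6 can be sketched in Python as below. This is a hedged illustration rather than the author's script: the function names and the quantile arguments `q_low` and `q_high` are hypothetical, with $r$ and $R$ taken as quantiles of the norms $|x_i|$ (0.1 and 0.9 by default), and $r < R$ is assumed. Working in log-volumes, as the equations do, avoids overflow for large $d$.

```python
import numpy as np
from math import lgamma, log, pi

def log_ball_volume(R, d):
    """Log-volume of a d-dimensional ball of radius R (Eq. 4)."""
    return 0.5 * d * log(pi) - lgamma(1.0 + 0.5 * d) + d * log(R)

def volume_bandwidth(X, q_low=0.1, q_high=0.9):
    """Bandwidth h_V of Eq. 6 from the spherical layer between radii r and R.

    X: (N, d) array of feature-points; r and R are quantiles of the norms |x_i|.
    """
    N, d = X.shape
    norms = np.linalg.norm(X, axis=1)
    r, R = np.quantile(norms, [q_low, q_high])
    log_v_R = log_ball_volume(R, d)
    log_V = log_v_R
    if r > 0:
        # Correction term of Eq. 5; log1p(-x) computes log(1 - x) accurately
        log_V += np.log1p(-np.exp(log_ball_volume(r, d) - log_v_R))
    return float(np.exp((log_V - log(N)) / d))
```

Setting `q_low=0.0` and `q_high=1.0` reproduces the straightforward choice of the minimum and maximum norm.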


The straightforward choices for $r$ and $R$ are the minimum and the maximum of $|x_i|$. However, to avoid making $h_V$ sensitive to outliers, we selected $r$ and $R$ as the 0.1 and 0.9 quantiles of $|x_i|$. On our data, the volume method of Eq. 6 gave the best results. The Silverman rule gave a somewhat smaller bandwidth and led to lower accuracy. The minimization of Eq. 2 was not only computationally heavy, but also produced much smaller bandwidth values, leading to the worst accuracy of all. For this reason, we suggest using Eq. 6 for bandwidth estimation. If labeled training data is available, one can also try to adjust the estimated bandwidth by a multiplicative factor.

2.2 Sampling the Space

For a high $d$ (a typical value in word embeddings is 300), creating a regular grid of sample points is not practical due to the “curse of dimensionality”: a single cube in $\mathbb{R}^d$ has $2^d$ vertices. The regularity of the sampling array is not required, however. We just need to sample the space sufficiently for measuring distribution differences, wherever they may occur. One solution is to generate any desired number $n$ of sample points randomly, from a distribution that is uniform in a ball of radius $R$. It is essentially the same concept as was considered in the previous section for bandwidth estimation. Logically, one should take a spherical layer $V(r, R)$ rather than a ball, but, as we saw, the difference is negligible unless $r$ is very close to $R$. As in the previous section, we select $R$ as a high quantile of the norms $|x_i|$ (we used quantile 0.95), rather than $\max|x_i|$, due to the possibility of outliers.

The way to generate $n$ uniform random points $z_j$ in a $d$-dimensional ball of radius $R$ is as follows. We first generate $n$ points $Z_j$ from a $d$-variate standard normal distribution (in Cartesian components, it means generating a $d \times n$ matrix of independent standard normal variables). Then, each point is normalized: $z_j = Z_j\, u_j^{1/d} R / |Z_j|$, where $u_j$ are $n$ numbers independently drawn from the standard uniform distribution on the $[0, 1]$ interval.
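The sampling recipe above can be sketched in a few lines of Python/NumPy. The function name `sample_ball` and the `rng` argument are hypothetical; the sketch stores points as rows (an $n \times d$ array) rather than the $d \times n$ layout mentioned in the text, which is only a matter of convention.

```python
import numpy as np

def sample_ball(n, d, R, rng=None):
    """Draw n points uniformly from a d-dimensional ball of radius R."""
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((n, d))        # isotropic directions Z_j
    u = rng.random(n)                      # standard uniform numbers u_j
    norms = np.linalg.norm(Z, axis=1)
    # z_j = Z_j * u_j^(1/d) * R / |Z_j|: random direction, radius distributed as R * u^(1/d)
    return Z * (R * u ** (1.0 / d) / norms)[:, None]
```

The factor $u_j^{1/d}$ is what makes the points uniform in volume: the fraction of a ball's volume within radius $\rho$ is $(\rho/R)^d$, so the radius must be distributed as $R\,u^{1/d}$.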



There is no simple way to choose $n$ optimally. Given the uniform distribution of $z_j$, there is no expectation of results deteriorating if $n$ is too high. Rather, the approximation is always expected to improve with higher $n$, but at an ever-decreasing rate. This situation of “diminishing returns” makes the choice of $n$ similar to that of the number of trees in a random forest algorithm, rather than to the number of components in LSI or the number of topics in LDA. At the same time, since sampling involves randomness, a small $n$ increases the probability of fluctuations in the similarity results. It is even possible for a lower $n$ to give higher accuracy, but not consistently. In our application of the method we tried several values of $n$ ranging from 100 to 10,000.

3 Performance Metrics

On labeled data we can compare the accuracies of the DS method and a benchmark method, and on unlabeled data we can measure the agreement between them. While standard metrics for accuracy and agreement exist, it seemed useful to us to introduce versions of them that are modified with a “softness” parameter $s$, and to present the results for several values of $s$. Of course, we include $s = 0$, at which the modification disappears. The remainder of this section describes these metrics.

The output of a document-similarity method can be described by a distance matrix $M_{t',t}$, where the row-index $t'$ enumerates documents in the corpus of queries and the column-index $t$ those in the corpus of items. Without loss of generality, we can consider only one query, treating $t'$ as a spectator index. Furthermore, the actual values of distance (equivalently, of similarity) are not important, only their ranks are, because they determine which items, and in what order, are returned for a query. Therefore, we consider the vector $r_t$ of ranks of the values in a row of $M$.

An accuracy metric should give more weight to lower ranks: the top-$k$ items are important, but we don’t care much in what order the model places the almost-irrelevant items in the long tail of high ranks. A standard method is to take only items with rank not higher than a threshold $k$. More generally, even within these items, we can weigh lower ranks higher. Given a labeled data set, we take the $k$ items with the lowest rank, ordered by that rank. The label-correctness values form a 0-or-1 (boolean) vector $c_{k'}$, $k' = 1, \dots, k$, with which we define the soft top-$k$ accuracy:

$$A(k, s) = \frac{\sum_{k'=1}^{k} c_{k'} w_{k'}}{\sum_{k'=1}^{k} w_{k'}}, \qquad w_{k'} = \frac{1}{k'^{\,s}} \tag{7}$$

Here $s \ge 0$ is the softness parameter. The values of $A(k, s)$ lie in the $[0, 1]$ interval. At $s = 0$ the metric reduces to the usual top-$k$ average correctness.

Another metric of performance is needed to measure the agreement between two models which produce two ranking vectors, $r_t^{(A)}$ and $r_t^{(B)}$. Choosing a rank threshold $k$, we can measure the agreement as a Jaccard similarity between the items assigned rank $\le k$ in the two models:

$$U^{(A)}(k) = \left\{t : r_t^{(A)} \le k\right\}, \quad U^{(B)}(k) = \left\{t : r_t^{(B)} \le k\right\}, \quad U(k) = U^{(A)}(k) \cup U^{(B)}(k),$$

with which the Jaccard similarity index is

$$J(k) = \frac{\left|U^{(A)}(k) \cap U^{(B)}(k)\right|}{|U(k)|} \tag{8}$$

This metric has a sharp cutoff: if a model ranked an item beyond $k$, it is penalized regardless of how far beyond $k$ it ranked the item. We want to generalize it to make the penalty depend on the rank. Observe that the quantity

$$\frac{1}{|U(k)|} \sum_{t \in U(k)} \min\left(1, \left(r_t^{(A)}/k\right)^{-1/s}\right)$$

penalizes the model A in just that way (here $s \ge 0$). Taking, for symmetry, the sum of that quantity and its counterpart from model B, and subtracting 1, we obtain the soft Jaccard index:

$$J(k, s) = -1 + \frac{1}{|U(k)|} \sum_{t \in U(k)} \left[ \min\left(1, \left(\frac{r_t^{(A)}}{k}\right)^{-1/s}\right) + [A \to B] \right] \tag{9}$$

By the meaning of $U$, at least one of the two terms in the summand is always 1, so $J(k, s)$ takes values between 0 and 1, and the upper bound is achieved if and only if the two models completely agree on which items they rank within $k$. At $s = 0$, the metric reduces to the standard Jaccard measure of Eq. 8:

$$J(k, 0) = -1 + \frac{\left|U^{(A)}(k)\right| + \left|U^{(B)}(k)\right|}{|U(k)|} = -1 + \frac{|U(k)| + \left|U^{(A)}(k) \cap U^{(B)}(k)\right|}{|U(k)|} = J(k)$$

We note in passing that our choice of a power-law dependence was made for simplicity, but one could also generalize $A$ and $J$ using, e.g., an exponential decay. Recalling now the query-index $t'$, the quantities of Eq. 7 and Eq. 9 can be computed for every query: $A_{t'}(k, s)$, $J_{t'}(k, s)$, and then their distribution over $t'$ can be examined.
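The two metrics of this section, the soft top-$k$ accuracy of Eq. 7 and the soft Jaccard index of Eq. 9, can be sketched for a single query as follows. This Python/NumPy code is illustrative only: the function and argument names are hypothetical, ranks are assumed to start at 1, $k \ge 1$ is assumed, and the $s = 0$ case is handled explicitly as the limit in which the soft term becomes the sharp indicator $r_t \le k$.

```python
import numpy as np

def soft_topk_accuracy(correct, k, s=0.0):
    """Soft top-k accuracy A(k, s) of Eq. 7.

    correct: 0/1 label-correctness of the returned items, ordered by increasing rank.
    """
    c = np.asarray(correct[:k], dtype=float)
    w = np.arange(1, len(c) + 1, dtype=float) ** (-s)   # weights w_k' = 1 / k'^s
    return float((c * w).sum() / w.sum())

def soft_jaccard(rank_a, rank_b, k, s=0.0):
    """Soft Jaccard index J(k, s) of Eq. 9 for two rank vectors over the same items."""
    ra = np.asarray(rank_a, dtype=float)
    rb = np.asarray(rank_b, dtype=float)
    in_union = (ra <= k) | (rb <= k)                    # items in U(k)

    def term(r):
        if s == 0:
            return (r <= k).astype(float)               # limit s -> 0: sharp cutoff
        return np.minimum(1.0, (r / k) ** (-1.0 / s))

    total = term(ra[in_union]) + term(rb[in_union])
    return float(total.sum() / in_union.sum() - 1.0)
```

With `s=0.0` both functions reduce to the standard top-$k$ accuracy and Jaccard index, so the same code can be used to tabulate the metrics over a grid of softness values for every query.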

4 Experiment

Our data set consisted of texts scraped from personal webpages of faculty members and researchers at Harvard University (dataset “People”), the descriptions of events, talks, seminars and public lectures (dataset “Events”), and the news articles recently published at the same institution (dataset “News”). All texts were cleaned and tokenized on the basis of single lemmatized words, dropping stopwords and words shorter than 4 characters. The tokens were then used as features in word embedding. A subset of the dataset “People” was labeled by people’s affiliation with Harvard departments (number of classes: 60; class sizes: 10–479, with mean 58 and median 33). Some descriptive statistics of the data are given in Table 1. The “People” dataset always plays the role of the “queries” corpus, and the role of the “items” corpus can be played by any of the three datasets (when the same dataset “People” is used as “queries” and as “items”, we remove self-recommendations).

We used the same word2vec embedding for all methods, pre-trained on a corpus of Wikipedia articles, with dimensionality d = 300 [6]. The tokens that are absent from our corpora were dropped from the embedding, leaving 138,030 tokens.



Table 1. Descriptive statistics of the data. Columns: Dataset | # documents | Max # tokens/document | Mean # tokens/document. Rows include “People” labeled.











We computed the query-item similarity matrices using the RWMD method and using the DS method with several different numbers of sample points. As was mentioned earlier, the RWMD model for document similarity was shown by its authors to compare favorably with a number of other algorithms, which is why we confine ourselves to using it as a benchmark. All calculations were done on the same machine (2.4 GHz processor and 16 GB memory), in an R environment. For RWMD, we used the implementation of this method in the R package text2vec, with Euclidean distance. Creation of the density matrices was done with a custom R script, and the subsequent calculation of the similarity matrices was done using the R package quanteda. In the DS method, the kernel was Gaussian, and the bandwidth was determined by Eq. 6, with several adjustment factors tried from 0.25 to 64, in powers of 2.

We first used the dataset “People” as both queries and items (recommendation of people to other people), and looked at the top-5 accuracy on the labeled subset of this dataset. As Fig. 1 illustrates, the accuracy of RWMD is only slightly higher than that of density similarity with 500 or more points. For 1,000 and 10,000 points the difference from RWMD is statistically insignificant (p > 0.1).

Fig. 1. Top-5 accuracy comparison among RWMD and several versions of density similarity method, differing by the number of sample points. Bandwidth is hV from Eq. 6. The vertical axis is the accuracy measure of Eq. 7, averaged across queries. The horizontal axis is the softness parameter. The error-bars show 3 standard errors.

Fig. 2. Accuracy comparison among RWMD and several versions of density similarity method, differing by the number of sample points. Bandwidth is adjusted by a factor 0.5. The vertical axis is the accuracy measure of Eq. 7, averaged across queries. The horizontal axis is the softness parameter. The error-bars show 3 standard errors.

Larger rank-cutoffs demonstrate a similar picture: top-10 accuracy values are consistently lower than the top-5 values, but the difference between RWMD and DS with 1,000 or 10,000 points does not exceed 0.02. At the same time, the difference in computation speed was large: the RWMD calculation took over 100 h, whereas the density similarity with 1,000 points took less than 11 min (more precisely, 654 s, consisting of 392 s on density estimation and 262 s on the similarity matrix), and the calculation with 500 points took about 5 min (175 s on density estimation and 124 s on the similarity matrix).1

We also calculated the recommendations of news and events to people, where the dataset “People” serves as queries and is paired either with “Events” or “News” as items. The RWMD method for both recommendations together took about 60,000 s (about 16,000 s for “News” and 45,000 s for “Events”). By comparison, the density similarity method with 1,000 sample points took about 930 s for the same task, and with 500 sample points under 600 s. We can measure the agreement with RWMD on unlabeled datasets using the soft Jaccard index of Eq. 9, and the results are shown in Fig. 3. They show a strong overall correlation between the model outcomes (this gives no indication which model gives a better recommendation when they do not agree).

1 In this case, queries and items are represented by the same corpus, so only one density matrix is




Fig. 3. Agreement of two versions of density similarity methods (differing in the number of sample points) with RWMD in three recommendation problems: recommending people to people, events to people, and news to people. The vertical axis is the generalized Jaccard measure of Eq. 9. The colors represent three values of the softness parameter. The box plots show the quartiles of the distribution of this quantity across all queries, and the added diamond shapes show the mean values ± one standard deviation.

As a minimal form of training the model, we repeated the calculation using an adjusted bandwidth: multiplying the bandwidth of Eq. 6 by several simple factors. This exploration showed that hV is close to the optimum. Significant deviations from it lead to a decrease either in accuracy, or in the Jaccard index of agreement with RWMD, or in both. For instance, Fig. 2 shows the accuracy at a halved bandwidth, which is lower than in Fig. 1.



5 Discussion and Further Work

The density similarity (DS) method, which we propose here, estimates document similarities, a crucial task for document retrieval and clustering. The speed of the DS method strongly depends on the number of sample points. We found that a sample of 500 or 1,000 points is sufficient: increasing it further produces only a small additional improvement. Even with 10,000 sample points, the DS method is much faster than RWMD, while its top-k accuracy turns out essentially the same. We believe that the gain in speed compensates well for the slight difference in accuracy, even if it turns out that this difference is systematic. Elsewhere, RWMD has been shown to be more accurate than a number of other popular methods [8], and by amounts that are significantly larger than the difference that we observe here.

Our application of the DS method relies on direct estimates of meta-parameters (bandwidth, sampling the space). In this form, it is an unsupervised machine learning algorithm. However, if a labeled dataset is available, it is straightforward to incorporate some training into the method, as we did with the bandwidth adjustment coefficient. We found that the bandwidth is the single most important parameter of the method, as is typical in non-parametric regressions.

The DS method is essentially a kernel regression in the embedding space. In our view, it is a very straightforward idea, making the results easier to interpret and the method easier to develop further. Moreover, the corpus density matrix, computed as a step of the method, is an interesting condensed version of the document-feature matrix, and can be used as such for purposes other than finding document similarity, e.g. clustering, or visual representations of document corpora.

In the future, we hope to pursue several possible directions of further research: generalization of document features from single words to n-grams; sensitivity to transformations of the document-feature matrices (in this work we used a standard TF-IDF transformation); possibility of combining this method with others in a multi-step fashion.

Acknowledgments. The author is grateful for the support from the Office of the Vice Provost for Advances in Learning at Harvard University.

References

1. Briët, J., Harremoës, P.: Properties of classical and quantum Jensen-Shannon divergence. Phys. Rev. A 79(5), 052311 (2009)
2. Epanechnikov, V.A.: Non-parametric estimation of a multivariate probability density. Theory Probab. Appl. 14(1), 153–158 (1969)
3. Greene, D., Cunningham, P.: Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 377–384. ACM, New York (2006)
4. Hansen, B.E.: Lecture Notes on Nonparametrics. Lecture notes (2009). https://www.ssc.wisc.edu/~bhansen/718/NonParametrics1.pdf. Accessed 14 Jan 2020
5. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
6. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405 (2017)
7. Mikolov, T., Yih, W., Zweig, G.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the Conference on Neural Information Processing Systems. Lake Tahoe, NV (2013)
8. Kusner, M., Sun, Y., Kolkin, N., Weinberger, K.: From word embeddings to document distances. In: International Conference on Machine Learning, pp. 957–966. JMLR, Lille (2015)
9. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the Conference on Empirical Methods on Natural Language Processing. ACL, Doha, Qatar (2014)
10. Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: Sixth International Conference on Computer Vision, pp. 59–66. IEEE, Bombay (1998)
11. Silverman, B.W.: Density estimation for statistics and data analysis. In: Monographs on Statistics and Applied Probability. Chapman and Hall, London (1986)
12. Tsybakov, A.B.: Introduction to Nonparametric Estimation. Springer, New York (2009)

Food Classification for Inflammation Recognition Through Ingredient Label Analysis: A Real NLP Case Study

Stefano Campese and Davide Pozza
OMNYS S.r.l, 36100 Vicenza, Italy

Abstract. Recent literature shows that food intolerances affect a large portion of the world population. Diagnosis and prevention are essential to avoid possible adverse responses due to food ingestion. Concerning this point, consumers and industry players are also demanding tools useful to warn individuals about the composition of commercial products. In this scenario, Natural Language Processing (NLP) approaches can be very useful to classify foods into the right intolerance group given their ingredients. In this work, we evaluate and compare different deep and shallow learning techniques, such as Linear Support Vector Machine (Linear SVM), Random Forest, Dense Neural Networks (Dense NN), Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM), with different feature extraction techniques like Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), and Word2Vec, in order to solve this task on real commercial products, aiming to create a baseline for future works and a software product. In the end, interesting and noticeable results have been achieved, and the baselines have been identified as the Linear SVM and the Dense NN with Bag of Words or with the combination of Bag of Words, TF-IDF and Word2Vec.

Keywords: Natural Language Processing · Food health · Food inflammatory reaction · Food intolerances · Convolutional Neural Networks · LSTM · SVM · Random Forest



Inflammations and allergic reactions due to food ingestion nowadays affect a large part of the world population. In 3–5% of the population of Western countries, with a growing trend, food may also cause serious adverse immune responses [12]. World-leading experts estimate that within the next few years up to 60% [14] of the global population could suffer from food intolerance or diet-related inflammations. Concurrently, individuals are becoming more and more aware that their diet might be one of the primary causes of health problems.

© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 172–181, 2021.

NLP for Food Ingredient Label Analysis


In this scenario, preventing risks related to food ingestion is extremely important. From a nutritional and medical point of view, this means avoiding possible adverse immune responses to food by diagnosing a possible intolerance or allergy in advance [8, 13]. In addition to diagnosis, prevention is extremely important as well. Indeed, research aims to prevent or reduce the risk of developing allergies and intolerances in several ways, starting from birth [2]. Nevertheless, this is not enough. Adverse reactions are often related to foods that are considered safe. This happens because individuals often judge the safety of a product from its ingredients label. In some cases, the ingredient description contains keywords, like milk or gluten, that easily classify a food as potentially dangerous for a sensitized person. However, as shown in [1], labels are often more difficult to understand, as they may contain incomplete words due to printing problems, grammatical errors due to incorrect translations, or ambiguous and/or technical words such as casein or albumin. These substances can be found in meat products such as ham or salami, and they signal possible adverse reactions in people who are allergic to milk. In addition, certain foods may contain substances that are not declared on the product label: beer and wine, for example, do not directly contain yeast, but they may contain the same antigens due to the fermentation of their original ingredients. Similarly, some commercial products could be made of sub-products whose ingredients are not reported on the final label. Understanding food ingredient labels to establish the safety of a product in some cases requires knowledge of biology, biochemistry or industrial food production processes: a complex task even for experienced nutritionists. In this scenario, consumers and industry players are asking for answers and tools that address the aforementioned issues, in order to warn and advise individuals in advance about the risks that commercial products and their ingestion may pose.


Task Definition

Section 1 highlighted the need for a powerful tool that allows the user to understand whether or not a certain type of food may cause an adverse reaction, by analyzing the ingredients and their description. Given the large variety of inflammations and allergic reactions related to different foods, and in accordance with the literature [14], we focused on five main food clusters that are responsible for stimulating the production of IgG antibodies, causing an increase of inflammatory mediators: milk, yeast, salicylates, nickel and wheat. At the time of writing, food classification based on ingredient analysis has no proven benchmarks to start from. Therefore, we exploited Natural Language Processing (NLP) to face this challenge and to overcome the issues described in Sect. 1, by extracting correlations and/or semantic relationships among the words used in a food ingredients label. The goal is to predict whether a certain food belongs to one of the five main clusters mentioned above and to state whether it may cause adverse reactions, thus reducing the original problem to five binary text classification challenges. The final goal of our work is to compare state-of-the-art algorithms and feature extraction techniques and to confirm the initial hypothesis about the capacity of NLP to solve this task, hence providing a baseline for future work.
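The reduction to five binary problems can be sketched as follows. This is only an illustration: it assumes, hypothetically, that each record carries the set of clusters it triggers, whereas the paper uses five separately collected datasets. The function name `to_binary_tasks` and the two example records (including their labels) are made up for this sketch and are not taken from the authors' data.

```python
from typing import Dict, List, Set, Tuple

# The five food clusters named in the task definition.
CLUSTERS = ("milk", "yeast", "salicylates", "nickel", "wheat")


def to_binary_tasks(
    records: List[Tuple[str, Set[str]]]
) -> Dict[str, List[Tuple[str, int]]]:
    """Turn (ingredient text, positive clusters) records into one
    binary-labelled dataset per cluster (1 = may cause a reaction)."""
    tasks: Dict[str, List[Tuple[str, int]]] = {c: [] for c in CLUSTERS}
    for text, positives in records:
        for cluster in CLUSTERS:
            tasks[cluster].append((text, int(cluster in positives)))
    return tasks


# Hypothetical records; the labels are illustrative, not expert annotations.
records = [
    ("yogurt intero, latte intero, fermenti lattici", {"milk"}),
    ("farina di frumento, lievito, sale", {"wheat", "yeast"}),
]
tasks = to_binary_tasks(records)
```

Each entry of `tasks` can then be handed to an independent binary classifier, which matches the way the five problems are treated separately in the experiments.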


Experimental Assessment

Experiments were performed on real-world data grouped into five different datasets, as described in Sect. 3.1. In accordance with current literature [5], we focused on testing state-of-the-art NLP algorithms: Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Dense Neural Networks (Dense NN), and Linear Support Vector Machine (Linear SVM). Random Forest has also been included in this study, as it has achieved promising results in recent years [16]. Section 3.2 describes the model selection and hyper-parameter tuning steps. The text pre-processing pipeline and the feature extraction techniques are described in Sect. 3.3 and Sect. 3.4. Finally, Sect. 3.5 presents the results achieved.

3.1 Datasets Description

We used five datasets, one for each of the aforementioned food clusters: milk, yeast, salicylates, nickel and wheat. Each dataset was built by collecting data from the ingredient labels of commercial products currently sold in Italian markets, thus grounding this work in a real-world application of NLP techniques. Each sample of every dataset is made of the food ingredients and a label that states whether this specific food can cause inflammation or not. All data were carefully classified by an expert team of nutritionists, in order to avoid mislabeling errors or issues due to inexperience or to incomprehensible labels made by non-technical people. All collected data are in Italian and cannot be disclosed, as they are protected by a Non-Disclosure Agreement (NDA). Table 1 shows a complete overview of each dataset. What emerges from the data is that each dataset is relatively small in terms of samples and has a slightly unbalanced class distribution. Moreover, for replicability purposes, Table 2 reports positive and negative examples for each dataset.

3.2 Model Selection

Intensive tuning and model selection phases were performed in order to find the most suitable hyper-parameters for the algorithms used. For this purpose, the training set of each dataset was split into a training set (80%) and a validation set (20%).
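A minimal sketch of this model selection step is given below, using only the Python standard library. It assumes a class-preserving (stratified) 80/20 split and an exhaustive search over candidate values; the paper does not state the search strategy, so both are assumptions. The helper names (`stratified_split`, `grid`, `select_best`) and the `placeholder_score` function are hypothetical; in practice the score would be the validation accuracy of a trained model. The candidate values for C and tol follow the ranges reported for the Linear SVM in this section.

```python
import itertools
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx), keeping the class ratio roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, val_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_val = round(len(idxs) * val_fraction)
        val_idx.extend(idxs[:n_val])
        train_idx.extend(idxs[n_val:])
    return sorted(train_idx), sorted(val_idx)


def grid(param_values):
    """Yield every combination of the candidate values as a dict."""
    names = list(param_values)
    for combo in itertools.product(*(param_values[n] for n in names)):
        yield dict(zip(names, combo))


def select_best(param_values, score_fn):
    """Pick the configuration with the highest validation score."""
    return max(grid(param_values), key=score_fn)


# Class counts of one training set from Table 1: 887 positives, 817 negatives.
labels = [1] * 887 + [0] * 817
train_idx, val_idx = stratified_split(labels)

# Candidate values for the Linear SVM: C in 10^-1..10^4, tol in 10^-3..10^1.
svm_grid = {
    "C": [10.0 ** i for i in range(-1, 5)],
    "tol": [10.0 ** i for i in range(-3, 2)],
}


def placeholder_score(params):
    """Stand-in for 'train on train_idx, return accuracy on val_idx'."""
    return -(abs(params["C"] - 10.0) + abs(params["tol"] - 0.01))


best = select_best(svm_grid, placeholder_score)
```

With these ranges the SVM grid contains 6 × 5 = 30 configurations, each of which would be trained on the 80% portion and scored on the 20% portion.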



Table 1. Datasets description. The statistics were computed on the training sets. For both training and test sets, the numbers of positive and negative elements are also reported.

Dataset        Training (positive/negative)   Test (positive/negative)   Sequence length   Max length   Min length
[missing]      1704 (887/817)                 426 (222/204)              34.57 ± 32.37     375          [missing]
[missing]      1704 (1025/679)                426 (256/170)              34.52 ± 32.48     359          [missing]
[missing]      1704 (701/1003)                427 (176/251)              34.37 ± 31.25     370          [missing]
[missing]      1708 (838/870)                 427 (210/217)              34.38 ± 31.62     371          [missing]
Salicylates    1706 (1001/705)                427 (251/176)              34.68 ± 32.22     362          [missing]

The hyper-parameters of the CNNs tuned during validation are the number of convolutional layers {1, 2, 3}, the number of filters {64, ..., 512}, the size of the convolutional kernels {3, 5, 7}, and the number of strides. For the LSTMs, the selected hyper-parameters are the recurrent dropout {0, 0.2, 0.5}, the number of recurrent layers {1, 2} and their dimensions, with {64, ..., 512} neurons. On top of both neural networks, we stacked dense layers for the final classification. The number of dense layers {1, 2, 3} and their neurons {64, ..., 256} were selected during validation. The same approach was applied to the Dense NN, by validating the number of layers {1, 2, 3} and the size of the dense layers {64, ..., 512}. Other parameters, such as the early stopping patience, the optimizer (Adam) and its learning rate ($10^{-4}$), as well as the loss function (binary cross-entropy), were fixed a priori on the basis of preliminary experiments. The tuned parameters for the Linear SVM are the tolerance threshold, with $\mathrm{tol} \in \{10^{i} : i = -3, \dots, 1\}$, and the regularization parameter $C$, with $C \in \{10^{i} : i = -1, \dots, 4\}$. Finally, for the Random Forest, the tuned parameters are the maximum depth of the trees, with possible values in {20, ..., 100}, the number of estimators, with values covering the range {20, ..., 200}, and the impurity criterion, with possible values gini and entropy. All experiments were carried out using the same train and test splits and maintaining the same distribution between classes. Binary accuracy was used as the evaluation metric.

3.3 Text Pre-processing Pipeline

Given the importance of word normalization for text classification [15], and since food ingredient labels may contain abbreviations, punctuation, grammatical errors, printing problems and further issues that can drastically hamper text classification [9], we created a pipeline to pre-process and clean the text. The pre-processing pipeline is made up of three steps: number removal, stop-word removal and word stemming. For this purpose, we used models that are already available in the NLTK library [7].
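The three steps can be sketched in a few lines. The paper does not say which NLTK components were used, so this sketch deliberately avoids guessing: it is written with the standard library only, and both the stop-word set and the suffix-stripping rule are toy stand-ins (assumptions) for the Italian stop-word list and stemmer that NLTK provides. The function names are hypothetical.

```python
import re

# Toy stop-word set; a stand-in for a full Italian stop-word list.
STOP_WORDS = {"di", "e", "a", "con", "in", "il", "la", "che", "uno"}
# Toy suffixes; a stand-in for a real Italian stemmer.
SUFFIXES = ("i", "e", "o", "a")


def remove_numbers(text):
    """Step 1: drop digits, decimals and percentages such as '3%'."""
    return re.sub(r"\d+([.,]\d+)?%?", " ", text)


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-zàèéìòù]+", text.lower())


def remove_stop_words(tokens):
    """Step 2: discard tokens found in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def toy_stem(token):
    """Step 3: strip one final vowel from tokens longer than three letters."""
    if len(token) > 3 and token.endswith(SUFFIXES):
        return token[:-1]
    return token


def preprocess(text):
    tokens = tokenize(remove_numbers(text))
    return [toy_stem(t) for t in remove_stop_words(tokens)]


cleaned = preprocess("Farina di frumento, olio di palma 3%, sale")
# cleaned == ['farin', 'frument', 'oli', 'palm', 'sal']
```

Swapping the two toy components for real linguistic resources would leave the structure of `preprocess` unchanged.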



Table 2. Positive and negative samples for each dataset. Each item is reported in the original language.

Dataset: [missing]
Positive sample: Acqua, sciroppo di glucosio e fruttosio, maltitolo destrina, aromi, edulcorante, sucralosio, lactobacillus casei, shirota yakult light, senza glutine
Negative sample: Cozze cilene con guscio, totano atlantico, acqua, vongole con guscio, pomodori, olio extravergine d’oliva, code di gambero Indo-pacifico, metabisolfito di sodio, prezzemolo, aglio, basilico, sale, pepe nero. Prodotto in uno stabilimento che utilizza prodotti a base di cereali contenenti glutine, crostacei, prodotti a base crostacei, derivati del latte, pesci, prodotti base di pesci o molluschi e sedano

Dataset: [missing]
Positive sample: Yogurt intero, latte intero, fermenti lattici vivi, streptococcus thermophilus, lactobacillus bulgaricus, preparazione a base di pera pari all’3% nel prodotto finito, zucchero di canna, acqua, addensante pectina, correttore acidita, acido citrico, zucchero di canna biologico, senza glutine
Negative sample: Farina di frumento, olio, grassi vegetali non idrogenati, olio di palma, olio di girasole, acqua, alcol, sale, succo di limone concentrato, agente di trattamento della farina, cisteina, può contenere tracce di latte

Dataset: [missing]
Positive sample: Farina di grano tenero, acqua, latte scremato, grassi deidrogenati, oli vegetali, zucchero, lievito, sale, emulsionanti, trattato con alcool etilico, prodotto in uno stabilimento in cui si utilizzano olio di semi di sesamo
Negative sample: Formaggio crescenza, latte pastorizzato, crema di latte, fermenti lattici, sale, caglio, farina di grano tenero tipo 00, acqua, olio d’oliva, polpa di pomodoro, pomodoro, succo di pomodoro, sale di capperi, capperi, sale di acciughe, acciughe, olio di semi di girasole, sale, origano, olive taggiasche

Dataset: [missing]
Positive sample: Passata di pomodoro, merluzzo, calamari, cozze sgusciate, totani, olio di girasole, pomodoro, carote, vino rosso, cipolla, sedano, sale, prezzemolo, amido modificato, tapioca, aglio, olio extravergine d’oliva, peperoncino
Negative sample: Filetti di tacchino, farina di frumento, glutine, acqua, olio di palma, sale, amido di riso, amido di frumento, antiossidanti, ascorbato di sodio, aromi, spezie, può contenere latte e uova

Dataset: Salicylates
Positive sample: Patate, acqua, sale, aceto, vino, conservante, solfiti, ortaggi, salamoia, cipolle, carciofi, cetrioli, carote, piselli, acqua, sale, aceto, zucchero, antiossidante, anidride solforosa, olio di semi di girasole, funghi champignon, salamoia, agaricus bisporus, acqua, sale, tonno, acciughe sottolio, tuorlo d’uovo salato, tuorlo d’uovo pastorizzato, vino, conservante, solfiti, succo di limone concentrato, acidificante, acido citrico
Negative sample: Patate, carote, pomodoro, piselli, zucca, porro, zucchine, fagioli, borlotti, cavolfiori, sedano, verza, fagiolini, spinaci, finocchio, cipolla, basilico, prezzemolo, aglio, può contenere tracce di cereali e glutine

After the initial exploratory data analysis and some preliminary experiments, it emerged that, given the specificity of the task, removing standard Italian stop words and stemming were not enough. We identified a set of additional useless and/or potentially harmful words whose semantic relationships, which describe their behavior and meaning, do not seem to be captured by the feature extraction techniques described in Sect. 3.4.



Following the identification of these words, we performed another set of preliminary experiments to evaluate whether they could be safely removed from the input data. This study confirmed our initial hypothesis, and these words were removed. The removed words are mainly proper names and/or adjectives. In some cases, these words were ambiguous, carried information that was redundant with respect to the ingredients label, or were meaningless for solving the tasks. For the reproducibility of the experiments and to give a clear view of the difficulties of this pre-processing phase, a couple of examples are reported. The word “Sorrento” is the proper name of a town in Southern Italy. Several foods that are typical of this area, especially PDO (Protected Designation of Origin) [11] ones, take the name of this town as part of their label. In some cases, as for “Gnocchi alla Sorrento”, the presence of this word on the ingredients label may indicate membership of the Milk or Nickel food clusters, since it may suggest the presence of milk derivatives or other allergens. In many other cases, the word is ambiguous and meaningless, as for the ingredient “limone di Sorrento”, where “Sorrento” only indicates the provenance of the lemon. Removing the word “Sorrento” is therefore necessary to avoid possible errors during the feature extraction phase. The word “Grappa” is the proper name of a region in Northern Italy, but it is also the name of both an Italian liqueur and an Italian cheese. This word may thus indicate that the food belongs to the Milk cluster, but it can also be meaningless, as for the liqueur, which creates ambiguity. In this case too, the meaning of the word does not seem to be easily captured by the feature extraction techniques, so it was removed for the reasons stated above. Since the data cannot be disclosed, as they are protected by an NDA, the whole list of removed words cannot be released at this time.

3.4 Feature Extraction

The feature representation of text documents plays a critical role in several NLP tasks. The same technique might perform excellently in some tasks and poorly in others. Therefore, in order to create a baseline for future work, three different feature extraction techniques were applied for the Linear SVM, Random Forest and Dense NN: Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words (BoW), and Word2Vec. Since each of these techniques offers a different text representation, a combination of the features generated by those approaches was also considered [3]. A set of experiments was performed for each of the aforementioned techniques. For the CNN and LSTM models, the feature extraction process relies on the Keras Embedding [4] technique, while other embeddings such as Word2Vec were excluded after preliminary tests. The embedding size was tuned and fixed a priori through preliminary experiments.
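To make the idea of combining representations concrete, the sketch below builds BoW and TF-IDF vectors by hand and concatenates them. It is an illustration under stated assumptions: the TF-IDF variant (relative term frequency times log(N/df)) is one common definition and is not confirmed by the paper, the three tokenised documents are made up, and the Word2Vec part of the combination is left out because the paper does not describe how word vectors were aggregated per document.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Sorted list of all distinct tokens in the corpus."""
    return sorted({tok for doc in docs for tok in doc})


def bow(doc, vocab):
    """Bag of Words: raw count of each vocabulary token in the document."""
    counts = Counter(doc)
    return [counts[t] for t in vocab]


def idf(docs, vocab):
    """Inverse document frequency log(N / df) for each vocabulary token."""
    n = len(docs)
    return [math.log(n / sum(1 for d in docs if t in d)) for t in vocab]


def tfidf(doc, vocab, idf_weights):
    """Relative term frequency multiplied by the IDF weight."""
    counts = Counter(doc)
    return [(counts[t] / len(doc)) * w for t, w in zip(vocab, idf_weights)]


# Made-up, already pre-processed documents.
docs = [
    ["latte", "sale", "caglio"],
    ["farina", "sale", "lievito"],
    ["farina", "acqua", "sale"],
]
vocab = build_vocab(docs)
weights = idf(docs, vocab)

# Combined representation: BoW features followed by TF-IDF features.
combined = [bow(d, vocab) + tfidf(d, vocab, weights) for d in docs]
```

Each combined vector has twice the vocabulary size here; a document-level embedding could be appended in the same way to obtain a three-way combination.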



In line with the literature [6, 10], preliminary experiments confirmed that word embeddings pre-trained on generic corpora, like Word2Vec, do not capture the specificity of a domain and its semantics and knowledge. Moreover, the distribution of the words, the size of the corpora and the size of the datasets are extremely different from the Wikipedia dataset used to pre-train Word2Vec, and this would introduce considerable errors in the final experimental assessment. Therefore, given the domain specificity and the specific nature of the task, and in order to improve performance and the capacity to extract semantic relationships from the data, a Word2Vec model was trained from scratch so as to be more suitable for domain-specific words. A separate Word2Vec model was trained for each dataset.

3.5 Results
Shallow and deep techniques were compared by measuring their capability to separate foods that might cause intolerance or allergic reactions from foods that are harmless. As illustrated in Table 1, in order to evaluate each model, the original data were split into train and test sets. The accuracy scores achieved on the test set were collected in order to compare the models. To increase the stability of the results obtained with the neural networks, the train and test procedure was repeated 5 times with the same train and test sets, and the average and standard deviation of the accuracy scores reached in each run were computed. Moreover, to give a clear view of how well each model performed on each task, a rank based on the accuracy score was computed. Finally, for each model, the average rank was computed. The average rank was used to identify the best approach and a future baseline for this challenge. The training and test splits used are the same for each algorithm. The whole procedure was applied to each dataset and each algorithm. Table 3 reports the highest accuracy score achieved by each algorithm on each dataset, that is, the highest score reached by the given algorithm combined with one of the aforementioned feature extraction techniques on the specified dataset. The keyword All indicates that all feature extraction techniques mentioned in Sect. 3.4 were exploited at the same time. The average rank is also reported. From Table 3 we can see that Linear SVM and Dense Neural Networks achieved the best overall performance. Random Forest achieved good performance, too. Even though CNNs achieved the highest score on the Milk dataset, matching SVM and Random Forest, in general they performed poorly, more like LSTMs. In light of these results, the scores achieved by CNNs and LSTMs suggest that these networks have been strongly penalized by the low number of samples in the datasets and by the embedding technique. From the data representation point of view, Table 3 highlights that the highest scores were achieved by using the concatenation of Word2Vec, TF-IDF and Bag of Words representations or solely by using the Bag of Words representation.
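The evaluation protocol described above (mean and standard deviation over repeated runs, a per-dataset rank, then an average rank per model) can be sketched as follows. The scores are invented placeholders rather than the paper's results, the model and dataset names are hypothetical, and the tie rule (tied models share the best rank) is an assumption, since the paper does not say how ties were handled.

```python
from statistics import mean, stdev


def summarize_runs(accuracies):
    """Mean and sample standard deviation over repeated runs."""
    return mean(accuracies), stdev(accuracies)


def rank_models(scores):
    """Rank 1 is the best accuracy; tied models share the same rank."""
    return {
        model: 1 + sum(1 for other in scores.values() if other > score)
        for model, score in scores.items()
    }


def average_rank(per_dataset_scores):
    """Average each model's rank over all datasets."""
    ranks = [rank_models(s) for s in per_dataset_scores.values()]
    return {m: mean(r[m] for r in ranks) for m in ranks[0]}


# Placeholder accuracies for three hypothetical models on two datasets.
scores = {
    "dataset_a": {"model_x": 0.95, "model_y": 0.93, "model_z": 0.90},
    "dataset_b": {"model_x": 0.91, "model_y": 0.94, "model_z": 0.94},
}
avg = average_rank(scores)
run_mean, run_std = summarize_runs([0.97, 0.98, 0.96, 0.97, 0.97])
```

In this toy example `model_y` obtains the lowest average rank (1.5) and would be selected as the baseline.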



In agreement with the literature, this confirms that BoW can work better than other techniques when the context of the domain is very specific, as in this case. Moreover, although the representations supplied by Word2Vec and TF-IDF do not seem to be a good fit for these tasks, the information provided by these feature extraction techniques is still useful. As confirmed by the results, the combination of Word2Vec and TF-IDF with Bag of Words has often outperformed the single techniques.

Table 3. Results achieved in each dataset by each model. The keyword All means the model exploits the combination of all techniques mentioned in Sect. 3.4.

Model            Milk            Yeast           Salicylates     Nickel        Wheat
Random Forest    0.974 (BoW)     0.910 (All)     0.936 (BoW)     0.949 (All)   0.957 (All)
Linear SVM       0.974 (All)     0.930 (All)     0.939 (All)     0.953 (BoW)   0.960 (BoW)
Dense NN         0.953, 0.963 [columns not recoverable, other values missing]
CNN              0.974 ± 0.044   0.918 ± 0.170   0.912 ± 0.061   [missing]     [missing]

Average rank: 4.00, 4.75 [models not recoverable, other values missing]


The results shown in Sect. 3.5 indicate that Linear SVM and Dense NN best fit the initial challenges. In addition, according to the experiments, and confirming the literature, Random Forest shows very promising results as well. The results highlight that the representation offered by Word2Vec, which exploits semantic relationships between words, does not seem to properly fit these specific tasks. On the other hand, it must be underlined that the size of the datasets and the size of the samples tend to be too small for Word2Vec. Therefore, it is reasonable to conduct more extensive experiments with word embeddings on other datasets before completely excluding these techniques. In the same way, TF-IDF does not seem to perform well; this suggests that some important words appear across many training records rather than repeatedly within a single document and, as a result, are penalized. An example is the word “semi”. This word is important for the Nickel cluster, since it may be a strong indication of the presence of nickel due to the production process of the food. In this case, despite the word’s potential usefulness in solving the task, its importance is drastically reduced because it is present in several samples. On the other hand, the combination of Word2Vec, TF-IDF and BoW allowed us to achieve better results in several tasks, meaning that different representations allow the algorithms to better generalize between classes. This confirms that the representations obtained by TF-IDF and Word2Vec, despite their poor results, are still useful, since they are complementary to the features extracted by Bag of Words. It also confirms our initial hypothesis that the union of the three techniques can improve the final results. The results shown in Table 3 suggest that techniques that generally work very well in several NLP tasks, like Convolutional Neural Networks and LSTMs, do not seem to properly fit this domain, given their poor performance. However, it is premature to claim that CNNs and LSTMs do not work on these tasks until further, exhaustive experiments on bigger datasets have been performed. Finally, despite the good results, we believe that these models can be improved. In future work, we would like to collect more data to produce datasets with a larger number of samples, in order to improve performance and to perform more exhaustive experiments. In this perspective, we are also planning to add a new food cluster, i.e. sugar. From a machine learning point of view, this work has established the baselines for future work as Linear SVM and/or Dense NN with either Bag of Words or the combination of Word2Vec, TF-IDF and Bag of Words. For future experiments, we are planning to use other embedding techniques at different levels, like character embeddings, which, from preliminary studies, seem to be a good fit for this challenge with respect to these baselines. In addition, we intend to test other algorithms and existing pre-trained deep neural network architectures through transfer learning. Our main future goal is to generate a robust and reliable model for these challenges that can be included in our software products in order to help people in their everyday life.
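The TF-IDF penalisation described above for frequent but informative words such as “semi” can be illustrated with a short worked example. It assumes the common definition of the inverse document frequency with a natural logarithm, which the paper does not confirm, and the corpus size and document frequencies are hypothetical numbers chosen only for illustration.

```latex
% Assumed definition: N documents in total, df(t) documents containing term t.
\[
  \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
\]
% Hypothetical corpus with N = 1000 ingredient labels.
% A word present in 600 labels receives a small weight:
\[
  \mathrm{idf}(t_{\text{frequent}}) = \log \frac{1000}{600} \approx 0.51
\]
% A word present in only 10 labels receives a much larger weight:
\[
  \mathrm{idf}(t_{\text{rare}}) = \log \frac{1000}{10} \approx 4.61
\]
```

Under these assumptions, a term that appears in most labels is weighted roughly nine times less than a rare one, regardless of how informative it is for a given cluster.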

References

1. Altschul, A.S., Scherrer, D.L., Muñoz-Furlong, A., Sicherer, S.H.: Manufacturing and labeling issues for commercial products: relevance to food allergy. J. Allergy Clin. Immunol. 108(3), 468 (2001)
2. De Silva, D., Geromi, M., Halken, S., Host, A., Panesar, S.S., Muraro, A., Werfel, T., Hoffmann-Sommergruber, K., Roberts, G., Cardona, V., et al.: Primary prevention of food allergy in children and adults: systematic review. Allergy 69(5), 581–589 (2014)
3. Enriquez, F., Troyano, J., López-Solaz, T.: An approach to the use of word embeddings in an opinion classification task. Expert Syst. Appl. 66, 1–6 (2016)
4. Gal, Y., Ghahramani, Z.: A theoretically grounded application of dropout in recurrent neural networks. In: Advances in Neural Information Processing Systems, pp. 1019–1027 (2016)
5. Khan, W., Daud, A., Nasir, J.A., Amjad, T.: A survey on the state-of-the-art machine learning models in the context of NLP. Kuwait J. Sci. 43, 95–113 (2016)
6. Khatua, A., Khatua, A., Cambria, E.: A tale of two epidemics: contextual Word2Vec for classifying twitter streams during outbreaks. Inf. Process. Manag. 56(1), 247–257 (2019)
7. Loper, E., Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1 (ETMTNLP 2002), Stroudsburg, PA, USA, pp. 63–70. Association for Computational Linguistics (2002)
8. Muraro, A., Werfel, T., Hoffmann-Sommergruber, K., Roberts, G., Beyer, K., Bindslev-Jensen, C., Cardona, V., Dubois, A., Dutoit, G., Eigenmann, P., et al.: EAACI food allergy and anaphylaxis guidelines: diagnosis and management of food allergy. Allergy 69(8), 1008–1025 (2014)
9. Riloff, E.: Little words can make a big difference for text classification. In: SIGIR, vol. 95, pp. 130–136 (1995)
10. Sarma, P.K., Liang, Y., Sethares, W.A.: Domain adapted word embeddings for improved sentiment classification. arXiv preprint arXiv:1805.04576 (2018)
11. Scuderi, A., Pecorino, B.: Protected designation of origin (PDO) and protected geographical indication (PGI) Italian citrus productions. Acta Hortic. 1065, 1911–1917 (2015)
12. Sicherer, S.H.: Food Allergy: Practical Diagnosis and Management. CRC Press, Boca Raton (2016)
13. Simons, E., Weiss, C.C., Furlong, T.J., Sicherer, S.H.: Impact of ingredient labeling practices on food allergic consumers. Ann. Allergy Asthma Immunol. 95(5), 426–428 (2005)
14. Speciani, A.F., Soriano, J., Speciani, M.C., Piuri, G.: Five great food clusters of specific IgG for 44 common food antigens. A new approach to the epidemiology of food allergy. Clin. Transl. Allergy 3(S3), P67 (2013)
15. Toman, M., Tesar, R., Jezek, K.: Influence of word normalization on text classification. Proc. InSciT 4, 354–358 (2006)
16. Wu, Q., Ye, Y., Zhang, H., Ng, M.K., Ho, S.-S.: ForesTexter: an efficient random forest algorithm for imbalanced text categorization. Knowl.-Based Syst. 67, 105–116 (2014)

Classification Based Method for Disfluencies Detection in Spontaneous Spoken Tunisian Dialect

Emna Boughariou (1), Younès Bahou (1, 2), and Lamia Hadrich Belguith (1)

(1) Sfax University, Sfax, Tunisia
(2) HA’IL University, Hail, Kingdom of Saudi Arabia

Abstract. Disfluency processing is the task of detecting infelicities in spoken language. This paper proposes a transcription-based method for automatically detecting disfluencies in spontaneous spoken Tunisian dialect. Our method uses a classification model based on a sequence-tagging approach with purely linguistic features for detecting disfluent segments of the utterance. From our study of the Tunisian dialect, we have identified eight types of disfluencies, namely syllabic elongations, speech words, word-fragments, simple repetitions, complex repetitions and self-corrections, which include insertions, substitutions and deletions. We have implemented the proposed method, and the experiments show that it improves the disfluency detection task in the spoken Tunisian dialect, with a rate of 87.4%.

Keywords: Disfluencies detection · Tunisian dialect · Sequence-tagging approach


A characteristic of spontaneous speech that makes it different from written text is the presence of disfluencies. They are additional noises that are corrected by the speaker during the dialogue. Disfluencies are frequent in all forms of spontaneous speech, whether casual discussions or formal arguments [35], and their rate varies depending on the speaker and the context [32]. Detecting disfluencies presents a significant challenge for tasks dealing with spontaneous speech processing, such as speech parsing and machine translation. The Tunisian Dialect (TD) has sparked increased interest, especially in dialect identification, morpho-syntactic tagging [8], sentiment analysis [25], speech understanding [1] and recognition [24], among others. To our knowledge, Neifar et al. [26] is the only study on disfluency processing in spoken TD, and it was proposed for a restricted domain.

© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 182–195, 2021.

Classification Based Method for Disfluencies Detection in TD


Based on our study of the TD, we have identified eight types of disfluencies, namely syllabic elongations, speech words, word-fragments, simple repetitions, complex repetitions, insertions, substitutions and deletions. Syllabic elongations are abnormal vowel lengthenings of a syllable lasting more than 1 s [6]. Elongations usually appear at the final syllable of the word. In contrast, we noticed that 38% of elongations affect the first syllable of the word in the TD. Speech words are fillers and discourse markers used only in spontaneous speech. Word-fragments are truncated words started and interrupted by the same speaker. They may be dropped, taken up again or replaced. Simple repetitions are words that occur several times consecutively. Complex repetitions can be either one word that is repeated non-consecutively in the utterance or a group of words repeated identically. Insertions correct a part of the speech by adding new words. Substitutions correct a part of the speech by replacing some words with new ones. Finally, deletions correct a part of the speech by removing words. Insertions, substitutions and deletions are considered types of self-corrections.

The heterogeneity in how the different disfluencies can be detected led us to divide the types of disfluencies into two groups: simple disfluencies and complex disfluencies. Simple disfluencies include syllabic elongations, speech words, word-fragments and simple repetitions. They can be identified with a simple rule-based approach. Complex disfluencies include complex repetitions, insertions, substitutions and deletions. Their structure fits the reparandum-interregnum-repair pattern of Shriberg (1994) [32]. The reparandum is the disfluent portion of the utterance that could be corrected or abandoned. The interregnum (also called editing term) is an optional portion of the utterance; it could include speech words. The repair is the portion of the utterance that corrects the reparandum.

In this paper, we present our transcription-based method for automatically detecting the eight types of disfluencies in spontaneous spoken TD. Our method combines a rule-based approach to detect simple disfluencies with a statistical approach, using purely linguistic features, to detect complex disfluencies in the utterance. The remainder of this paper is organized as follows: Sect. 2 presents an overview of disfluency processing approaches. Section 3 highlights the challenges of disfluency detection in TD. Our disfluency processing method for TD is detailed in Sect. 4. Section 5 presents the experimental results of our proposal, before our main conclusions are drawn in Sect. 6.
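As an illustration of why simple disfluencies lend themselves to rules, the definition of a simple repetition given above (a word occurring several times consecutively) can be written as a short scan over the tokens. This is a sketch only, not the authors' rule set: the function name is hypothetical, and an English utterance is used as a placeholder instead of transcribed Tunisian dialect.

```python
def find_simple_repetitions(tokens):
    """Return (start, end) spans, end exclusive, of runs in which
    the same token occurs two or more times in a row."""
    spans = []
    i = 0
    while i < len(tokens):
        j = i + 1
        # Extend the run while the next token equals the current one.
        while j < len(tokens) and tokens[j] == tokens[i]:
            j += 1
        if j - i > 1:
            spans.append((i, j))
        i = j
    return spans


# Placeholder utterance with two simple repetitions.
tokens = "i want want want a ticket to to tunis".split()
spans = find_simple_repetitions(tokens)
# spans == [(1, 4), (6, 8)]
```

Complex repetitions and self-corrections do not reduce to such a local test, since the repeated or corrected material may be separated by other words.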


Related Work

Detecting disfluencies in spontaneous speech has been widely studied by researchers in different natural language processing communities, including speech processing and psycholinguistics. Approaches to the disfluency detection task fall into three main categories: noisy channel models, transition-based models and sequence tagging models.


E. Boughariou et al.

The first approach uses Noisy Channel Models (NCM), which are statistical models usually integrated into spell checkers, question-answering systems, speech recognition systems, etc. These models are combined with Tree Adjoining Grammars (TAG) [31] to look for similarity between the reparandum and the repair. The NCM allocates higher probabilities to exact words of the reparandum, using the similarity between the reparandum and the repair. Then, it generates n-best disfluency analyses for each utterance at test time, using the probabilities of the NCM-TAG and a bigram Language Model (LM) derived from training data [21]. Later, some works showed that rescoring the n-best analyses with deep learning LMs [22], trained on large speech and non-speech corpora, and feeding the LM scores along with other features (i.e. pattern match and NCM ones) into a MaxEnt reranker [19] improves the performance of the baseline NCM. These models achieve promising results in detecting edited words. However, the main weakness of NCM models is that restarts are not suitably detected when there is no repair. Also, NCM creates complex runtime dependencies (the parser in [19] has a complexity of $O(N^5)$).

The second approach uses transition-based analysis models that detect disfluencies while simultaneously identifying the syntactic structure of the utterance. The authors of [17, 30, 34] use a deterministic transition-based parser with the Arc-Eager algorithm [27] to model the problem. Arc-Eager is a bottom-up parsing strategy that is used in greedy and k-beam transition-based parsers. Disfluency detection with the transition-based approach is achieved by adding new actions to the parser to detect and remove the disfluent parts of the utterance and their dependencies from the stack. These parsers have the advantage of being very accurate while being able to parse an utterance in linear time. They can capture long-range dependencies of disfluencies as well as chunk-level information. However, the joint models require large annotated treebanks containing both disfluency and syntactic structure annotations for training. Besides, they introduce an additional annotated syntactic structure, which is very expensive to produce and can cause noise by significantly enlarging the output search space.

The third approach uses word classification models. It is based on statistical Machine Learning [2, 10, 11] and Deep Learning [3, 21, 23] algorithms. These models predict the class of the candidate word according to the BIO encoding [29]. Experimental results in previous related works show that the BIO encoding scheme improves the detection of disfluencies in the utterance. A model labels words as being inside or outside of the edit region. Then, to capture different pattern-matching lexical cues for repetition and self-correction disfluencies, later works extend the baseline state space with explicit repair states, so that words in the repair region are considered in addition to those in the edit region [28, 35, 36]. In other works [12, 13], integer linear programming constraints are applied to the output of the classifier to avoid inconsistencies between neighbouring labels. These methods achieve state-of-the-art performance using extensions of the BIO encoding scheme. However, they are not powerful enough to capture complicated disfluencies with longer spans or distances. Another drawback of these models is that they are unable to exploit chunk-level features [33].

Classification Based Method for Disfluencies Detection in TD


With regard to the work dealing with disfluencies, we notice a large number of studies for well-resourced languages, compared with only a few studies devoted to standard or dialectal Arabic. [20] proposed a numerical learning-based method for the automatic processing of complex disfluencies in spontaneous oral Arabic utterances. Starting from a pre-processed and semantically labeled utterance, this method delimits and labels the conceptual segments of the utterance [6]. Then, starting from the segmented utterance, it delimits and corrects the disfluent segments. [1,26] adopt the symbolic rule-based method of [5] for the automatic processing of complex disfluencies in spontaneous oral Arabic utterances. Disfluencies are delimited and corrected using a set of rules and patterns that follow the reparandum-interregnum-repair structure. Besides, a step of Out-Of-Vocabulary (OOV) words processing improves the disfluencies processing. The main weakness of these works is that they are conceived for a restricted domain (e.g., railway information) where the lexicons are relatively small, and that there are no annotation tools and guidelines to annotate the disfluencies.

3 Challenges in Disfluencies Processing for Spoken Spontaneous TD

3.1 Spontaneous Spoken Language Characteristics

Although spontaneous spoken language is the most natural and easiest means of communication, it presents several challenges for the task of disfluencies processing that make it different from written language.

Linearity. The speech signal is irreversible: it is impossible for a speaker to correct, add or remove elements from the speech signal already pronounced.

Prosody. Prosody plays an important role in the segmentation of spoken language, without allowing a structuring as important as one would expect from written texts [18]. Filled pauses, as an aspect of prosody, prove imprecise for detecting punctuation in a signal corresponding to read text.

Pronunciation Variability. The same utterance can be pronounced in different ways by speakers depending on region, social class, age and era [9]. These variations mainly concern phonology, through regional accents.

Spontaneity. Speech spontaneity gives rise to the disfluency phenomenon, which raises multiple difficulties. The disfluency types are numerous: we can cite phonological disfluencies (e.g., schwas, assimilations, syllabic elongations), lexical disfluencies (e.g., discourse markers, filled and silent pauses) and morpho-syntactic disfluencies (e.g., repetitions, word-fragments, self-corrections).



Out-of-Vocabulary Words. OOV words are speech recognition errors that correspond to insertions, deletions, and confusions of words generated by automatic recognition systems [4].

Furthermore, spontaneous spoken language shares some characteristics with written language that could hinder the detection of disfluencies.

Synonymy. The speaker adds synonyms of some words in the utterance, which is unnecessary for the syntactic structure of the utterance. Besides, for varieties that code-switch between several languages, such as TD, a word can be repeated by its equivalent in another language.

Enumeration. An enumeration consists of successively detailing the various elements of which a generic concept or an overall idea is composed. In some cases, the speaker adds, replaces or resumes information. This can be mistaken for a self-correction disfluency.

3.2 TD Characteristics

TD is a spoken variety of Arabic in which Tunisians code-switch between Modern Standard Arabic (MSA) and foreign languages, especially French, Maltese, Spanish and English. These borrowed terms are often used without being adapted to TD phonology [37]. Besides, TD is divided into six major dialectal areas according to the Tunisian regions [16]. Therefore, the vocabulary varies across areas, involving phonological, morphological, lexical and syntactic variations.

At the phonological level, short vowels are neglected, especially if they are located in the last position of the word. Besides, some consonants are pronounced differently across regions. At the morphological level, new dialectal affixes and suffixes are added while others are removed. New clitics are introduced, such as the negation particle “[mA] (what)” preceding verbs and the interrogative suffix “[Siy]”. TD also uses the numeral word “[zwz] (two)” after and before plural nouns instead of MSA dual suffixes. Besides, TD is characterized by the absence of the feminine and masculine dual and of the feminine plural. At the lexical level, some foreign words are affected by the conjugation rules of TD and by the Arabic enclitics or proclitics, to express an action, an order or a possession of things [24]. At the syntactic level, the canonical order of words in a TD verbal sentence usually follows the SVO (Subject Verb Object) structure. Also, TD reduces the number and the form of MSA pronouns from twelve to seven personal pronouns [24]. TD shares some specificities of MSA, such as agglutination and the irregularity of syntactic constituents.


Classification Based Method

In this paper, we present a transcription-based method guided by the morpho-syntactic level, to handle disfluency removal from transcribed utterances of spoken TD. Simple disfluencies are processed using a rule-based approach, while complex disfluencies are processed using a statistical machine learning model within a sequence-tagging approach. The choice of the sequence-tagging approach is due to two main reasons:

• As mentioned in Sect. 3.2, the morpho-syntactic aspect of TD does not follow an exact grammar; therefore, the use of noisy channel models and transition-based models requires complex processing and resources.
• The extension of the BIO encoding schema improves the detection of complex repetitions and self-correction types.

Our proposed method is carried out using the corpus DisCoTAT (Disfluencies Corpus from Tunisian Arabic Transcription) [7]. It consists of 38,657 utterances coming mainly from recordings of railway information services and Tunisian TV channels and radio programs. The transcription was achieved manually according to the OTTA and CODA-TUN conventions [37]. DisCoTAT is composed of a mosaic of words coming from various languages, mainly TD (72%), MSA (16%), French (12%) and English (3%). DisCoTAT is enriched with two types of annotation: morpho-syntactic annotation using TD-WordNet and hand-crafted disfluencies annotation using the annotation tool DisAnT [6]. We randomly divided DisCoTAT into two parts: 80% of the utterances are used for training and 20% for testing. The proposed method consists of three main steps, namely pre-processing, simple disfluencies processing and complex disfluencies processing, as illustrated in Fig. 1.

Fig. 1. Steps of the proposed method.



4.1.1 Standardization

The purpose of standardization is to segment the utterance into lexical units. A lexical unit can be a word or a group of words (i.e., compound words). Thus, standardization groups some words into a single unit that will have a single POS label. For example, “[mnzl] (house)” and “[bwzyān] (Bouzayen)” constitute one lexical unit, which is labelled (“N Prop”, proper noun) in the compound words lexicon of TD-WordNet [7].
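The grouping described above can be sketched as a greedy longest-match against a compound-word lexicon. This is only an illustration under stated assumptions: the function name `standardize`, the one-entry `COMPOUNDS` dictionary and the maximum compound length are hypothetical, and the real format of the TD-WordNet lexicon is not given in the text.

```python
# Hypothetical compound-word lexicon: tuple of tokens -> POS label.
# The actual TD-WordNet structure is not described; this is an assumption.
COMPOUNDS = {("mnzl", "bwzyān"): "N Prop"}
MAX_LEN = 3  # assumed longest compound, in tokens


def standardize(tokens, compounds=COMPOUNDS, max_len=MAX_LEN):
    """Greedily merge known compound words into single lexical units."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that longer compounds win.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in compounds:
                units.append((" ".join(candidate), compounds[candidate]))
                i += n
                break
        else:
            # No compound starts here: keep the token, POS still unknown.
            units.append((tokens[i], None))
            i += 1
    return units
```

Tokens that are not part of any compound are returned with `None` as POS label, leaving them for the morpho-syntactic analysis that follows.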



4.1.2 Morpho-Syntactic Analysis

Lexical units found are labeled with POS labels using TD-WordNet [7]. The morpho-syntactic analysis also lemmatizes unlabelled words, based on the TD-WordNet list of prefixes and suffixes, to find their POS labels. For example, the word “[māzālšy] (Is it always)” is an inflected form of the verb “[māzāl] (still)”, concatenated with the suffix “[šy]”, which refers to an interrogation tool.

4.2 Simple Disfluencies Processing

Simple disfluencies detection does not require sophisticated processing and can help improve the overall performance of the method. To detect simple disfluencies, we focus mainly on the POS tags of words.

Syllabic elongations are abnormal vowel lengthenings that affect either the first or the last syllable of the word. Their processing consists of detecting words that are not POS-tagged and contain more than two extensions in the first or last syllable. The following examples illustrate the two cases:

- A syllable lengthening at the beginning of the word: “[Swwwtk mw$ wADH] (Your voiiiice is not clear)”¹.
- A syllable lengthening at the end of the word: “[fmāāāā trān brk] (Thereee’s only one train)”.
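The elongation rule above can be approximated with a regular expression. The sketch below is a simplification under stated assumptions: it treats "more than two extensions" as three or more identical characters in a row, it does not check that the run falls in the first or last syllable, and the helper names `is_elongated` and `collapse` are hypothetical.

```python
import re

# Three or more copies of the same character in a row
# (assumed reading of "more than two extensions").
ELONGATION = re.compile(r"(.)\1{2,}")


def is_elongated(word, pos_tag):
    """Flag a word that has no POS tag and contains a lengthened character."""
    return pos_tag is None and ELONGATION.search(word) is not None


def collapse(word):
    """Reduce every lengthened run to a single character."""
    return ELONGATION.sub(r"\1", word)
```

Applied to the first example, `collapse("Swwwtk")` gives `"Swtk"`, which can then be looked up again for its POS tag.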

Speech words include hesitation marks (e.g., “[āh] (Ah)”) and discursive markers (e.g., “[y’ny] (meaning)”). They are frequently used in spontaneous oral productions. Detecting speech words is a word-based matching of words tagged (Marq Disc, discourse mark) and (Marq Hesit, hesitation mark). Experts have extracted all speech words recorded in DisCoTAT. Although speech words belong to the simple disfluencies category, they are eliminated only during the next step of the method, as they help to detect complex disfluencies.

Word-fragments are “syllables, speech sounds or single consonants, which are similar to the beginning of the next fully articulated word . . . [and] they may neither be equal to the whole next word” [14]. Word-fragments processing consists of checking whether the current word is not POS-tagged and is an integral part of the next word, as illustrated in the following example:

“[wqtA$ ybdA Al AlmWtmr] (When the the-conference begins)”.

Additionally, a thorough investigation of our corpus showed that a speech word can appear in the middle of a word-fragment case, as illustrated in the following example:

“[trān mtā āh mtāa twns] (The train of euh of-Tunis)”.

¹ The examples are translated word for word so that the disfluencies appear exactly.
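The word-fragment check can be sketched as a prefix test against the next full word, looking past at most one intervening speech word. This is a hedged illustration, not the authors' implementation: the function name `remove_fragments` is hypothetical, the tag strings follow the names printed in the text, and dropping the intervening speech word together with the fragment is the behaviour assumed here.

```python
# Tag names as printed in the text; the exact tagset encoding is an assumption.
SPEECH_TAGS = {"Marq Disc", "Marq Hesit"}


def remove_fragments(words, tags):
    """Drop untagged words that are a proper prefix of the next full word.

    A speech word sitting between the fragment and the full word is
    skipped for the comparison and removed together with the fragment.
    """
    kept = []
    i = 0
    while i < len(words):
        j = i + 1
        # Look past one intervening speech word, if any.
        if j < len(words) and tags[j] in SPEECH_TAGS:
            j += 1
        is_fragment = (
            tags[i] is None
            and j < len(words)
            and words[j].startswith(words[i])
            and words[j] != words[i]
        )
        if is_fragment:
            i = j  # skip the fragment (and the speech word, if present)
        else:
            kept.append(words[i])
            i += 1
    return kept
```

The inequality test keeps the definition quoted from [14]: a fragment may not be equal to the whole next word, so exact duplicates are left for the repetition rules.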



However, this speech word is removed in concordance with the word-fragments processing.

Simple repetitions are words that occur several times consecutively. Simple repetitions processing consists of checking whether the current word is equal to the next word, as illustrated in the following example:

“[EnA zwz zwz mHrjAnAt] (We have two two festivals)”.

A speech word can also appear between the repeated words. In this case, the speech word is removed in concordance with the simple repetitions processing, as illustrated in the following example:

“[vlAvp Omm vlAvp wzrA’] (Three emm three ministers)”.

4.3 Complex Disfluencies Processing

This step is used to handle complex repetitions and self-corrections (i.e., insertions, modifications and deletions). Complex disfluencies depend on well-defined structure rules for their detection. The complex disfluencies processing task is converted to a word classification task using BIO encoding [29]. We have fixed six classes based on the reparandum-interregnum-repair structure of [32], to detect the boundaries of the disfluent part (i.e., reparandum + interregnum) and the fluent part (repair):

• B RM: the beginning of the reparandum part,
• I RM: belongs to the reparandum part,
• B RP: the beginning of the repair part,
• I RP: belongs to the repair part,
• IP: Interregnum (i.e., belongs to speech words),
• O: fluent word.
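Since the reparandum and the interregnum form the disfluent part, a labelled utterance can be cleaned by keeping only the words tagged as repair or fluent. The sketch below illustrates this with a hypothetical English gloss; the function name `keep_fluent`, the underscore spelling of the labels and the example utterance are assumptions made for illustration.

```python
# Labels written with underscores here for convenience (an assumption);
# the text prints them as "B RM", "I RM", "B RP", "I RP", "IP" and "O".
DISFLUENT = {"B_RM", "I_RM", "IP"}


def keep_fluent(words, labels):
    """Return the words whose label is not reparandum or interregnum."""
    if len(words) != len(labels):
        raise ValueError("one label per word is required")
    return [w for w, lab in zip(words, labels) if lab not in DISFLUENT]


# Hypothetical gloss of a self-correction, not taken from the corpus.
words = ["the", "train", "to", "Tunis", "euh", "to", "Sfax"]
labels = ["O", "O", "B_RM", "I_RM", "IP", "B_RP", "I_RP"]
cleaned = keep_fluent(words, labels)
```

Here `cleaned` keeps the repair ("to Sfax") and drops the reparandum ("to Tunis") and the interregnum ("euh").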

4.3.1 Features Embedding

The performance of a statistical model is strongly influenced by the set of features used for classification. The task of detecting disfluencies is mainly related to either linguistic or prosodic features. Prosodic information (such as duration, rhythm, etc.) is absent from our work, since it deals with the processing of transcripts. We therefore rely only on linguistic features. We also used contextual features with a window of ±3 words, after experimenting with windows of ±1, ±2 and ±3 words. In this work, the choice of the word window is justified by the fact that TD utterances are not too long. Similarly, the corpus analysis has shown that the repair starts within a window that does not exceed three words after the disfluent part, without taking the interregnum into account. Finally, we used the dynamic criterion, which considers the classes assigned dynamically to the three previous words. Features are presented in Table 1.


Table 1. Features for CRF classifier.

• The current word is a subordinating conjunction
• The current word is a coordinating conjunction
• The current word is an interjection
• Repetition number of the current word in the utterance
• POS tag of the current word
• POS of the n preceding word with n = (1, 2, 3)
• POS of the n next word with n = (1, 2, 3)
• The current word starts with an uncompleted word in a window of 3 preceding words (Boolean)
• The current word is repeated in a window of 3 preceding words
• The current word is repeated in a window of 3 next words
• The current word is preceded by a subordinating or coordinating conjunction
• The current word is followed by a subordinating or coordinating conjunction
• Class of the n preceding words with n = (1, 2, 3)
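To make the window-based features concrete, the sketch below builds a feature dictionary for one position of an utterance. It covers only a subset of Table 1, and every name in it (`word_features`, the dictionary keys, the `<s>` and `</s>` padding symbols) is a hypothetical choice rather than the authors' encoding.

```python
def word_features(words, pos, prev_labels, i, window=3):
    """Build a feature dict for position i (a sketch of some Table 1 features).

    words, pos  : parallel lists for one utterance
    prev_labels : labels already assigned to positions < i
    """
    feats = {
        "pos": pos[i],
        "count_in_utt": words.count(words[i]),
        "rep_prev": words[i] in words[max(0, i - window):i],
        "rep_next": words[i] in words[i + 1:i + 1 + window],
    }
    for n in range(1, window + 1):
        # Pad with a sentinel when the window leaves the utterance.
        feats[f"pos-{n}"] = pos[i - n] if i - n >= 0 else "<s>"
        feats[f"pos+{n}"] = pos[i + n] if i + n < len(pos) else "</s>"
        feats[f"label-{n}"] = prev_labels[i - n] if i - n >= 0 else "<s>"
    return feats
```

The `label-n` entries correspond to the dynamic criterion: they depend on the classes already predicted for the preceding words, so features have to be computed left to right at prediction time.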


4.3.2 Model Generation

For the automatic generation of the classification rules, we have experimented with several machine learning algorithms. The results of the stochastic models generated showed that CRF gives high performance (89.2%). Conditional Random Fields (CRF) are a class of linear statistical models which are known to exhibit high performance in sequence labelling tasks [2]. CRF have been applied extensively in diverse NLP tasks, such as sentence segmentation, POS tagging and disfluencies processing, because of their ability to represent long-range dependencies in the observations. A CRF classifier takes into account the probability of co-occurrence between neighbouring labels and simultaneously estimates the best sequence of predicted labels for a given input utterance. Alharbi et al. [2] define the conditional probability of a sequence of words that may include some disfluent words by the following equation:

$$p(y \mid x, \lambda) = \frac{\exp\left(\lambda^{T} f(x, y)\right)}{Z(\lambda, x)}, \qquad (1)$$

where y is a sequence of labels for the observation x, f(x, y) is the set of feature functions, λ is the vector of model parameters and Z is the normalisation term. One weight (w), i.e. one component of λ, is determined for each feature. These weights are learned during training such that:

$$\lambda = \arg\max_{\lambda} \, p(Y \mid X, \lambda), \qquad (2)$$

and the label sequence can then be predicted from Eq. (3):

$$y^{*} = \arg\max_{y} \, p(y \mid x, \lambda). \qquad (3)$$
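As a small numeric illustration of Eqs. (1) and (3), the sketch below enumerates every label sequence for a very short input, computes the normalisation term Z by brute force and returns the most probable sequence. The reduced label set, the two toy feature functions and the weights are assumptions chosen for illustration; practical CRF implementations use dynamic programming instead of enumeration.

```python
import itertools
import math

LABELS = ["O", "B_RM", "B_RP"]  # reduced label set, for illustration only


def features(x, y):
    """Toy f(x, y): two counts over the whole sequence (assumed features)."""
    repeated_rm = sum(
        1 for i in range(len(x) - 1) if x[i] == x[i + 1] and y[i] == "B_RM"
    )
    rm_then_rp = sum(
        1 for i in range(len(y) - 1) if y[i] == "B_RM" and y[i + 1] == "B_RP"
    )
    return [repeated_rm, rm_then_rp]


def score(x, y, lam):
    """Unnormalised score exp(lambda^T f(x, y))."""
    return math.exp(sum(w * f for w, f in zip(lam, features(x, y))))


def predict(x, lam):
    """Return (y*, p(y*|x, lambda)) by enumerating every label sequence."""
    candidates = list(itertools.product(LABELS, repeat=len(x)))
    z = sum(score(x, y, lam) for y in candidates)  # Z(lambda, x)
    best = max(candidates, key=lambda y: score(x, y, lam))
    return best, score(x, best, lam) / z
```

With the hypothetical weights `[2.0, 1.0]` and the input `["zwz", "zwz"]`, the highest-scoring sequence labels the first word as reparandum and the second as repair.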


Words labelled B RM, I RM and IP are automatically deleted.

4.4


We detail each step through utterance (1) to demonstrate the principle of our proposed method:

Utterance 2 represents the result of the pre-processing step:

Utterance 3 represents the result of the simple disfluencies processing step. 4 and 5 are recognized as a simple disfluency, thus, 5 and 6 are recognized as a word-fragment disfluency.

Utterance 4 represents the result of the complex disfluencies processing step based on the classification-based model:

Utterance 5 represents the final outcome of the proposed method:



We have evaluated how well our method detects disfluencies in the evaluation set of DisCoTAT. We have implemented the method in the Java programming language with the NetBeans environment. We used the F-measure metric for the evaluation, comparing the results obtained with the reference results produced by two experts. The F-measure of our disfluencies processing method is about 87.4%. For the simple disfluencies processing module, we obtained an F-measure of 96.3%, while we achieved an F-measure of 78.5% for the complex disfluencies processing module. Table 2 summarizes the evaluation rates of disfluencies processing.

Table 2. Evaluation rates of disfluencies types

Disfluency type            F-Measure rate
Simple Disfluencies
  Syllabic elongation      98.6%
  Speech words             100%
  Word-fragments           100%
  Simple repetitions       86.7%
Complex Disfluencies
  Complex repetitions      87.4%
  Insertions               76.1%
  Substitutions            80.7%
  Deletions                69.9%
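The text does not state how the per-type rates are aggregated into the module-level and overall figures. One reading that is consistent with the reported numbers is an unweighted mean, which the following sketch checks; the aggregation rule is an assumption, and only the per-type values come from Table 2.

```python
# Per-type F-measure rates from Table 2 (in percent).
simple = {"Syllabic elongation": 98.6, "Speech words": 100.0,
          "Word-fragments": 100.0, "Simple repetitions": 86.7}
complex_ = {"Complex repetitions": 87.4, "Insertions": 76.1,
            "Substitutions": 80.7, "Deletions": 69.9}


def mean(values):
    """Unweighted arithmetic mean (assumed aggregation rule)."""
    values = list(values)
    return sum(values) / len(values)


simple_f = mean(simple.values())          # close to the reported 96.3
complex_f = mean(complex_.values())       # close to the reported 78.5
overall_f = mean([simple_f, complex_f])   # close to the reported 87.4
```

Under this reading, each disfluency type counts equally regardless of how often it occurs in the evaluation set.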

The perfect detection of speech words and word-fragments by the simple disfluencies processing module is due to the pre-processing step, especially the lemmatization. From the results obtained, we can conclude that the errors are mainly due to the following reasons:

a) The absence of semantic features. For simple and complex repetitions and substitutions, a significant number of cases fail because of the absence of the synonymy feature. For example, both “[qTAr]” and “[trAn]” mean (train) and must be recognized as synonyms.
b) The POS tagging. We noticed the presence of words with several POS tags. For example, “[AlmqAblp]” means both (game, Noun) and (across, Adverb) in TD-WordNet.
c) The presence of enumerations in the utterance. The structural aspect of an enumeration is very close to that of self-corrections [26].
d) The presence of OOV words. Some words in the utterance are unlabelled because they are not found in TD-WordNet.



In this paper, we investigated the automatic processing of simple and complex disfluencies in spontaneous spoken TD. We proposed a transcription-based method that investigates a sequence-tagging approach. We proposed a statistical model with purely linguistic features for detecting disfluencies of the utterance.



The statistical model improves the disfluencies processing task in spoken TD, with an F-measure of 87.4%. As future work, we intend to add semantic features to improve the efficiency of our method. We also plan to carry out an extrinsic evaluation of our method in an automatic POS tagger. Finally, we aim to evaluate our method using other corpora, such as Switchboard conversations [15].

References

1. Abbassi, H., Bahou, Y., Maaloul, M.H.: L’apport d’une approche hybride dans la compréhension de l’oral arabe spontané. In: 29th of Proceedings of International Business Information Management Association, Vienna, Austria, pp. 2145–2157, May 2017
2. Alharbi, S., Hasan, M., Simons, A.J., Brumfitt, S., Green, P.: Sequence labeling to detect stuttering events in read speech. Comput. Speech Lang. 62, 101052 (2020)
3. Bach, N., Huang, F.: Noisy BiLSTM-based models for disfluency detection. Proc. Interspeech 2019, 4230–4234 (2019)
4. Bahou, Y., Maaloul, M., Boughariou, E.: Towards the supervised machine learning and the conceptual segmentation technique in the spontaneous Arabic speech understanding. In: Procedia Computer Science, ACLING2017, Dubai, UAE, pp. 225–232 (2017)
5. Bahou, Y., Masmoudi, A., Belguith, L.H.: Traitement des disfluences dans le cadre de la compréhension automatique de l’oral arabe spontané. TALN’2010 (2010)
6. Boughariou, E., Bahou, Y., Maaloul, M.H.: Application d’une méthode numérique à base d’apprentissage pour la segmentation conceptuelle de l’oral arabe spontané. In: 29th of Proceedings of International Business Information Management Association, Vienna, Austria, pp. 2820–2835, May 2017
7. Boughariou, E., Bahou, Y., Belguith, L.H.: Linguistic resources construction: towards disfluency processing in spontaneous Tunisian dialect speech. In: International Conference on Text, Speech, and Dialogue, pp. 316–328. Springer (2019)
8. Boujelbane, R., Mallek, M., Ellouze, M., Belguith, L.H.: Fine-grained POS tagging of spoken Tunisian dialect corpora. In: International Conference on Applications of Natural Language to Data Bases/Information Systems, pp. 59–62. Springer (2014)
9. Bove, R.: A tagged corpus-based study for repeats and self-repairs detection in French transcribed speech. In: International Conference on Text, Speech and Dialogue, pp. 269–276. Springer (2008)
10. Cho, E., Ha, T.L., Waibel, A.: CRF-based disfluency detection using semantic features for German to English spoken language translation. IWSLT, Heidelberg, Germany (2013)
11. Dutrey, C., Clavel, C., Rosset, S., Vasilescu, I., Adda-Decker, M.: A CRF-based approach to automatic disfluency detection in a French call-centre corpus. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)
12. Georgila, K.: Using integer linear programming for detecting speech disfluencies. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pp. 109–112. Association for Computational Linguistics (2009)
13. Georgila, K., Wang, N., Gratch, J.: Cross-domain speech disfluency detection. In: Proceedings of the 11th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 237–240. Association for Computational Linguistics (2010)
14. Germesin, S., Becker, T., Poller, P.: Hybrid multi-step disfluency detection. In: International Workshop on Machine Learning for Multimodal Interaction, pp. 185–195. Springer (2008)
15. Godfrey, J.J., Holliman, E.C., McDaniel, J.: Switchboard: telephone speech corpus for research and development. In: [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. 517–520. IEEE (1992)
16. Graja, M., Jaoua, M., Belguith, L.H.: Statistical framework with knowledge base integration for robust speech understanding of the Tunisian dialect. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(12), 2311–2321 (2015)
17. Honnibal, M., Johnson, M.: Joint incremental disfluency detection and dependency parsing. Trans. Assoc. Comput. Linguist. 2, 131–142 (2014)
18. Huet, S., Gravier, G., Sébillot, P.: Morphosyntactic resources for automatic speech recognition (2008)
19. Johnson, M., Charniak, E., Lease, M.: An improved model for recognizing disfluencies in conversational speech. In: Proceedings of Rich Transcription Workshop (2004)
20. Labiadh, M., Bahou, Y., Maaloul, M.H.: Complex disfluencies processing in spontaneous Arabic speech. In: Language Processing and Knowledge Management International Conference (2018)
21. Lou, P.J., Anderson, P., Johnson, M.: Disfluency detection using auto-correlational neural networks. arXiv preprint arXiv:1808.09092 (2018)
22. Lou, P.J., Johnson, M.: Disfluency detection using a noisy channel model and a deep neural language model. arXiv preprint arXiv:1808.09091 (2018)
23. Lu, Y., Gales, M., Knill, K., Manakul, P., Wang, Y.: Disfluency detection for spoken learner English. In: Proceedings SLaTE 2019: 8th ISCA Workshop on Speech and Language Technology in Education, pp. 74–78 (2019)
24. Masmoudi, A., Bougares, F., Khmekhem, M.E., Estève, Y., Belguith, L.H.: Automatic speech recognition system for Tunisian dialect. Lang. Resour. Eval. 52(1), 249–267 (2018)
25. Mdhaffar, S., Bougares, F., Estève, Y., Belguith, L.H.: Sentiment analysis of Tunisian dialects: linguistic resources and experiments. In: Proceedings of the Third Arabic Natural Language Processing Workshop, WANLP, pp. 55–61 (2017)
26. Neifar, W., Bahou, Y., Graja, M., Jaoua, M.: Implementation of a symbolic method for the Tunisian dialect understanding. In: Proceedings of 5th International Conference on Arabic Language Processing, Oujda, Maroc, November 2014
27. Nivre, J., Scholz, M.: Deterministic dependency parsing of English text. In: Proceedings of the 20th International Conference on Computational Linguistics, p. 64. Association for Computational Linguistics (2004)
28. Ostendorf, M., Hahn, S.: A sequential repetition model for improved disfluency detection. In: INTERSPEECH, pp. 2624–2628 (2013)
29. Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learning. In: Natural Language Processing Using Very Large Corpora, pp. 157–176. Springer (1999)
30. Rasooli, M.S., Tetreault, J.: Joint parsing and disfluency detection in linear time. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 124–129 (2013)
31. Shieber, S.M., Schabes, Y.: Synchronous tree-adjoining grammars. In: Proceedings of the 13th Conference on Computational Linguistics-Volume 3, pp. 253–258. Association for Computational Linguistics (1990)
32. Shriberg, E.E.: Preliminaries to a theory of speech disfluencies. Ph.D. thesis, University of California, Berkeley (1994)
33. Wang, F., Chen, W., Yang, Z., Dong, Q., Xu, S., Xu, B.: Semi-supervised disfluency detection. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 3529–3538 (2018)
34. Yoshikawa, M., Shindo, H., Matsumoto, Y.: Joint transition-based dependency parsing and disfluency detection for automatic speech recognition texts. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1036–1041 (2016)
35. Zayats, V., Ostendorf, M., Hajishirzi, H.: Disfluency detection using a bidirectional LSTM. arXiv preprint arXiv:1604.03209 (2016)
36. Zayats, V., Ostendorf, M., Hajishirzi, H.: Multi-domain disfluency and repair detection. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)
37. Zribi, I., Kammoun, I., Khemekhem, M.E., Belguith, L.H., Blache, P.: Sentence boundary detection for transcribed Tunisian Arabic. In: Bochumer Linguistische Arbeitsberichte, pp. 223–231 (2016)

A Comprehensive Methodology for Evaluating Conversation-Based Interfaces to Relational Databases (C-BIRDs)

Majdi Owda1(B), Amani Yousef Owda2, and Fathi Gasir3

1 Department of Computing and Mathematics, Manchester Metropolitan University, Chester Street, Manchester M1 5GD, UK [email protected]
2 Department of Electrical and Electronic Engineering, The University of Manchester, Sackville Street, Manchester M13 9PL, UK [email protected]
3 Computer Science Department, Faculty of Information Technology, Misurata University, Misurata, Libya [email protected]

Abstract. Evaluation can be defined as a process of determining the significance of a research output. This is usually done by devising a well-structured study of this output, using one or more evaluation measures, in which a careful inspection is performed. This paper presents a review of evaluation techniques for Conversational Agents (CAs) and Natural Language Interfaces to Databases (NLIDBs). It then introduces a customized evaluation methodology developed for Conversation-Based Interfaces to Relational Databases (C-BIRDs). The evaluation methodology is divided into two groups of measures. The first group consists of two quantitative measures: task success and dialogue length. The second group consists of a number of qualitative measures: prototype ease of use, naturalness of system responses, positive/negative emotion, appearance, text on screen, organization of information, and error message clarity. The methodology is then elaborated with a discussion and recommendations on the sample size, the experimental setup and the scaling, in order to provide a comprehensive evaluation methodology for C-BIRDs. In conclusion, the evaluation methodology created is a better way of identifying the strengths and weaknesses of C-BIRDs than single-measure evaluations.

Keywords: Natural language interfaces to databases · Conversational agents · Evaluation methodologies and evaluation measures

© Springer Nature Switzerland AG 2021 K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 196–208, 2021.

1 Introduction

There is an increasing demand for querying relational databases through natural language instead of relying on experts in Structured Query Language (SQL). This requires developing Natural Language Interfaces to Relational Databases (NLI-RDBs), Conversation-Based Interfaces to Relational Databases (C-BIRDs) or Dialogue Systems Interfaces to Relational Databases. The main challenge that developers of natural language interfaces continue to face is how to map or convert natural language questions into SQL queries automatically. Several NLI-RDB architectures, solutions and prototypes have been proposed for developing such systems, which can be grouped into five main architectures: Pattern-Matching Architecture, Intermediate Representation Language Architecture, Syntax-Based Architecture, Semantic Grammar-Based Architecture and Intelligent Agents-Based Architecture.

The selection of proper evaluation measures is important in order to carry out a proper investigation of the usefulness of proposed NLI-RDB solutions or prototype systems. It is best to categorise these measures and critically assess them in order to be able to formulate an appropriate evaluation strategy. The following two sections review measures used to evaluate Conversational Agents (CAs) and measures used to evaluate Natural Language Interfaces to Databases (NLIDBs). In this review, the measures are categorised into two groups:

• Qualitative measures, which are measures connected with answering the question “How well did we do?”
• Quantitative measures, which are measures connected with answering the question “How much did we do?”

2 Evaluation Measures for Conversational Agents (CAs)

A number of qualitative and quantitative measures have been used to evaluate CAs, dialogue systems, and embodied CAs. The following two sub-sections review these measures.

2.1 Qualitative Measures

• Ease of Use: This measure is concerned mainly with describing the system or the task in terms of how easy it is to utilise. Litman et al. and Walker et al. [1, 2] used this measure within a set of measures to evaluate spoken dialogue systems. This measure has also been used within a metric to evaluate embodied CAs [3].
• Evaluating Coverage: This measure deals with how much of the domain the system is able to cover, and has been used by Allen et al. and López-Cózar et al. [4, 5].
• Naturalness: This measure is about how natural the behaviour of the agent is while dealing with humans [6].
• Positive Emotional Attributes: This measure contains attributes such as friendliness, fun and enjoyment [7–10].
• Negative Emotional Attributes: This measure contains attributes such as boredom [7, 11].


M. Owda et al.

• Future Use of the System. This measure deals with whether the user would use the system in the future or not. Some metrics involve a comparison between use of a dialogue system service and use of a human service [2, 11].
• Understanding the Agent. This measure is concerned with the ability of the user to understand the agent. It has been used as part of the evaluation metrics of Litman and Pan, Walker et al. and López-Cózar et al. [1, 2, 5].
• Meeting Expectations. This measure is concerned with user expectations and whether the system met these expectations [1, 2].
• Coping with Errors. This measure is used to evaluate user satisfaction with the way the CA handles errors and misunderstandings [10].
• General Software Requirement Measures. There are a number of software features that could be measured, such as how fast the software is, how much space it uses, compatibility with operating systems, hardware compatibility, and memory usage [10].
• Other Qualitative Measures. The following qualitative measures are also used to evaluate CAs: dialogue quality, strategies to recover from errors, and strategies to correct or direct the user interaction [5, 12].
2.2 Quantitative Measures
• Task Success. This measure can be applied at the utterance level and at the task level. It can also be adopted as an evaluation strategy by itself, since it can be used to measure the system accuracy. This measure has been used with other measures as an evaluation metric [2, 4, 13–15].
• Dialogue Length. This category measures attributes such as the average length of the conversations and the average count of dialogue turns within conversations. Usually these attributes alone do not show a contribution to knowledge, unless they are used in conjunction with other qualitative attributes such as correct responses, polite responses, and insulting responses [2, 16].
• Counts of Errors. This measure deals with attributes such as error counts, correctness, and percentage error rates [17, 18].
• Counts of Correct Actions. This measure focuses on counting the correct actions taken by the CA [19].
• Other Quantitative Measures. The following quantitative measures are also used to evaluate CAs: count of help messages, system response time, percentage of words correctly understood, percentage of sentences correctly analysed, percentage of words outside the dictionary, and percentage of times the system succeeds in solving a problem [5].

A Comprehensive Methodology for Evaluating (C-BIRDs)


3 Evaluation Measures for Natural Language Interfaces to Databases (NLIDBs)
A number of qualitative and quantitative measures have been used to evaluate natural language interfaces to databases. The following two sub-sections review these measures.
3.1 Qualitative Measures
• Comparative Studies with Other NLIDBs. This measure is used to compare two natural language interfaces to databases. In particular, it compares general software requirements, since the two interfaces are, most of the time, otherwise incomparable because they aim at achieving different capabilities [20, 21].
• Coverage and Learnability. Coverage measures how many of the queries about the domain of the interface the system is able to answer. Learnability is how well new users are able to identify coverage limitations and operate within them while carrying out a task [20, 21].
• Ease of Use. This measure deals mainly with describing how easy the system is to use.
• Portability Evaluation. This measure is mainly about how much complexity is involved in modifying the system in order to port it to another domain [22].
• General Software Criteria. This measure assesses attributes such as an NLIDB's speed, size, portability, modifiability, installation, and maintenance [21, 37–40].
3.2 Quantitative Measures
• Task Success. This measure can be applied at the task level in NLIDBs. A task is successful when a natural language query is successfully mapped into an SQL statement. This measure has been used both together with other measures in an evaluation metric and as an evaluation method on its own [22–24].
• Precision, Recall and Willingness. The precision and recall measures have been used to evaluate the efficiency of a number of NLIDBs [25–28]. The two measures can be calculated with the following formulas: precision (A) and recall (B).

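\[ \text{precision} = \frac{\#\text{correctly answered}}{\#\text{correctly answered} + \#\text{incorrectly answered}} \quad (A) \]

\[ \text{recall} = \frac{\#\text{correctly answered}}{\#\text{queries}} \quad (B) \]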





Mapping a natural language query can fall into one of three categories: 1) the query is correctly answered; 2) the query is incorrectly answered; and 3) the query is unanswered, so that: number of queries = number of correctly answered + number of incorrectly answered + number of unanswered. Minock added a further measure, called willingness (C) [28]:

\[ \text{willingness} = \frac{\#\text{correctly answered} + \#\text{incorrectly answered}}{\#\text{queries}} \quad (C) \]
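The three formulas (A), (B) and (C) can be illustrated with a minimal Python sketch. The class and function names and the example counts are assumptions made for illustration only; they do not come from the reviewed systems. The sketch also assumes that a ratio with a zero denominator is reported as 0.0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryOutcomes:
    """Counts of the three possible outcomes of mapping a query (hypothetical container)."""
    correct: int
    incorrect: int
    unanswered: int

    @property
    def total(self) -> int:
        # number of queries = correct + incorrect + unanswered
        return self.correct + self.incorrect + self.unanswered


def precision(o: QueryOutcomes) -> float:
    # Formula (A): correct answers over all answered queries.
    answered = o.correct + o.incorrect
    return o.correct / answered if answered else 0.0


def recall(o: QueryOutcomes) -> float:
    # Formula (B): correct answers over all queries.
    return o.correct / o.total if o.total else 0.0


def willingness(o: QueryOutcomes) -> float:
    # Formula (C): answered queries (correct or not) over all queries.
    return (o.correct + o.incorrect) / o.total if o.total else 0.0


# Illustrative counts only, not results from any study.
outcomes = QueryOutcomes(correct=70, incorrect=10, unanswered=20)
print(precision(outcomes), recall(outcomes), willingness(outcomes))
```

With these definitions, recall equals precision multiplied by willingness, which is one way to see why willingness adds information: a system can reach high precision simply by declining to answer many queries.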


4 The Correlation Between CAs Measures and NLIDBs Measures
Figure 1 shows the correlation between CAs evaluation measures and NLIDBs evaluation measures. The figure highlights that there are a number of common evaluation measures between the two areas, namely: task success, ease of use, coverage and general software requirements. This correlation analysis helps in deciding which evaluation measures to use to build an evaluation methodology for a more comprehensive evaluation approach for Conversation-Based Interfaces to Relational Databases (C-BIRDs).

[Figure 1: CAs measures and NLIDBs measures, with the measures common to both]
CAs measures only: Naturalness; Coping with Errors; Dialogue Length; Meeting Expectations; Understand the Agent; Future Use of the System; Positive/Negative Emotional Attributes
Common to both: Task Success; Ease of Use; General Software Requirements; Coverage
NLIDBs measures only: Portability; Comparative Studies with other NLIDBs; Learnability; Precision, Recall & Willingness

Fig. 1. The correlation between CAs evaluation measures and NLIDBs measures

5 Experimental Design for Evaluating Conversation-Based Interface to Relational Databases (C-BIRDs)
Choosing appropriate measures to evaluate Conversation-Based Interface to Relational Databases (C-BIRDs) prototypes is a crucial step in the evaluation process, because it determines the data that will be collected. Therefore, the choice of measures affects the validity and reliability of the evaluation. Researching the different measures used to evaluate CAs and NLIDBs is useful, since the C-BIRD framework is a special kind of NLIDB which uses CAs. The correlation presented in Fig. 1 suggests that there are a number of common measures used for both NLIDBs and CAs, namely: task success, ease of use, coverage and general software requirements. These common measures can be counted as possible evaluation measures to be used in an evaluation methodology for conversation-based interfaces to relational databases. A number of factors have been kept in mind while choosing the evaluation measures for conversation-based interfaces to relational databases:
• The evaluation measures should at least measure the performance of the core functionality of a C-BIRD, that is, its capability to map natural language queries into SQL queries.
• The evaluation strategy should be specific in order to keep evaluators focused on the task. General evaluations tend to fail because different users have different goals in mind; for example, users who want to get a good quality result may evaluate the agent's accuracy and the completeness of the agent's help, whereas users who want to get a quick reference may evaluate completion time and efficiency of the operation [29].
• The user evaluation should include usability measures such as naturalness of the system, ease of use, etc.
The devised evaluation strategy is based on quantitative measures and qualitative measures, as described in the following sections.

6 Quantitative Measures
The following measures are suggested for evaluating C-BIRDs:
• Task Success. This measure was used in evaluating both CAs and NLIDBs. It is suitable for evaluating C-BIRD prototypes, since it can measure the prototype's success in generating SQL queries in response to users' queries, thereby measuring task success in answering those queries. This measure can be captured at the task level by asking the user after each task to confirm whether the task was successful or not. Log file analysis can also show whether the task was successful, by observing the generation of an SQL query for a specific scenario, and can help in finding the reason for failure if the prototype failed to answer the user's question.
• Dialogue Length. The dialogue length measurement can be used to evaluate the prototype's ability to lead users towards their goals with a minimum number of turns (called dialogue efficiency). This measure can be used in conjunction with task success in order to calculate the prototype's efficiency for each task. The following equation shows how the dialogue efficiency could be calculated:


\[ \text{Dialogue Efficiency} = \frac{\#\text{ of successful tasks with minimum } \#\text{ of turns}}{\text{Total number of successful tasks}} \]

That is, dialogue efficiency equals the number of successful tasks completed with the minimum number of turns divided by the total number of successful tasks.
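The dialogue efficiency equation can be sketched in Python as follows. The sketch assumes that each task has a known minimum number of turns (for example, the number of steps in its scripted scenario) and that turn counts have already been extracted from the log files; the `TaskRun` record and the example values are hypothetical.

```python
from typing import List, NamedTuple


class TaskRun(NamedTuple):
    """One participant's attempt at one task (hypothetical record layout)."""
    task_id: str
    successful: bool
    turns: int          # turns counted in the log file for this attempt
    minimum_turns: int  # assumed known minimum for the task


def dialogue_efficiency(runs: List[TaskRun]) -> float:
    """Successful runs completed in the minimum number of turns,
    divided by all successful runs (0.0 if no run succeeded)."""
    successful = [r for r in runs if r.successful]
    if not successful:
        return 0.0
    minimal = sum(1 for r in successful if r.turns <= r.minimum_turns)
    return minimal / len(successful)


# Illustrative data only.
runs = [
    TaskRun("task 1", True, 2, 2),   # successful, minimum turns
    TaskRun("task 1", True, 4, 2),   # successful, extra turns
    TaskRun("task 2", True, 3, 3),   # successful, minimum turns
    TaskRun("task 2", False, 6, 3),  # failed, ignored by the measure
]
print(dialogue_efficiency(runs))
```

Failed runs are excluded from both numerator and denominator, which keeps the measure separate from task success itself.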

7 Qualitative Measures
Table 1 shows the qualitative measures used to evaluate the developed prototype. These measures are captured through a user questionnaire completed after using the C-BIRD.

Table 1. The qualitative measures to be used to evaluate a C-BIRD.
Measure | Description
System ease of use | Measures user satisfaction with the prototype system simplicity, friendliness, flexibility, and effortlessness
 | Measures user satisfaction with the language used
Positive/negative emotions | Measures user emotions such as happiness, sadness etc.
 | Measures user satisfaction with the look of the system (e.g. colours)
Text on screen | Measures user satisfaction with the text type and size
Organisation of information | Measures user satisfaction with the distribution of information in the interface
Error message clarity | Measures user satisfaction with error messages such as the clarity of understanding the messages

8 The Sample
Experiments that involve humans need time, effort and resources. The group size was chosen carefully to balance a statistically significant evaluation for the quantitative measures against a sufficient number of evaluators for the qualitative evaluation. For the qualitative evaluation, Nielsen's research results [30–32] show that there is no need to use more than 15 evaluators to evaluate a prototype system interface. Nielsen stated that after the fifth user, the researcher will waste time observing the same findings repeatedly without learning much that is new. Figure 2, produced by Nielsen, shows that testing requires at least 15 users to discover all the usability problems in the design of a prototype. This paper suggests that this model can be used to decide the number of participants to be used as a sample for evaluating C-BIRDs.

Fig. 2. The number of test users needed for a usability evaluation

In addition, the recruitment of experts to evaluate C-BIRDs is very difficult and expensive. The best case scenario for choosing the participants is bringing participants from different backgrounds. The ideal experts would be managers of large corporations who want to directly access information held in databases without having knowledge of SQL. The alternative approach is to bring a set of people who do not know SQL and who also do not know the structure of the relational database to evaluate C-BIRDs.
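The shape of the curve referred to above can be reproduced with the model of Nielsen and Landauer [32], in which the expected proportion of usability problems found by n test users is 1 − (1 − λ)^n, where λ is the probability that a single user reveals a given problem. The sketch below is illustrative only: the default λ = 0.31 is the average value commonly quoted for this model and is an assumption here, since the value varies between projects.

```python
def proportion_of_problems_found(n_users: int, detection_rate: float = 0.31) -> float:
    """Expected share of usability problems found by n_users test users.

    detection_rate is the assumed probability that one user reveals a
    given problem (0.31 is a commonly quoted average, not a constant).
    """
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    return 1.0 - (1.0 - detection_rate) ** n_users


for n in (1, 5, 10, 15):
    print(n, round(proportion_of_problems_found(n), 3))
```

Under this assumed detection rate, five users already find roughly 84% of the problems and fifteen users find more than 99%, which matches the diminishing returns described in the text.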

9 Experimental Procedure
Each participant is given an evaluation sheet that includes tasks (i.e. the task success measure) and a questionnaire to be completed at the end of the evaluation (after finishing the task success jobs). The participants can also be given a handout that includes an explanation of the domain relational database, the relational database entity relationships, the relational database data dictionary and the prototype interface description. C-BIRD prototypes are designed for people who are unaware of the relational database structure and do not have knowledge of SQL; therefore, setting specific tasks for them to do is important, as otherwise they would not know what to ask for. Setting the scene can help at the beginning of the evaluation, but the user may soon forget the aims of the evaluation and start testing out of curiosity. Cognitive walkthrough is an evaluation method used to test specific measures through a task-based approach [33–36]. The user can be asked to perform a sequence of steps in order to accomplish a task. Using the cognitive walkthrough approach to evaluate C-BIRD prototypes involves the following steps:
• For each task, the user sets a goal to be accomplished with the C-BIRD prototype.
• The user performs a series of actions in order to achieve the goal.
• The user evaluates the C-BIRD's progress towards achieving the goal (i.e. generating an SQL query that answers their question).
Using a cognitive walkthrough approach to evaluate C-BIRDs has a number of advantages:



• It gives a clear and simple evaluation.
• It is concise and specific, in that it evaluates very specific features.
• It asks for the information in a structured, highly visible way, which is clearly linked to the goals and objectives of the developed C-BIRD prototype.
At the end of the evaluation the participants have to answer a questionnaire about the usability of the prototype. Table 2 shows sample tasks used to evaluate a C-BIRD prototype and what each was used to evaluate specifically.

Table 2. Sample tasks used to evaluate C-BIRDs prototypes

Task 1. The goal: chatting with the system in order to retrieve all customers in the database.
Step 1: Ask the system about what it can tell about the customers.
Step 2: Ask the system to list you all customers in the database.
Measures the prototype's success in generating an SQL statement through a step-by-step approach.

Task 2. The goal: chatting with the system in order to retrieve all customers who belong to the following account name: Business World.
Step 1: Ask the system about what it can tell about the customers.
Step 2: Follow the agent’s guidance in order to list the names of customers who belong to the following account: Business World.
Measures the prototype's success in generating an SQL statement through a step-by-step approach.

10 Scaling
Scaling the measures is a very important step in the evaluation, since it affects the significance of the results and therefore the statistical significance of the evaluation. For the quantitative evaluation, task success is ascertained by the user answering either Yes (if the task was successful) or No (if the task was unsuccessful). This results in each task having its own percentage of task success. Therefore, task success can be calculated at the task level and also at the approach level, by grouping the tasks for each approach and producing the percentage of successful tasks. The second quantitative measure is the conversation length, based on counting the number of conversation turns for each task. This is done through the analysis of the log files created for each task. Table 3 below shows the scaling used for the qualitative measures. The user has four options to choose from:

Table 3. The qualitative measures scales
System is easy to use | Difficult to use | Easy to use
Naturalness of system responses | Difficult to understand | Easy to understand
Positive/Negative emotions
Text on screen | Difficult to read | Easy to read
Organisation of information
Error messages
• The prototype met the measure; for example, for the first measure, “System is easy to use”, the user chooses easy to use.
• The prototype met the measure half way; for example, for the first measure, “System is easy to use”, the user chooses moderate.
• The prototype did not meet the measure; for example, for the first measure, “System is easy to use”, the user chooses difficult to use.
• The user might feel that the measure is not applicable and choose N/A.
The scales shown in Table 3 are simple for users to understand, and the resulting data are easy to collect and analyse after the evaluation. The analysis can either discuss the most significant value among the four options or discuss all of them if needed.
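The scaling described in this section can be sketched in Python as a simple tally. The option labels, task names and answers below are hypothetical and serve only to illustrate how task-level and approach-level success percentages and the distribution over the four questionnaire options could be computed.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical labels for the four options described above.
SCALE_OPTIONS = ("met", "moderate", "not met", "N/A")


def success_percentage(answers: List[str]) -> float:
    """Percentage of "yes" answers among yes/no task confirmations."""
    if not answers:
        return 0.0
    yes = sum(1 for a in answers if a.strip().lower() == "yes")
    return 100.0 * yes / len(answers)


def scale_distribution(responses: List[str]) -> Dict[str, float]:
    """Percentage of participants choosing each of the four options."""
    counts = Counter(responses)
    total = len(responses)
    return {o: (100.0 * counts[o] / total if total else 0.0) for o in SCALE_OPTIONS}


# Illustrative answers from four participants, not study data.
task_answers = {
    "task 1": ["yes", "yes", "no", "yes"],
    "task 2": ["yes", "no", "no", "yes"],
}
per_task = {task: success_percentage(a) for task, a in task_answers.items()}
# Approach level: pool the answers of all tasks that belong to one approach.
approach_level = success_percentage(
    [a for answers in task_answers.values() for a in answers]
)
ease_of_use = scale_distribution(["met", "met", "moderate", "N/A"])
print(per_task, approach_level, ease_of_use)
```

Reporting the full distribution rather than a single average keeps the N/A answers visible, which suits a four-option scale that is not strictly ordinal.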

11 Collating the Results Data
Collating the results data can be done in two phases. The first phase collects the questionnaire data, which include:
• The user inputs for the tasks; this includes the user's ranking of task success, given by writing or ticking yes or no, and the user's comments on each task.
• The user's ranking of the prototype's usability on each measure, located at the end of the evaluation of all the tasks.
The second phase of collating the results data is collecting data from the log files. This data can be:
• User inputs on a specific task in both the static approach tasks and the dynamic approach tasks. This data is analysed when the prototype fails to answer a user query.
• Task-specific log files, used to find the number of turns per task for the dialogue length measure.

12 Conclusion
This paper presented a comprehensive evaluation methodology for C-BIRDs, developed by reviewing the literature of both Conversational Agents (CAs) and Natural Language Interfaces to Databases (NLIDBs). The methodology took into account the features used in both areas and also examined the correlation between the evaluation features of CAs and NLIDBs. The set of features to use in the methodology was then introduced. In addition, the paper discussed the sample size, experimental procedure, scaling and collating of the results, thereby providing a comprehensive overview of how a researcher can conduct an evaluation of C-BIRDs. The comprehensive evaluation introduced in this paper uses both quantitative evaluation measures, which include the task success measure and the dialogue length measure, and qualitative evaluation measures, which include all the measures used to evaluate the usability of a C-BIRD prototype.

References 1. Litman, D., Pan, S.: Designing and evaluating an adaptive spoken dialogue system. User Model. User-Adapted Interact. 12(2–3), 111–137 (2002) 2. Walker, M., Hirschman, L., Aberdeen, J.: Evaluation for DARPA communicator spoken dialogue systems. In: Proceedings Second International Conference on Language Resources and Evaluation (2000) 3. Sanders, G., Scholtz, J.: Measurement and evaluation of embodied conversational agents. In: Embodied Conversational Agents, pp. 346–373. MIT Press (2000) 4. Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., Stent, A.: Toward conversational human-computer interaction. Am. Assoc. Artif. Intell. 22(4), 27–37 (2001) 5. López-Cózar, R., Callejas, Z., Espejo, G., Griol, D.: Enhancement of conversational agents by means of multimodal interaction. In: Perez-Marin, D., Pascual-Nieto, I. (eds.) Conversational Agents and Natural Language Interaction: Techniques and Effective Practices, pp. 223–252 (2011) 6. Hung, V., Elvir, M., Gonzalez, A., DeMara, R.: Towards a method for evaluating naturalness in conversational dialog systems. In: Proceedings of the 2009 IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, USA, pp. 1236–1241. IEEE Press (2009) 7. Lamel, L., Bennacef, S., Gauvain, J.L., Dartigues, H., Temem, J.N.: User evaluation of the MASK kiosk. Speech Commun. 38(1), 131–139 (2002) 8. Cassell, J., Bickmore, T.: Negotiated collusion: modeling social language and its relationship effects in intelligent agents. User Model. User-Adapted Interact. 13(1–2), 89–132 (2003) 9. Semeraro, G., Andersen, H.H., Andersen, V., Lops, P., Abbattista, F.: Evaluation and validation of a conversational agent embodied in a bookstore. In: Proceedings of the User Interfaces for all 7th International Conference on Universal Access: Theoretical Perspectives, Practice, and Experience, Paris, France, pp. 360–371. Springer (2003) 10. 
Bernsen, N.O., Dybkjær, L.: User interview-based progress evaluation of two successive conversational agent prototypes. In: INTETAIN, pp. 220–224. Springer (2005)



11. Bouwman, G., Hulstijn, J.: Dialogue strategy redesign with reliability measures. In: Proceedings of First International Conference on Language Resources and Evaluation, pp. 191–198 (1998) 12. Foster, M.E., Giuliani, M., Knoll, A.: Comparing objective and subjective measures of usability in a human-robot dialogue system. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 879–887. Association for Computational Linguistics (2009) 13. Bigot, L., Jamet, E., Rouet, J.-F.: Searching information with a natural language dialogue system: a comparison of spoken vs. written modalities. Appl. Ergon. 35(6), 557–564 (2004) 14. Artstein, R., Gandhe, S., Gerten, J., Leuski, A., Traum, D.: Semi-formal evaluation of conversational characters. In: Orna, G., Michael, K., Shmuel, K., Shuly, W. (eds.) Languages: From Formal to Natural, pp. 22–35. Springer (2009) 15. Silvervarg, A., Jönsson, A.: Subjective and objective evaluation of conversational agents in learning environments for young teenagers. In: The Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain. AAAI Press/International Joint Conferences on Artificial Intelligence (2011) 16. Kopp, S., Gesellensetter, L., Kramer, N., Wachsmuth, I.: A conversational agent as museum guide: design and evaluation of a real-world application. In: Lecture Notes in Computer Science, pp. 329–343. Springer (2005) 17. McKevitt, P., Partridge, D., Wilks, Y.: Why machines should analyse intention in natural language dialogue. Int. J. Hum.-Comput. Stud. 51(5), 947–989 (1999) 18. Bickmore, T., Giorgino, T.: Health dialog systems for patients and consumers. J. Biomed. Inform. 39(5), 556–571 (2006) 19. Yuan, X., Chee, Y.S.: Design and evaluation of Elva: an embodied tour guide in an interactive virtual art gallery: research Articles. Comput. Animat. 
Virtual Worlds 16(2), 109–119 (2005) 20. Palmer, M., Finin, S.T.: Workshop on the evaluation of natural language processing systems. Comput. Linguist. 16, 175–181 (1990) 21. Forsmark, M.: Evaluating Natural Language Access to Relational Databases. UMEA University, Computing Science, Sweden (2005) 22. Jung, H., Lee, G.G.: Multilingual question answering with high portability on relational databases. In: Proceedings of the 2002 Conference on Multilingual Summarization and Question Answering - Volume 19, pp. 1–8. Association for Computational Linguistics (2002) 23. Popescu, A.-M., Armanasu, A., Etzioni, O., Ko, D., Yates, A.: Modern natural language interfaces to databases: composing statistical parsing with semantic tractability. In: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland. Association for Computational Linguistics (2004) 24. Sharma, H., Kumar, N., Jha, G.K., Sharma, K.G., Wyld, D.C., Wozniak, M., Chaki, N., Meghanathan, N., Nagamalai, D.: A natural language interface based on machine learning approach. In: Communications in Computer and Information Science, vol. 197 Trends in Network and Communications, pp. 549–557. Springer, Heidelberg (2011) 25. Tang, L., Mooney, R.: Automated construction of database interfaces: integrating statistical and relational learning for semantic parsing. In: Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13, Hong Kong, pp. 133–141. Association for Computational Linguistics (2000) 26. Yates, A., Etzioni, O., Weld, D.: A reliable natural language interface to household appliances. In: Proceedings of the 8th International Conference on Intelligent User Interfaces, Miami, Florida, USA, pp. 189–196. ACM (2003)



27. Minock, M.: A phrasal approach to natural language interfaces over databases. In: Lecture Notes in Computer Science, Volume 3513, 2005 Natural Language Processing and Information Systems, pp. 333–336. Springer, Heidelberg (2005) 28. Minock, M.: C-Phrase: a system for building robust natural language interfaces to databases. Data Knowl. Eng. 69(3), 290–302 (2010) 29. Xiao, J., Stasko, J., Catrambone, R.: Embodied conversational agents as a UI paradigm: a framework for evaluation. In: Proceedings of AAMAS 2002 workshop: Embodied Conversational Agents Let’s Specify and Evaluate Them!, Bologna, Italy (2002) 30. Molich, R., Nielsen, J.: Improving a human-computer dialogue. Commun. ACM 33(3), 338– 348 (1990) 31. Nielsen, J., Molich, R.: Heuristic evaluation of user interfaces. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Empowering People, Seattle, Washington, United States, pp. 249–256. ACM (1990) 32. Nielsen, J., Landauer, T.: A mathematical model of the finding of usability problems. In: Proceedings of the INTERACT 1993 and CHI 1993 Conference on Human Factors in Computing Systems, Amsterdam, The Netherlands, pp. 206–213. ACM (1993) 33. Blackmon, M.H., Polson, P.G., Kitajima, M., Lewis, C.: Cognitive walkthrough for the web. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing Our World, Changing Ourselves, Minneapolis, Minnesota, USA, pp. 463–470. ACM (2002) 34. Blackmon, M.H., Kitajima, M., Polson, P.G.: Repairing usability problems identified by the cognitive walkthrough for the web. In: Proceedings of the SIGCHI conference on Human factors in computing systems, Ft. Lauderdale, Florida, USA, pp. 497–504. ACM (2003) 35. Gabrielli, S., Mirabella, V., Kimani, S., Catarci, T.: Supporting cognitive walkthrough with video data: a mobile learning evaluation study. In: Proceedings of the 7th International Conference on Human Computer Interaction with Mobile Devices & Services, Salzburg, Austria, pp. 
77–82. ACM (2005) 36. Mahatody, T., Sagar, M., Kolski, C.: State of the art on the cognitive walkthrough method, its variants and evolutions. Int. J. Hum. Comput. Interact. 26(8), 741–785 (2010) 37. Baik, C., Jagadish, H.V., Li, Y.: Bridging the semantic gap with SQL query logs in natural language interfaces to databases. In: 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, Macao, pp. 374–385 (2019) 38. Owda, M., Bandar, Z., Crockett, K.: Information extraction for SQL query generation in the conversation-based interfaces to relational databases (C-BIRD). In: Agent and Multi-Agent Systems: Technologies and Applications, pp. 44–53. Springer, Heidelberg (2011) 39. Yuan, C., Ryan, P., Ta, C., et al.: Criteria2Query: a natural language interface to clinical databases for cohort definition. J. Am. Med. Inform. Assoc. 26(4), 294–305 (2019) 40. Xu, B.: NADAQ: natural language database querying based on deep learning. IEEE Access 7, 35012–35017 (2019)

Disease Normalization with Graph Embeddings
D. Pujary^1,2, C. Thorne^2(B), and W. Aziz^1
1 University of Amsterdam, Amsterdam, The Netherlands
2 Elsevier, Amsterdam, The Netherlands

Abstract. The detection and normalization of diseases in biomedical texts are key biomedical natural language processing tasks. Disease names need not only be identified, but also normalized or linked to clinical taxonomies describing diseases such as MeSH®. In this paper we describe deep learning methods that tackle both tasks. We train and test our methods on the known NCBI disease benchmark corpus. We propose to represent disease names by leveraging MeSH®’s graphical structure together with the lexical information available in the taxonomy using graph embeddings. We also show that combining neural named entity recognition models with our graph-based entity linking methods via multitask learning leads to improved disease recognition in the NCBI corpus.

Keywords: Disease named entity normalization · biLSTM-CRF Models · Graph embeddings · Multi-task learning



© Springer Nature Switzerland AG 2021. K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 209–217, 2021.

In biomedical search applications, e.g. clinical search applications, it is of key importance to query not simply by keywords but rather by concept, viz., resolve names into their synonyms and return all matching documents. This is particularly true regarding diseases [4]. When we issue a query about “ovarian cancer”, we expect the system to return hits mentioning standard synonyms such as “ovarian neoplasms”. To achieve this it is necessary not only to perform named entity recognition (NER) to detect all entities mentioned in a document, but also to disambiguate them against databases of canonical names and synonyms, a task known as entity normalization or linking (EL). Detecting and normalizing disease mentions are challenging tasks due to linguistic variability (e.g. abbreviations, morphological and orthographic variation, word order) [16,24]. Interest in these tasks led to the creation of disease knowledge bases and annotated corpora. The Medical Subject Headings (MeSH®) taxonomy [18] is a repository of clinically-relevant terms covering (among others) a large range of standardized disease names, where standard names and known


D. Pujary et al.

(Top) Identification of APC2, a homologue of the [adenomatous polyposis coli tumour]D011125 suppressor.
(Bottom)
MeSH Heading: Adenomatous Polyposis Coli
Scope Note: A polyposis syndrome due to an autosomal dominant mutation of the APC genes (GENES, APC) on CHROMOSOME 5. (. . . )
Tree Number(s): C04.557.470.035.215.100 (. . . )
Entry Term(s): Polyposis Syndrome, Familial (. . . )

Fig. 1. (Top) An NCBI corpus excerpt: title of PubMed® abstract 10021369. In italics, the mention, in red the MeSH® identifier. (Bottom) Snapshot of MeSH® entry for disease concept D011125. The “Scope Note” is a short (10–20 tokens) definition of the concept.

synonyms are organised hierarchically. The NCBI disease corpus [5] is a collection of PubMed® abstracts with disease mentions manually resolved to MeSH® or OMIM® concepts.

Annotated resources allow the training of supervised learning algorithms. The first such system was [16], which used a learning-to-rank approach. More recently, systems based on deep learning have been proposed [2,6,17]. The state of the art for NER employs bidirectional recurrent neural networks to parameterize a conditional random field [15]. Normalization is typically based on learning a similarity metric between disease mentions and their standard names in MeSH®, where mentions and names are represented by a composition of word embeddings [2,17]. This typically requires first identifying potential candidate diseases, for example by exploiting handcrafted rules [17] or bidirectional recurrent neural networks [7], while [6] additionally leverages a coherence model that enforces a form of joint disambiguation. Crucially, previous work has consistently ignored the hierarchical structure of MeSH®, turning the taxonomy into a flat list of concepts.

In this paper, we study an alternative approach to disease normalization. In particular, we acknowledge the hierarchical structure of MeSH® and employ graph encoders, both supervised and unsupervised, to learn representations of disease names (i.e. MeSH® nodes). To make node representations easier to correlate to textual input, we propose to also exploit lexical information available in scope notes. Finally, given the similarities between NER and EL, we explore multitask learning [MTL; 1] via a shared encoder. MTL has been traditionally applied in biomedical NLP to solve NER over several related corpora covering different but related entities [25], but has seldom been used to learn related tasks over the same corpus and entities. Our findings suggest that using lexicalized node embeddings improves EL performance, while MTL allows knowledge to be transferred from (graph-embedding) EL to NER, with performance topping comparable approaches for the NCBI corpus [9,25].

Disease Normalization with Graph Embeddings



Datasets and Methods

Knowledge Base. MeSH is a biomedical controlled vocabulary produced by the National Library of Medicine [18]. It covers a large number of health concepts, including 10,923 disease concepts, and is used for indexing, cataloging and searching biomedical and health-related documents and information. There are MeSH records of several types, divided further into several categories. In this paper we consider only category C Descriptors, in other words diseases. MeSH disease nodes are arrayed hierarchically (in the form of a directed acyclic graph or tree) from most generic to most specific in up to thirteen hierarchical levels, and have many attributes, such as MeSH Heading, Tree Number(s), Unique ID, Scope Note and Entry Term(s). See Fig. 1 (bottom)¹.

Supervision. We use the NCBI disease corpus [5] for training and validating our methods. It consists of 792 PubMed abstracts² separated into training, validation and test subsets (see Table 1). The corpus is annotated with disease mentions mapped to either MeSH or OMIM [10] (whose identifiers we convert to MeSH identifiers if possible using CTD [3]). Fig. 1 (top) illustrates an annotated instance in the corpus.

NER. For NER we use variants of the biLSTM-CRF models of [15] and [19]. First, we encode words in an abstract using either a fixed and pre-trained contextualized encoder, i.e. bioELMo [11], or a concatenation of a fixed and pre-trained biomedical embedding, i.e. SkipGram [22], and a trainable character-level encoder (e.g. LSTM [15] or CNN [19]). We then re-encode every token with a trainable biLSTM whose outputs are linearly transformed to the potentials of a linear-chain CRF [14]. All trainable parameters are optimized to maximize the likelihood of observations (i.e. IOB-annotated spans). At prediction time, Viterbi decoding finds the optimum IOB sequence in time linear in the sequence length.

EL with node2vec Embeddings. We want to capitalize on the structure of the disease branch of MeSH. Thus, for every disease node, we learn an embedding with node2vec [8,21]. The algorithm starts from randomly initialized node embeddings, which are then optimized to be predictive of neighbourhood in the MeSH graph (for this we ignore the directionality inherent to relations in MeSH). We use the node2vec framework [8] to generate two types of representations. For type-I, we use node2vec as is, that is, with randomly initialized node embeddings. For type-II, we leverage lexical information by initializing a node’s embedding with the bioELMo encoding of that node’s scope note (we use average pooling). The idea behind type-II embeddings is to combine lexical features in the scope note and structural relations in MeSH.

Our EL model is a logistic regression classifier parameterized by a bilinear product between the embedding emb(y) of a node y and the contextualized encoding enc(x, m) of a mention m in an abstract x, i.e. $p(y \mid x, m, \theta) \propto \exp(\mathrm{enc}(x, m)^\top W \, \mathrm{emb}(y))$, where $\theta = \{W\}$ are

¹ Full MeSH example:
² We drop one training abstract because it is repeated: abstract 8528200.


D. Pujary et al.

Table 1. NCBI disease corpus statistics.

Split       Abstracts  Total entities  Unique entities  Unique concept IDs  Tokens
Training    592        5,134           1,691            657                 136,088
Validation  100        787             363              173                 23,969
Test        100        960             424              201                 24,497

Fig. 2. Visualization of the MTL architecture.

trainable parameters. The node embedding is either a type-I or type-II embedding, and the mention embedding is the average of the bioELMo contextualized representations of the tokens in the disease mention. Unlike previous work, we normalize $p(y \mid x, m, \theta)$ against the set of all known entities in MeSH, that is, without a pre-selection of candidates. The model is optimized to maximize the likelihood of observations: triples (m, x, y) of mentions in an abstract resolved to MeSH nodes.

Table 2. Results on the test set and validation set of the NER experiments. We use the models of [15] and [19] as baselines, and compare with Habibi et al.’s [9] reimplementation of [15]. For bioELMo we use the same architecture common to both baselines. Results reported are the mean and standard deviation of 5 runs. *Our implementation. **Result taken from original paper.

Encoder              Pre            Rec            F1             Val. F1
Lample et al.*       0.824 ± 0.022  0.742 ± 0.019  0.781 ± 0.003  0.805 ± 0.007
Ma and Hovy*         0.823 ± 0.011  0.776 ± 0.023  0.799 ± 0.012  0.792 ± 0.005
bioELMo*             0.878 ± 0.003  0.856 ± 0.005  0.867 ± 0.002  0.884 ± 0.001
Lample et al.** [9]                                0.875**

EL with GCN Encoders. Graph convolution networks [GCN; 13] generalize CNNs to general graphs. A GCN is an encoder where node representations are updated in parallel by transforming messages from adjacent nodes in a graph structure. A message is a parameterized (possibly non-linear) transformation of the representation of the neighbour. Stacking L GCN layers allows information to flow from nodes as far as L hops away in the graph structure. In particular, we follow [20], where the encoding $h_v^{(\ell)}$ of a node v in the taxonomy is computed as $h_v^{(\ell)} = \sigma\big(\sum_{u \in N(v)} W^{(\ell)} h_u^{(\ell-1)} + b^{(\ell)}\big)$, where $\ell$ indexes the layer, $N(v)$ are the neighbors of v in MeSH (we again discard directionality), $b^{(\ell)}$ and $W^{(\ell)}$ are the parameters of the $\ell$th GCN layer, $\sigma$ is a nonlinearity (we use ReLU), and $h_v^{(0)} = \mathrm{emb}(v)$ is the initial embedding of the disease node v. For the 0th representation of nodes, we use the bioELMo encoding of a node’s scope note. As a trainable encoder, a GCN layer is an integral part of the classifier and its parameters are updated by backpropagation from the downstream objective. Once again, our objective is to maximize the likelihood of observations under a probabilistic classifier, i.e. $p(y \mid x, m, \theta) \propto \exp(\mathrm{enc}(x, m)^\top \mathrm{gcn}(y; \theta))$, where $\mathrm{gcn}(y; \theta) = h_y^{(L)}$ and $\theta = \{(W^{(\ell)}, b^{(\ell)})\}_{\ell=1}^{L}$ are the trainable parameters. Note that, unlike node2vec embeddings, GCN encoders are trained directly on EL supervision, which is arguably of limited availability.

Multitask Learning. Traditionally, the problem of disease normalization is tackled by first identifying the disease names (NER) and then normalizing them (EL). We attempt to learn from both types of supervision by having a NER and an EL model share parts of their architectures. This is known as multitask learning [1]. In particular, we share the encoder of the NER architecture (see Fig. 2) and derive mention features for the EL model from there.
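The GCN update and the classifier from the "EL with GCN Encoders" paragraph can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the three-node graph, the node names (`root`, `a`, `b`), the two-dimensional vectors and the identity weights are hypothetical toy values, the same weights are reused for every layer, and no self-loop is added because the text defines N(v) as the neighbours of v only.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, x) for x in v]

def gcn_layer(H, neighbours, W, b):
    """One layer: h_v <- ReLU(sum over u in N(v) of W h_u, plus b)."""
    out = {}
    for v in H:
        agg = list(b)
        for u in neighbours[v]:
            msg = matvec(W, H[u])          # message from neighbour u
            agg = [a + m for a, m in zip(agg, msg)]
        out[v] = relu(agg)
    return out

def el_distribution(mention, node_repr):
    """p(y | x, m) as a softmax over ALL nodes of enc(x, m) . gcn(y)."""
    scores = {y: sum(a * h for a, h in zip(mention, vec))
              for y, vec in node_repr.items()}
    top = max(scores.values())             # subtract max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

# Hypothetical toy taxonomy; edges are stored in both directions
# because directionality is discarded.
neighbours = {"root": ["a", "b"], "a": ["root"], "b": ["root"]}
H0 = {"root": [1.0, 0.0], "a": [0.0, 1.0], "b": [1.0, 1.0]}  # stand-in for scope-note encodings
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]

H1 = gcn_layer(H0, neighbours, W, b)
probs = el_distribution([0.0, 1.0], H1)    # stand-in for a mention encoding
```

In this sketch the softmax is computed over every node, mirroring the choice in the text to normalize without a candidate pre-selection step; training would additionally backpropagate the log-likelihood of the gold node into `W` and `b`.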


Experiments and Results

Hyperparameters. We train our models using ADAM [12] with an initial learning rate of $10^{-3}$. We split our abstracts into sentences using NLTK and process sentences independently. We use 200-dimensional word2vec embeddings and 1,024-dimensional bioELMo encodings. For NER models, we learn 60-dimensional character embeddings. We train MeSH node embeddings using node2vec for 100 epochs, whereas GCN-based EL models are trained for 500 epochs. All models employ 1,024-dimensional node embeddings. We stack L = 2 GCN layers with 2,048 hidden and 1,024 output units. We use 0.5 dropout regularization [23]⁴.

Metrics and Results. We report precision (Pre), recall (Rec) and (micro-averaged) F1 scores (F1) for NER. For EL it is customary to test whether the target canonical entity (a MeSH concept identifier) occurs among the top k predictions [6]. Hence we report precision at confidence rank k (Pre@k) and mean reciprocal rank (MRR), defined as $\mathrm{MRR} = \frac{1}{|E|} \sum_{i=1}^{|E|} \frac{1}{\mathrm{rank}_i}$.

⁴ The code of our experiments is available at:
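The two EL metrics defined under "Metrics and Results" can be made concrete with a short sketch. It assumes that each mention comes with a dictionary of model scores over candidate concepts and one gold concept; the function names and the example scores are hypothetical and only illustrate the definitions.

```python
def rank_of_gold(scores, gold):
    """1-based rank of the gold concept when concepts are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(gold) + 1

def mean_reciprocal_rank(ranks):
    """MRR = (1 / |E|) * sum_i (1 / rank_i)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def precision_at_k(ranks, k):
    """Fraction of mentions whose gold concept is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical scores for one mention over three concepts.
example_rank = rank_of_gold({"c1": 0.1, "c2": 0.7, "c3": 0.2}, gold="c3")

# Hypothetical gold ranks for three mentions.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)
p_at_2 = precision_at_k(ranks, k=2)
```

With one gold concept per mention, Pre@k reduces to the share of mentions whose gold concept appears in the top k, which is how it is computed here.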


Table 3. Results on the test set and validation set of our EL models for different types of MeSH encoding. Results reported are the mean and standard deviation of 5 runs. We consider unlexicalized node2vec as our baseline. *Result taken from original paper. **We cannot compare fully due to different methodology. “S.N.” refers to Scope Note.

Disease emb.    (. . .)        (. . .)        (. . .)        Pre@k          Val. MRR
bioELMo (S.N.)  0.748 ± 0.002  0.715 ± 0.004  0.715 ± 0.002  0.844 ± 0.004  0.791 ± 0.001
node2vec I      0.749 ± 0.002  0.718 ± 0.004  0.720 ± 0.004  0.819 ± 0.006  0.800 ± 0.003
node2vec II     0.757 ± 0.001  0.721 ± 0.004  0.724 ± 0.001  0.842 ± 0.004  0.804 ± 0.006
GCN             0.744 ± 0.006  0.710 ± 0.008  0.710 ± 0.007  0.831 ± 0.005  0.803 ± 0.007
DNorm* [16]
NormCo* [6]

We use early stopping for model selection based on performance on the validation set, and report the average result of 5 independent runs on the test and validation sets. Table 2 shows the results for NER. By replacing the word and character embeddings altogether with bioELMo embeddings, the results improve by a large margin. Increasing the number of parameters in the models did not affect model performance. Furthermore, we improve on [9]’s re-implementation of Lample et al.’s biLSTM-CRF model on the NCBI corpus (+0.028 F1 score points).

We tested our EL baseline (node2vec type-I, which encodes only the MeSH structural information) against disease embeddings generated by averaging the bioELMo embedding of Scope Notes (see Table 3), and found it to perform better except on Pre@k. Node lexicalization (node2vec type-II) yields results on a par with or better than the bioELMo baseline. Using GCN for training, we do not achieve any improvement. This suggests that both structural and lexical information in MeSH are important for normalization, and that taxonomy encoding is best achieved when computed independently from the EL task. On the other hand, this strategy seems to underperform other normalization methods reported in the literature [6,16]. Our systems, however, do not incorporate many of their optimizations, such as normalization to disease synonym rather than MeSH identifier, abbreviation resolution, re-ranking and filtering, or coherence models.

Finally, Table 4 shows that using MTL results in an improved NER score (+0.009 F1 score points w.r.t. our best NER model in Table 2). This indicates a certain degree of transfer from the EL to the NER task. It also yields an improvement w.r.t. similar NER and MTL approaches for the NCBI corpus [9,25] (+0.028 points for NER and +0.015 for MTL).

Error Analysis. Regarding NER, most errors arise when dealing with multi-token entities: e.g., for sporadic T-PLL only the head T-PLL is detected. On the other hand, the bioELMo model is able to identify abbreviated disease names much better than the two baselines: e.g., T-PLL occurring alone is always correctly detected. This might be because bioELMo starts at character level and gradually captures token- and phrase-level information.

Table 4. Results on the test set of our MTL experiment, where we report precision, recall and F1-score for the NER task, and MRR and precision for EL. Results reported are the mean and standard deviation of 3 runs. *Result taken from original paper.

Model       Pre             Rec             F1              MRR            Pre@k
(. . .)     0.880 ± 0.003   0.872 ± 0.008   0.876 ± 0.003   0.747 ± 0.003  0.816 ± 0.006
(. . .)     0.875 ± 0.006   0.869 ± 0.001   0.872 ± 0.003   0.745 ± 0.001  0.816 ± 0.001
Wang* [25]  0.857 ± 0.009*  0.864 ± 0.004*  0.861 ± 0.003*

Our best EL model makes errors by confusing diseases with their MeSH ancestors and neighbors: e.g., it confuses D016399 (Lymphoma, T-Cell) with D015458 (Leukemia, T-Cell), with which it shares an ancestor: D008232 (Lymphoproliferative Disorders). Often, it correctly resolves the first instance of a concept but afterwards returns a neighbour instead: e.g., in “Occasional missense mutations in ATM were also found in tumour DNA from patients with B-cell non-Hodgkins lymphomas (B-NHL) and a B-NHL cell line”, B-cell non-Hodgkins lymphomas is correctly resolved to D016393, but B-NHL (its abbreviation) is mapped to D008228 (its child) and the second occurrence to D018239 (another form of cancer). This might be due to the following facts. Both NCBI and MeSH are not particularly large resources, hence the mention and disease encodings learnt by the model are not sufficiently discriminative. In addition, GCN and node2vec training tends to ignore the direction of MeSH taxonomical edges. Finally, DNorm and NormCo normalize diseases to concepts indirectly via their synonyms, allowing them to better exploit lexical features⁵.



In this paper, we address the problem of normalizing disease names to MeSH concepts (canonical identifiers) by adapting state-of-the-art neural graph embeddings (GCN and node2vec) that exploit both MeSH’s hierarchical structure and the description of diseases. Our graph-based disease node encoding is the first of its kind, to the best of our knowledge. We also apply multitask learning to transfer information between the disease detection (NER) and resolution (EL) components of the task, leveraging their common signals to improve on the single models. We observe that bioELMo embeddings lead to substantial improvement in NER performance. We demonstrate that node lexicalization does improve over either pure structural or lexical embeddings, and that MTL gives rise to state-of-the-art performance for NER in the corpus used (NCBI corpus). On the other hand, we do not currently outperform other disease normalization approaches [6,16], and often confuse neighbour MeSH concepts with their true targets. In the future we intend to experiment with the optimizations reported by [6,16] (especially resolution to synonym rather than concept identifier, and re-ranking). MeSH is also a comparatively small graph with short disease descriptions. As further work we plan to enrich it by linking it to larger scale resources (e.g., Wikipedia).

⁵ They also include a wide range of optimizations such as re-ranking, coherence models or abbreviation resolution.

References

1. Caruana, R.: Multitask learning. Mach. Learn. 28(1), 41–75 (1997)
2. Cho, H., Choi, W., Lee, H.: A method for named entity normalization in biomedical articles: application to diseases and plants. BMC Bioinform. 18(1), 451 (2017)
3. Davis, A.P., Grondin, C.J., Johnson, R.J., Sciaky, D., King, B.L., McMorran, R., Wiegers, J., Wiegers, T.C., Mattingly, C.J.: The comparative toxicogenomics database: update 2019. Nucleic Acids Res. 47(D1), D948–D954 (2018)
4. Doğan, R.I., Lu, Z.: An inference method for disease name normalization. Information retrieval and knowledge discovery in biomedical text. In: AAAI Fall Symposium (2012)
5. Doğan, R.I., Leaman, R., Lu, Z.: NCBI disease corpus. J. Biomed. Inform. 47(C), 1–10 (2014)
6. Mehta, R., Wright, D., Katsis, Y., Hsu, C.-N.: NormCo: deep disease normalization for biomedical knowledge base construction. In: Proceedings of AKBC 2019 (2019)
7. Greenberg, N., Bansal, T., Verga, P., McCallum, A.: Marginal likelihood training of BiLSTM-CRF for biomedical named entity recognition from disjoint label sets. In: Proceedings of EMNLP 2018 (2018)
8. Grover, A., Leskovec, J.: Node2Vec: scalable feature learning for networks. In: Proceedings of KDD 2016 (2016)
9. Habibi, M., Weber, L., Neves, M., Wiegandt, D.L., Leser, U.: Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 33(14), i37–i48 (2017)
10. Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A., McKusick, V.A.: Online Mendelian Inheritance in Man (OMIM), a knowledge base of human genes and genetic disorders. Nucleic Acids Res. 33(suppl 1), D514–D517 (2005)
11. Jin, Q., Dhingra, B., Cohen, W.W., Lu, X.: Probing biomedical embeddings from language models. CoRR, abs/1904.02181 (2019)
12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of ICLR (2014)
13. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907 (2016)
14. Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of ICML 2001 (2001)
15. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. CoRR, abs/1603.01360 (2016)
16. Leaman, R., Islamaj Doğan, R., Lu, Z.: DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 29(22), 2909–2917 (2013)
17. Li, H., Chen, Q., Tang, B., Wang, X., Hua, X., Wang, B., Huang, D.: CNN-based ranking for biomedical entity normalization. BMC Bioinform. 18(11), 385 (2017)
18. Lipscomb, C.E.: Medical subject headings (MeSH). Bull. Med. Libr. Assoc. 88(3), 265–266 (2000)
19. Ma, X., Hovy, E.: End-to-end sequence labeling via Bi-directional LSTM-CNNs-CRF. CoRR, abs/1603.01354 (2016)
20. Marcheggiani, D., Titov, I.: Encoding sentences with graph convolutional networks for semantic role labeling. CoRR, abs/1703.04826 (2017)
21. Perozzi, B., Al-Rfou, R., Skiena, S.: DeepWalk: online learning of social representations. In: Proceedings of KDD 2014 (2014)
22. Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., Ananiadou, S.: Distributional semantics resources for biomedical text processing. In: Proceedings of LBM 2013 (2013)
23. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
24. Thorne, C., Klinger, R.: On the semantic similarity of disease mentions in PubMed and Twitter. In: Proceedings of NLDB 2018 (2018)
25. Wang, X., Zhang, Y., Ren, X., Zhang, Y., Zitnik, M., Shang, J., Langlotz, C., Han, J.: Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics 35(10), 1745–1752 (2018)

Quranic Topic Modelling Using Paragraph Vectors

Menwa Alshammeri¹,²(✉), Eric Atwell¹, and Mhd Ammar Alsalka¹

¹ University of Leeds, Leeds LS2 9JT, UK
{scmhka,e.s.atwell,m.a.alsalka}
² Jouf University, Sakakah, Saudi Arabia

Abstract. The Quran is known for its linguistic and spiritual value. It comprises knowledge and topics that govern different aspects of people’s lives. Acquiring and encoding this knowledge is not a trivial task due to the overlapping of meanings across its documents and passages. Analysing a text like the Quran requires learning approaches that go beyond the word level to achieve sentence-level representation. Thus, in this work, we follow a deep learning approach, the paragraph vector, to learn an informative representation of Quranic verses. We use a recent breakthrough in embeddings that maps the passages of the Quran to vector representations that preserve more semantic and syntactic information. These vectors can be used as inputs for machine learning models, and leveraged for topic analysis. Moreover, we evaluated the derived clusters of related verses against a tagged corpus, to add more significance to our conclusions. Using the paragraph vector model, we managed to generate a document embedding space that models and explains word distribution in the Holy Quran. The dimensions in the space represent the semantic structure in the data and ultimately help to identify main topics and concepts in the text.

Keywords: Holy Quran · Semantic analysis · Distributional representation · Topic modeling · Deep learning · Document embedding · Paragraph vector



The Holy Quran is a significant resource that is very rich in patterns, topics, and information that make up the core of the knowledge of Muslims. Analyzing the Quran requires special skills and a great deal of effort to get a comprehensive understanding of its meanings, gain useful knowledge, and ultimately build a robust resource for religious scholars, educators, and the public to understand and learn the Quran. The richness of the Quranic text and the deep layers of its meaning offer immense potential for further study and experiments. Analyzing the Quranic text is not a trivial task due to the overlapping of its meanings. Thus, extracting the implied connections would require deep semantic analysis and domain knowledge.

© Springer Nature Switzerland AG 2021
K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 218–230, 2021.

Paragraph Vectors with Quran


The Quran has been the subject of many NLP studies. Several studies were related to text mining and topic modeling with the Quran (for a recent survey, see [3]). Previous studies have explored the underlying knowledge of the holy book at different granularities. Scholars and researchers have built applications and tools that exploit such knowledge to allow search in the text. However, they all use different approaches to extract the information needed for their task. Many studies were devoted to topic modeling of the Quran. Most of these works used Latent Dirichlet Allocation (LDA) as the topic modeling algorithm. Nevertheless, topic models do not always produce accurate results, and sometimes their findings are misleading [14].

Computational approaches for representing the knowledge encoded in texts play a central role in obtaining a deeper understanding of the content of natural language texts. A recent trend in machine intelligence is the use of distributed representations for words [7] and documents [8], as these representations work well in practice. Several researchers have developed distributed word representations for Arabic as well [2]. Word embedding is a modern approach for representing text where individual words are represented as real-valued vectors in a predefined vector space. These representations preserve more semantic and syntactic information on words, leading to improved performance in NLP tasks. They offer a richer representation of text that is the base for various machine learning models [20].

Analyzing a text like the Quran requires powerful learning approaches that go beyond the word level to achieve phrase-level or sentence-level representation. Document embedding is a powerful approach that is a direct extension of word embedding. It maps documents to informative vector representations. Paragraph vectors [8] are a recent breakthrough in feature embedding that has been proposed as an unsupervised method for learning distributed representations for sentences and documents. The model is capable of capturing many document semantics in dense vectors that can be used as input to many machine learning algorithms.

In this work, we used the paragraph vector, an unsupervised document embedding model, to learn an informative representation of Quranic verses, thus transforming the text data into features that act as inputs for machine learning models. We utilize the derived features for clustering the verses of the Quran, with the final goal of topic modeling of the Quran. Having a good representation of short text like the verses of the Quran can benefit semantic understanding and the inference of coherent topics, ultimately identifying patterns and details that convey the knowledge of the sacred text.

This paper is organized as follows: Sect. 2 reviews related work. The methodology we follow is described in detail in Sect. 3. Section 4 is devoted to the experiment and results. Lastly, conclusions and directions for the future are formulated in Sect. 5.


2 Related Work

This section describes deep learning methodologies that have achieved state-of-the-art results on challenging NLP problems. It then presents existing works that are related to Quranic semantic analysis and topic modeling.



M. Alshammeri et al.

2.1 Deep Learning of Word Embedding and Document Embedding

We start by discussing the recent discoveries in feature embedding and the latest approaches for learning the dominant representation of texts. These methods are the inspiration for our work. Acquiring semantic knowledge and using it in language understanding and processing has always been an active area of study. Research has resulted in various approaches and techniques related to semantic representation [24]. One well-known but simplistic representation is bag of words (BOW), or bag of n-grams. However, it lacks the capability to capture the semantics and syntactic order of words in the text. Another common technique is Latent Dirichlet Allocation (LDA), which is usually used for topic modeling. However, it is hard to tune, and its results are difficult to evaluate. Probabilistic topic models [5] such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) generate a document embedding space to model and explain word distribution in the corpus, where dimensions can be seen as latent semantic structures hidden in the data [4,11].

In recent years, machine learning, and in particular deep learning with neural networks, has played a central role in corpus-based Natural Language Processing (NLP) [22]. Deep learning models and methods have recently begun to surpass the previous state-of-the-art results on a variety of tasks and domains such as language modeling, translation, speech recognition, and object recognition in images. One of the most noticeable developments in NLP is the use of machine learning to capture the distributional semantics of words, and in particular deep learning of word embeddings, where words are represented as vectors in a continuous space, capturing many syntactic and semantic relations [21]. Word embeddings and document embeddings can be powerful approaches for capturing underlying meanings and relationships within texts, as a step towards presenting the meaningful semantic structure of the text [20].
The use of deep learning word embeddings has led to substantial improvements in semantic textual similarity and relatedness tasks. The impressive impact of these models has motivated researchers to consider richer vector representations for larger pieces of text. Le and Mikolov [8] introduced a transformation of sentences or documents into vectors. It is another breakthrough in embeddings, in that a vector can be used to represent a sentence or document. Document embedding maps sentences/documents to informative vector representations that preserve more semantic and syntactic information. They call these paragraph vectors [8]. Paragraph vectors have been proposed as an unsupervised method for learning distributed representations for pieces of text. The authors demonstrated the capabilities of their paragraph vectors method on several text classification and sentiment analysis tasks. The authors of [9] also examined paragraph vectors in the context of document similarity tasks.

2.2 Topic Modeling for the Quran

The Quran has been the subject of numerous studies due to its significance. Scholars have studied the Quran for its topics. They have drawn out knowledge and patterns that were the base for many applications to allow search in the holy book. This section provides a review of literature related to text mining and probabilistic topic modeling of the Quran.

Many studies were devoted to text mining and topic modeling with the Quran [3]. Such studies aim at extracting accurate, coherent topics from the Quran, which promotes understanding of the text. Latent Dirichlet Allocation (LDA) [5], a statistical method, was adopted in most of the works related to Quranic topic modeling [13–15,19]. However, they were limited to a unigram model, and examined specific chapters and documents of the text [14]. Moreover, most research projects focused on translations of the Quran into different languages instead of the original text [18].

LDA is a generative probabilistic model for collections of documents (text corpora). It is an unsupervised machine learning method for topic modeling that helps discover hidden semantic structures in a text [5]. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words [6]. The authors of [11] introduced a semantic representation method that extracts the core of a collection of documents based on the LDA model with the Gibbs sampling algorithm [23]. They demonstrated that the topic model is useful for semantic representation since it can be used in predicting word association and a variety of other linguistic processing and memory tasks.

In [13], the authors presented a method to discover the thematic structure of the Quran using a probabilistic topic model. They were able to identify two major topics in the Quran, characterized by the distinct themes discussed in the Makki and the Madani chapters. One limitation of their model was its use of a unigram language model. Here, in contrast, we consider phrases or verses of the Quran as the input for the clustering algorithm and the topic analysis. Another work [12] applied LDA to the Quran. Still, the focus was to compare different term weighting schemes and preprocessing strategies for LDA rather than exploring the thematic structure of the document collection. The Quran was used as the testing corpus, while the model was trained using Bible corpora.

A recent work [14] explored a topic modeling technique to set up a framework for semantic search in the holy Quran. They applied LDA to two structures: Hizb quarters and the verses of the Joseph chapter. A similar study [15] used clustering techniques in machine learning to extract topics of the holy Quran. The process was based on the verses of the Quran using nonnegative matrix factorization. However, it was unclear how they linked the keywords of each topic to the associated verses. The authors of [16] proposed a simple WordNet for the English translation of the second chapter of the Quran. They created topic-synonym relations between the words in that chapter with different priorities, and defined different relationships that are used in a traditional WordNet. They developed a semantic search algorithm to fetch all verses that contain the query words and their high-priority synonyms. Another work [17] extracted verses from the Quran using the web ontology language. They also used the English translation of the Quran. One recent work [19] proposed topic modeling of an Indonesian translation of the Quran, generating four main topics that are firmly related to human life. They considered Makki and Madani surahs as the variable for topic modeling categorization. Their results showed that Makki surahs contribute 50% compared to Madani surahs.

Together, these works motivated us to further the progress in this field. The primary goal of this work is to exploit a recent trend in machine intelligence, the distributed representation of text, to learn an informative representation of the passages of the Quran, potentially allowing for the discovery of knowledge-related connections between its documents. Moreover, we aim at revealing hidden patterns that explain the profound relationship between the verses/passages of the sacred text.



We use an unsupervised document embedding technique, the paragraph vector, to learn vector representations of the verses of the Quran, with the aim of revealing significant patterns and inferring coherent topics. The machine learning pipeline is illustrated in Fig. 1.

Fig. 1. The ML Pipeline for the clustering and topic analysis

We used paragraph vectors to create fixed-length vector representations for each verse/sentence in the Quran. Paragraph vectors, or doc2vec, were proposed by Le and Mikolov [8] as a simple extension to word2vec to extend the learning of embeddings from words to word sequences. Doc2vec in Gensim¹, a topic modeling Python library, is used to generate the paragraph vectors. There are two approaches within doc2vec: a distributed bag of words model and a distributed memory model of the paragraph vector. The distributed bag of words model is a simpler model and ignores word order, while the distributed memory model is a more complex model with more parameters. The two techniques are illustrated in Figs. 2 and 3.

¹ Doc2vec paragraph embedding was popularised by Gensim, a widely-used implementation of paragraph vectors:

Paragraph Vectors with Quran


The idea behind the distributed memory model is that word vectors contribute to a prediction task about the next word in the sentence. The model adds a memory vector (the paragraph vector) to the standard language model, with the aim of capturing the topic of the document. The paragraph vector is concatenated or averaged with the local context word vectors to predict the next word.
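As a rough illustration of the input layer just described, the following Python sketch combines a paragraph vector with its context word vectors by averaging or by concatenation. It is a toy under stated assumptions: the function name `pv_dm_input`, the `mode` parameter, and the two-dimensional vectors are hypothetical and are not taken from [8] or from Gensim, and the prediction step (a classifier over the vocabulary) is omitted.

```python
from typing import List

Vector = List[float]


def pv_dm_input(paragraph_vec: Vector, context_vecs: List[Vector], mode: str = "mean") -> Vector:
    """Combine a paragraph vector with context word vectors (hypothetical helper).

    mode="mean"   averages all vectors element-wise (output keeps the same dimension).
    mode="concat" joins them end to end (output length = (1 + n_context) * dim).
    """
    all_vecs = [paragraph_vec] + context_vecs
    if mode == "concat":
        return [x for vec in all_vecs for x in vec]
    if mode == "mean":
        dim = len(paragraph_vec)
        return [sum(vec[j] for vec in all_vecs) / len(all_vecs) for j in range(dim)]
    raise ValueError("mode must be 'mean' or 'concat'")


# Toy values: one paragraph vector and two context word vectors.
paragraph = [1.0, 1.0]
context = [[3.0, 3.0], [5.0, 5.0]]
averaged = pv_dm_input(paragraph, context, mode="mean")
concatenated = pv_dm_input(paragraph, context, mode="concat")
```

Averaging keeps the input size independent of the context window, whereas concatenation preserves the order of the context words at the cost of a larger input.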

Fig. 2. Paragraph Vector: A distributed memory model (PV-DM) [8]

The paragraph vector model can be further simplified by ignoring the context words in the input and instead forcing the model to predict words randomly sampled from the paragraph in the output. At inference time, the parameters of the classifier and the word vectors are not needed, and back-propagation is used to tune the paragraph vectors. This is the distributed bag of words version of the paragraph vector (PV-DBOW). It works in the same way as skip-gram [8], except that a special token representing the document replaces the input. In the experiments of Le and Mikolov [8], PV-DM was consistently better than PV-DBOW. Thus, in our experiment, we use the distributed memory implementation of the paragraph vector. In addition, we follow the recommendations on optimal doc2vec hyper-parameter settings for general-purpose applications given in [10].

Fig. 3. Paragraph Vector: Distributed Bag Of Words (PV-DBOW) [8]

We used Doc2vec as implemented in Gensim to learn vector representations of the Quranic verses. We trained the paragraph vectors on the 6,236 verses/passages of the Quran using the original Arabic text from the Tanzil project. First, we read the verses from a digitized version of the Quran into a data frame. We then preprocessed and cleaned the text using the NLTK library, removing punctuation, Harakat, and stop-words. Figure 4 shows a snapshot of the data before it has been processed for training. Next, to produce the verse embeddings, we used the Python implementation of doc2vec in the Gensim package. We trained the Doc2vec model with different configurations of the hyper-parameters. The model went through multiple rounds of hyper-parameter tuning, and we drew on our domain knowledge to steer it in the right direction. We can now use the trained model to infer a vector for any verse by passing a list of words to the model.infer_vector function. This vector can then be compared with other vectors via cosine similarity.

Fig. 4. A snapshot of the input data
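The preprocessing step described above (removing punctuation, Harakat, and stop-words) can be sketched as follows. This is a minimal illustration that uses only the Python standard library rather than NLTK. The function names, the two-word stop-word list, and the choice to strip every Unicode combining mark (which covers the Harakat) are assumptions made for the example and are not the authors' actual pipeline.

```python
import re
import unicodedata

TATWEEL = "\u0640"                       # Arabic elongation character
PUNCTUATION = re.compile(r"[^\w\s]")     # anything that is not a word character or whitespace
# Hypothetical, deliberately tiny stop-word list: "min" (from) and "fi" (in).
STOP_WORDS = {"\u0645\u0646", "\u0641\u064A"}


def strip_marks(text: str) -> str:
    """Remove combining marks (Unicode category Mn, which includes Harakat) and tatweel."""
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Mn" and ch != TATWEEL
    )


def preprocess(verse: str) -> list:
    """Return the cleaned tokens of one verse."""
    text = strip_marks(verse)            # strip marks first so words are not split
    text = PUNCTUATION.sub(" ", text)    # replace punctuation with spaces
    return [tok for tok in text.split() if tok not in STOP_WORDS]


# "bismi allahi, fi" written with diacritics and an Arabic comma.
sample = "\u0628\u0650\u0633\u0652\u0645\u0650 \u0627\u0644\u0644\u0651\u064E\u0647\u0650\u060C \u0641\u0650\u064A"
tokens = preprocess(sample)
```

The resulting token lists are the kind of input that a doc2vec implementation expects for training and for inferring a vector for a new verse.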


Experiments and Results

After training doc2vec, the model generates document embeddings. The vectors act as features for the Quranic verses. Here, we evaluated the vectors on the task of finding similar verses to examine their effectiveness in capturing the semantics of the verses/passages of the Quran. We inferred the vector for a randomly chosen test document/verse and compared it with the documents in our model. Using intuitive self-evaluation, we were able to locate semantically similar verses and eventually created a dataset of pairs of related verses along with their similarity scores. We settled on a vector size of 50, which produced the best results in terms of the similarity between the verses in each pair. We used the Qurany ontology browser [1] to verify our results. The Qurany corpus is augmented with an ontology or index of key concepts, taken from a recognized expert source. It allows users to search the Quran for abstract concepts via an ontology browser, and it contains a comprehensive hierarchical index or ontology of nearly 1200 concepts in the Quran. Indeed, Doc2vec succeeded in exploring relationships between documents. Examples of our results are illustrated in Fig. 5.

Fig. 5. Pairs of related verses using the paragraph vectors as features of the Quranic verses
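The similarity search behind such verse pairs can be sketched in a few lines. The example below ranks stored vectors by cosine similarity to a query vector. The helper names, the verse identifiers, and the two-dimensional toy vectors are hypothetical; in the setting described above, the vectors would be the 50-dimensional paragraph vectors.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query: List[float], vectors: Dict[str, List[float]], topn: int = 3) -> List[Tuple[str, float]]:
    """Return the topn (verse_id, similarity) pairs, most similar first."""
    scored = [(verse_id, cosine_similarity(query, vec)) for verse_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy embeddings keyed by made-up verse identifiers.
verse_vectors = {
    "verse_A": [1.0, 0.0],
    "verse_B": [0.9, 0.1],
    "verse_C": [0.0, 1.0],
}
ranking = most_similar([1.0, 0.0], verse_vectors, topn=2)
```

Because cosine similarity ignores vector length, two verses of very different lengths can still score as close neighbours if their vectors point in a similar direction.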


Clustering with K-Means Algorithm

Here, we investigate the structure of the data by grouping the data points (verses of the Quran) into distinct subgroups. With clustering, we try to find verses that are similar to each other in terms of topics. The objective is to infer patterns in the data that can inform a decision, or sometimes to convert the problem into a supervised learning problem. The goal of clustering is to group unlabeled texts in such a way that texts in the same group/cluster are more similar to each other than to those in other clusters. With clustering, we seek to capture the topics or themes in our corpus and the way they are shared between the documents (verses) in it. K-Means is one of the most widely used clustering algorithms due to its simplicity. K-Means puts the observations into k clusters, in which each observation belongs to the cluster with the nearest mean. The main idea is to define k centroids, one for each cluster. We apply K-Means clustering to our vectorized documents. We initially set the number of clusters to 15 (the number of main topics in the Qurany corpus), assuming that we have a general sense of the right number of clusters. Another approach would be to find the best number of clusters by trial and error. We tried different values ranging from 2 to 20. We implemented the algorithm in Python with the help of the scikit-learn library. We clustered the verses and visualized the clusters, as in Fig. 6. The list in Fig. 7 shows examples of the verses in an example cluster. Figure 8 shows a snapshot of the clusters along with the individual verses contained in each cluster.

Fig. 6. Visualization of the clustering when number of clusters = 15

To evaluate our clustering, we used a tagged corpus: Qurany. We compared our results against the corpus to verify the relationships between the verses of the Quran, identify how they are related, and address the concepts covered in each cluster. Our findings confirmed that paragraph vector representations offered a useful input representation that improved the clustering performance. Moreover, we use two different metrics to assess how well this clustering works and to measure its performance. First, we consider the inertia metric, which is the within-cluster sum of squared distances to the cluster center. The algorithm aims to choose centroids that minimize the inertia, which can indicate how internally coherent the clusters are. The other metric is the Silhouette Score, which can be used to determine the degree of separation between clusters. The Silhouette Score can take values in the interval [−1, 1]:

– If it is 0, then the sample is very close to the neighboring clusters.
– If it is 1, then the sample is far away from the neighboring clusters.
– If it is −1, then the sample is assigned to the wrong cluster.

As the coefficient approaches 1, it indicates good clustering.

Fig. 7. A list of verses located in cluster 1

After we calculated the inertia and silhouette scores, we plotted them and evaluated the performance of the clustering algorithm. Figure 9 shows the result of the two metrics. The inertia score always drops when we increase the number of clusters. As the silhouette curve in the figure shows, n_clusters = 14 has the best average silhouette score, and all of its clusters score above the average, which suggests that it is a good choice. The value is close to the number of main topics in the Qurany corpus (15), which we find encouraging. Joining the elbow curve with the silhouette score curve provides valuable insight into the performance of K-Means.
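To make the silhouette values listed above concrete, the sketch below computes the per-sample coefficient s = (b − a) / max(a, b), where a is the mean distance from a sample to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. This is a plain-Python illustration on made-up two-dimensional points, not the scikit-learn routine used in the experiments; the function names are hypothetical, and at least two clusters are assumed.

```python
import math
from typing import List, Sequence


def distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_samples(points: List[Sequence[float]], labels: List[int]) -> List[float]:
    """Silhouette coefficient of every sample (requires at least two clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)           # common convention for singleton clusters
            continue
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two well-separated toy groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_samples(toy_points, [0, 0, 1, 1])   # sensible assignment
bad = silhouette_samples(toy_points, [0, 1, 0, 1])    # deliberately mixed-up assignment
```

With the sensible assignment every score is close to 1, while the mixed-up assignment yields negative scores, matching the interpretation of the three cases in the list.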



Fig. 8. The derived clusters

Fig. 9. Evaluation of the clustering performance using the Elbow and Silhouette score curves
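The elbow curve is obtained by running K-Means for several values of k and recording the inertia each time. The sketch below shows the idea with a small plain-Python version of Lloyd's algorithm. It is an illustration under stated assumptions rather than the scikit-learn implementation used in the experiments: the function names are hypothetical, the data are made-up two-dimensional points, and the centroids are initialised from evenly spaced data points so that the run is deterministic (scikit-learn uses a different, randomised initialisation by default).

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points: List[Point], k: int, max_iter: int = 100):
    """Lloyd's algorithm; returns (labels, centroids, inertia)."""
    n = len(points)
    # Simplified deterministic initialisation: k evenly spaced data points.
    centroids = [points[i * n // k] for i in range(k)]
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
        if new_labels == labels:
            break                        # assignments are stable: converged
        labels = new_labels
    # Inertia: within-cluster sum of squared distances to the cluster centre.
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, inertia


toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
inertias = {k: kmeans(toy, k)[2] for k in (1, 2, 3)}   # values for an elbow curve
two_labels, two_centroids, two_inertia = kmeans(toy, 2)
```

On this toy data the inertia falls sharply from k = 1 to k = 2 and only slightly afterwards, which is the "elbow" one looks for. Because K-Means can stop in a local optimum, a single run per k is only an approximation of the best achievable inertia.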


Conclusions and Future Directions

This work presented a new vector representation of the Quranic verses at the paragraph level. These vectors can be used as features and leveraged for clustering and topic analysis. We examined the capabilities of paragraph vectors in finding related verses/passages. We were able to locate semantically related verses and created a dataset of pairs of related verses. We used the Qurany ontology browser to verify our results. The Qurany corpus is augmented with an ontology, taken from a recognized expert source and authenticated by experts with domain knowledge. Next, we fed the features to the K-Means clustering algorithm. The derived clusters suggested groups of related verses that share a common central concept.



In the future, we plan to automatically evaluate the derived clusters against a tagged corpus. We will identify a classifier that best fits our data and adequately captures the relations between the data points (verses of the Quran). Ultimately, this should add more significance to our conclusions and allow us to benchmark the derived features of the original Quranic verses.

Acknowledgments. Menwa Alshammeri is supported by a PhD scholarship from the Ministry of Higher Education, Saudi Arabia. The author is grateful for the support from Jouf University for sponsoring her research.

References

1. Abbas, N.H.: Quran ‘search for a concept’ tool and website. Ph.D. diss., University of Leeds (School of Computing) (2009)
2. Soliman, A.B., Eissa, K., El-Beltagy, S.R.: Aravec: a set of Arabic word embedding models for use in Arabic nlp. Proc. Comput. Sci. 117, 256–265 (2017)
3. Alrehaili, S.M., Atwell, E.: Computational ontologies for semantic tagging of the Quran: a survey of past approaches. In: LREC 2014 Proceedings. European Language Resources Association (2014)
4. Lu, Y., Mei, Q., Zhai, C.X.: Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Inf. Retrieval 14(2), 178–203 (2011)
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
6. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2–3), 259–284 (1998)
7. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
8. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
9. Dai, A.M., Olah, C., Le, Q.V.: Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998 (2015)
10. Lau, J.H., Baldwin, T.: An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368 (2016)
11. Griffiths, T.L., Steyvers, M., Tenenbaum, J.B.: Topics in semantic representation. Psychol. Rev. 114(2), 211 (2007)
12. Wilson, A., Chew, P.A.: Term weighting schemes for latent dirichlet allocation. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 465–473. Association for Computational Linguistics (2010)
13. Siddiqui, M.A., Faraz, S.M., Sattar, S.A.: Discovering the thematic structure of the Quran using probabilistic topic model. In: 2013 Taibah University International Conference on Advances in Information Technology for the Holy Quran and its Sciences, pp. 234–239. IEEE (2013)
14. Alhawarat, M.: Extracting topics from the holy Quran using generative models. Int. J. Adv. Comput. Sci. Appl. 6(12), 288–294 (2015)
15. Panju, M.H.: Statistical extraction and visualization of topics in the Quran corpus. Student. Math. Uwaterloo. Ca (2014)
16. Shoaib, M., Yasin, M.N., Hikmat, U.K., Saeed, M.I., Khiyal, M.S.H.: Relational WordNet model for semantic search in Holy Quran. In: 2009 International Conference on Emerging Technologies, pp. 29–34. IEEE (2009)
17. Yauri, A.R., Kadir, R.A., Azman, A., Murad, M.A.: Quranic verse extraction base on concepts using OWL-DL ontology. Res. J. Appl. Sci. Eng. Technol. 6(23), 4492–4498 (2013)
18. Putra, S.J., Mantoro, T., Gunawan, M.N.: Text mining for Indonesian translation of the Quran: a systematic review. In: 2017 International Conference on Computing, Engineering, and Design (ICCED), pp. 1–5. IEEE (2017)
19. Rolliawati, D., Rozas, I.S., Ratodi, M.: Text Mining Approach for Topic Modeling of Corpus Al Qur’an in Indonesian Translation (2018)
20. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT press, Cambridge (2016)
21. Sahlgren, M.: The distributional hypothesis. Ital. J. Disabil. Stud. 20, 33–53 (2008)
22. Goldberg, Y.: Neural network methods for natural language processing. Synth. Lectures Hum. Lang. Technol. 10(1), 1–309 (2017)
23. Carlo, C.M.: Markov chain monte carlo and gibbs sampling. In: Lecture notes for EEB 581 (2004)
24. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)

Language Revitalization: A Benchmark for Akan-to-English Machine Translation

Kingsley Nketia Acheampong1,2(B) and Nathaniel Nii Oku Sackey1

1 School of Information and Software Engineering (SISE), University of Electronic Science and Technology of China (UESTC), Chengdu 610054, Sichuan, China
[email protected], [email protected]
2 Information Intelligence Technology Lab, SISE-UESTC, Chengdu, China

Abstract. Language reconciles the ideas, beliefs and values of people from diverse cultural, social, economic, religious and professional backgrounds, and it is imperative for sustainable development. Undoubtedly, the emergence of Neural Machine Translation has brought significant advances in language translation automation and, consequently, has established seamless reconciliation between two communicating entities with diverse backgrounds. Built on sequence-to-sequence architectures, Neural Machine Translation models have outperformed statistical and rule-based models in terms of concordance and syntax, regardless of how complex the language structure is. However, structured language resources are scarce for low-resource languages like Akan. Research contributions toward the revitalization of these languages are limited, and the languages also face subtle adverse events such as cultural assimilation and language imperialism. In order to address the problem of low-resource machine translation from Akan to English, we mine and use the first parallel corpus for Akan-English translation. We establish a benchmark for Akan-English translation using a deep hierarchical end-to-end attention-based neural machine translation model trained on the parallel corpus. Experimental results further confirm the effectiveness of our approach towards the task. Nevertheless, a confirmed grammatical parity between the two languages bequests improvement approaching a substantial revitalization of the Akan language.

Keywords: Neural Machine Translation · Natural Language Processing · Language revitalization · Akan language



It is quite apparent that multilingualism contributes to knowledge acquisition and dissemination in education, travel, trade and business negotiations [32]. However, the sting of misguided language policies, linguistic prejudice, and forced assimilation threatens linguistic diversity, constituting 7,000+ languages currently spoken and signed. Global trade and networks are monolingually-skewed,

© Springer Nature Switzerland AG 2021
K. Arai et al. (Eds.): IntelliSys 2020, AISC 1251, pp. 231–244, 2021.


K. N. Acheampong and N. N. O. Sackey

Table 1. Typical examples of translations made by Akan language speakers in a survey conducted in the University of Electronic Science and Technology of China (UESTC). English replacements are shown in bold.

Volunteer     English                                           Akan
Volunteer 1   When it comes to TVs, I prefer larger sizes       Sε εba TV mu a, Mepε size kεse
Volunteer 2   I graduated from the University of Ghana          Me graduate firi University of Ghana
Volunteer 3   I know a whole lot about programming languages    Menim programming languages pii
Volunteer 4   Please relax for a while                          Mepa wokyεw, relaxe kakra

and inherently, technological advancements in this regard have resulted in half the global population speaking one of only 13 languages. Also, the resultant obscurations of colonization add to the instability of languages and progressively push more languages towards language death. The Akan language is no exception [14].

Akan, commonly referred to as “Twi,” is the predominant native language of the Akan people of Ghana, spoken by over 80% of the Ghanaian population and in Côte d’Ivoire by 41% of the population. The language was introduced to the Caribbean and South America during the Atlantic slave trade, and it is currently prominent in Suriname, spoken by the Ndyuka, and in Jamaica, spoken by the Jamaican Maroons, also known as the Coromantee. Akan is a well-structured language, with word classes, inflexions, parts of speech and other word relations in the sentence, just as seen in English. It also follows a nuclear predication, having the basic word order Subject-Verb-Object (S-V-O) [31]. Although the language was once rich, the populace has lost its ability to relay its rich vocabulary to younger generations. In order to compensate for missing vocabulary in speech and writing, younger speakers have resorted to replacing unknown words with English equivalents. At present, it is nearly impracticable to write any news in the language without the use of English [3] (see Table 1 for some examples).

However, there is ample room for the revitalization of the Akan language. The language is vibrantly used as a language of choice among Ghanaians on social media, though the constructions made are filled with bad grammar and with English character and word replacements, as mentioned earlier (see Fig. 1). Meanwhile, the vitality of modern languages depends strongly on the digital technologies of this new computing age [2,17]. Furthermore, with the current trends in language modelling and machine translation, languages like Akan can be seamlessly vitalized, and knowledge from one language can be translated for the benefit of the monolingual speakers of the language [26].

Akan-to-English Machine Translation


Machine translation (MT) systems are applications or online services that use machine-learning technologies to translate vast volumes of text from one source language into other supported target languages [6]. The primary aim of MT systems is the complete automation of software systems that can translate source language content into target languages without any human intervention [7]. Moreover, MT system translations are expected not to be trivial, word-for-word substitutions. Instead, these systems are required to evaluate and analyze source content elements and discover how any word affects the others. MT systems are therefore rated just like human translators and are required to perform well in constructing correct syntax and proper grammar [21]. Accordingly, Natural Language Processing (NLP) industries and communities give immense funding to support the ultimate goal of MT systems and to complement it with other serviceable extensions like language detection, text to speech, or dictionaries [12]. Although most MT systems have simple interfaces with “input text” fields and “translate” buttons, several cutting-edge technologies, such as deep learning, big data, linguistics, cloud computing, and Application Programming Interfaces (APIs), underlie the interface and make translations realistic and reliable. A subfield of machine translation, namely neural machine translation, yields state-of-the-art results on various high-resource languages that have parallel datasets, even when the languages are of different language families, e.g. English-Chinese translation tasks [16]. However, to the best of our knowledge, there is no parallel data for the Akan language, and consequently, there is no neural machine translation model or benchmark available between Akan and any other language.

Inspired by the assertions above, this work presents the first results on Akan-English translation using a deep, hierarchical neural machine translation model and sets it as a benchmark for the language pair. It also uses our newly mined parallel data for the development of Akan-English translation models to aid in the revitalization of the Akan language. Analyses of experimental results are performed to evaluate the proposed model, and notable recommendations are given to ensure a continual revitalization of the Akan language.



Generally, there are three main approaches to machine translation. These are:

1. Rule-based Machine Translation
2. Statistical Machine Translation
3. Neural Machine Translation

2.1 Rule-Based Machine Translation

Rule-Based Machine Translation approaches (RBMTs) involve the direct application of morphological and linguistic rules to carry out translation tasks [11]. This approach uses millions of bilingual dictionaries for each language pair to transform a source text into a transitional representation from which the target text is generated. Despite the reliance on extensive sets of rules and lexicons with morphological, syntactic, and semantic information, the resulting dictionary translations and complex linguistic rules are not good enough for quality translations [9,18]. Again, RBMTs rely on their user-defined settings to improve quality. Any user can add a new term to the translation model, which overrides the original system settings [29]. Although RBMT systems may achieve significant translation results, high-quality results may be unattainable due to the diverse nature of language and the human assistance needed to reach this goal. With advancements in statistical models, statistical methods of translation became the norm.

2.2 Statistical Machine Translation

The Statistical Machine Translation (SMT) approach involves the use of the analytical nature of the text to perform translation tasks. By applying advanced statistical models, SMT systems analyze existing human translations and optimize their model parameters accordingly to translate text [1]. Through complex algorithms, SMT systems employ bilingual text content (source and target languages) and monolingual content (target language) to create statistical models, which in turn generate weights to decide on the most probable translation. Although developing statistical translation models is agile, this approach ultimately depends on the diversity of the corpora. Achieving high-quality translations is possible, but at the expense of gathering vast amounts of domain-specific translations [34]. Moreover, with such a data bottleneck, it is virtually impossible to achieve high-quality translations in general language settings. One advantage of SMT systems over RBMTs is their capacity to render good quality when extensive and qualified training data is available [20]. Translations are more fluent than those of RBMTs and satisfy user expectations. Training from suitable data is automated and relatively cheaper in terms of the time and human cost involved. However, SMT translations are mostly inconsistent in sentence-to-similar-sentence comparisons and entail a great deal of effort to manage.

2.3 Neural Machine Translation

Neural machine translation (NMT) is an approach to machine translation that uses neural networks to predict the probability of a sequence of words, typically modelling entire sentences in a single unified model [4]. The use of neural network models to learn a mathematical model for machine translation is the key benefit of the NMT approach. The unified NMT model can be trained directly on the source and target text, no longer requiring the pipeline of specialized systems used in statistical machine translation [16,19].

2.4 Related Works

Lately, several works on translation studies have highlighted the difficulties of working with low-resource languages using NMT and, as such, have proposed many methods. The authors of [16] ascertained that dual learning models using a recurrent neural network (RNN) language model set up in a sequence-to-sequence fashion could improve the performance of NMT. Models trained with monolingual datasets have also gained a great deal of attention in the translation community. The authors of [36] proposed an adaptive attention-based neural network for Mongolian-Chinese translation, which is considered a low-resource language pair. NMT systems learn higher-level concepts for generating a translation. When multi-layered, their neural networks are capable of learning the semantics of the text and generating translations that allow the systems to achieve human-level, fluent and natural-sounding translation output [13,28]. In contrast to English, no work has been conducted on Akan using NMT. Meanwhile, translation between Akan and English has a high potential for revitalizing the Akan language due to the morphological richness of the language. LearnAkan, My Twi Dictionary, Globse, Kasahorow and GhanaWeb are a few systems that attempt to execute Akan-English translations using RBMT techniques.


Model Architecture

Even though our hierarchical model can be adopted in the construction of phrase-based SMT and NMT alike, we concentrate on NMT because of its simplicity as an end-to-end system that does not depend on hand-crafted, human-engineered features.

3.1 Model Composition

Recurrent Neural Networks. Recurrent neural networks (RNNs) are specific types of artificial neural networks that use internode connections to form a directed graph along a temporal sequence in order to capture information about the relationship between words and their positions in any given sentence [27]. A typical RNN allows previous outputs to be used as inputs while maintaining hidden states. The use of its internal state to process sequences of inputs allows it to exhibit dynamic behaviour, describing the time dependence of a point in latent space. Thus, an RNN computation takes into account historical information through the sharing of weights across time [8]. Additionally, RNNs are capable of processing inputs of any length or size while keeping a fixed model size. Despite these capabilities, the computation of vanilla RNNs is slow and expensive, and capturing long-term dependencies is difficult because of the multiplicative gradient, which can increase or decrease exponentially with the number of layers [10,25].

As a language model, an RNN describes the probability distribution of a word $w_i^{(t+1)}$ given a word sequence $w_i^{(1)}, \ldots, w_i^{(t)}$. Using its hidden state $h_i^{(t)}$, an RNN processes sentences in a word-by-word fashion, updating its internal state each time $t$ a word is processed. During encoding, each word is represented as a one-hot vector input $w_i$ and mapped to a lower-dimensional continuous vector representation $s_i$ using an embedding matrix $E$. That is,

$$s_i = E w_i.$$

Given the current word $s_i$, the RNN’s internal state $h_i$ is subsequently updated with a non-linear transformation function $f$. Different variants of RNN, including the vanilla RNN, Long Short-term Memory (LSTM), and Gated Recurrent Unit (GRU), have their respective non-linear transformation functions. However, GRUs and LSTMs deal with the vanishing gradient problem encountered by vanilla RNNs, with LSTM being a generalization of GRU. In this work, we use both the GRU and the LSTM variants in the encoding and decoding layers of our Akan-English translation model in order to establish substantial benchmarks for the translation task at hand. Encoder-Decoder Models. One well-known approach to neural machine translation (NMT) is to predict the probability of a sequence of words. This approach leverages an encoder-decoder framework to model entire sentences in a single neural network that is trained jointly to maximize translation performance [30]. Given a source language sentence $x = \{x_1, x_2, \ldots, x_S\}$ and its corresponding target language sentence $y = \{y_1, y_2, \ldots, y_T\}$, the translation model is formulated: P(y|x; Θ) = (


p(yn |y