Lecture Notes in Networks and Systems 285
Kohei Arai Editor
Intelligent Computing Proceedings of the 2021 Computing Conference, Volume 3
Lecture Notes in Networks and Systems Volume 285
Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas—UNICAMP, São Paulo, Brazil
Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey
Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China
Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus
Imre J. Rudas, Óbuda University, Budapest, Hungary
Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/15179
Editor
Kohei Arai
Faculty of Science and Engineering, Saga University, Saga, Japan
ISSN 2367-3370  ISSN 2367-3389 (electronic)
Lecture Notes in Networks and Systems
ISBN 978-3-030-80128-1  ISBN 978-3-030-80129-8 (eBook)
https://doi.org/10.1007/978-3-030-80129-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
It is a great privilege for us to present the proceedings of the Computing Conference 2021, held virtually on July 15 and 16, 2021. The conference is held every year as an ideal platform for researchers to share views, experiences and information with their peers working all around the world. This is done by offering plenty of networking opportunities to meet and interact with world-leading scientists, engineers and researchers, as well as industrial partners, in all aspects of computer science and its applications.

The main conference brings a strong program of papers, posters and videos, all in single-track sessions, together with invited talks to stimulate significant contemplation and discussion. These talks were also anticipated to pique the interest of the entire computing audience with their thought-provoking claims, and they were streamed live during the conference. Moreover, all authors presented their research papers very professionally, and these were viewed by a large international audience online.

The proceedings for this edition consist of 235 chapters selected out of a total of 638 submissions from 50+ countries. All submissions underwent a double-blind peer-review process. The published proceedings have been divided into three volumes covering a wide range of conference topics, such as technology trends, computing, intelligent systems, machine vision, security, communication, electronics and e-learning, to name a few.

Deep appreciation goes to the keynote speakers for sharing their knowledge and expertise with us, and to all the authors who have spent the time and effort to contribute significantly to this conference. We are also indebted to the organizing committee for their great efforts in ensuring the successful implementation of the conference. In particular, we would like to thank the technical committee for their constructive and enlightening reviews of the manuscripts within the limited timescale. We hope that all the participants and the interested readers benefit scientifically from this book and find it stimulating in the process.
We hope to see you in 2022, at our next Computing Conference, with the same amplitude, focus and determination.

Kohei Arai
Contents
Feature Analysis for Aphasic or Abnormal Language Caused by Injury (Marisol Roldán-Palacios and Aurelio López-López)
Automatic Detection and Segmentation of Liver Tumors in Computed Tomography Images: Methods and Limitations (Odai S. Salman and Ran Klein)
HEMAYAH: A Proposed Mobile Application System for Tracking and Preventing COVID-19 Outbreaks in Saudi Arabia (Reem Alwashmi, Nesreen Alharbi, Aliaa Alabdali, and Arwa Mashat)
Data Security in Health Systems: Case of Cameroon (Igor Godefroy Kouam Kamdem and Marcellin Julius Antonio Nkenlifack)
Predicting Levels of Depression and Anxiety in People with Neurodegenerative Memory Complaints Presenting with Confounding Symptoms (Dalia Attas, Bahman Mirheidari, Daniel Blackburn, Annalena Venneri, Traci Walker, Kirsty Harkness, Markus Reuber, Chris Blackmore, and Heidi Christensen)
Towards Collecting Big Data for Remote Photoplethysmography (Konstantin Kalinin, Yuriy Mironenko, Mikhail Kopeliovich, and Mikhail Petrushan)
Towards Digital Twins Driven Breast Cancer Detection (Safa Meraghni, Khaled Benaggoune, Zeina Al Masry, Labib Sadek Terrissa, Christine Devalland, and Noureddine Zerhouni)
Towards the Localisation of Lesions in Diabetic Retinopathy (Samuel Ofosu Mensah, Bubacarr Bah, and Willie Brink)
Framework for a DLT Based COVID-19 Passport (Sarang Chaudhari, Michael Clear, Philip Bradish, and Hitesh Tewari)
Deep Learning Causal Attributions of Breast Cancer (Daqing Chen, Laureta Hajderanj, Sarah Mallet, Pierre Camenen, Bo Li, Hao Ren, and Erlong Zhao)
Effect of Adaptive Histogram Equalization of Orthopedic Radiograph Images on the Accuracy of Implant Delineation Using Active Contours (Alicja Smolikowska, Paweł Kamiński, Rafał Obuchowicz, and Adam Piórkowski)
Identification Diseases Using Apriori Algorithm on DevOps (Aya Mohamed Morsy and Mostafa Abdel Azim Mostafa)
IoT for Diabetics: A User Perspective (Signe Marie Cleveland and Moutaz Haddara)
Case Study: UML Framework of Obesity Control Health Information System (Majed Alkhusaili and Kalim Qureshi)
An IoT Based Epilepsy Monitoring Model (S. A. McHale and E. Pereira)
Biostatistics in Biomedicine and Informatics (Esther Pearson)
Neural Network Compression Framework for Fast Model Inference (Alexander Kozlov, Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin, and Yury Gorbachev)
Small-World Propensity in Developmental Dyslexia After Visual Training Intervention (Tihomir Taskov and Juliana Dushanova)
An Assessment of Extreme Learning Machine Model for Estimation of Flow Variables in Curved Irrigation Channels (Hossein Bonakdari, Azadeh Gholami, Bahram Gharabaghi, Isa Ebtehaj, and Ali Akbar Akhtari)
Benchmarking of Fully Connected and Convolutional Neural Networks on Face Recognition: Case of Makeup and Occlusions (Stanislav Selitskiy, Nikolaos Christou, and Natalya Selitskaya)
Wavelet Neural Network Model with Time-Frequency Analysis for Accurate Share Prices Prediction (Yaqing Luo)
A Hybrid Method for Training Convolutional Neural Networks (Vasco Lopes and Paulo Fazendeiro)
A Layered Framework for Dealing with Dilemmas in Artificial Intelligence Systems (Jim Q. Chen)
Improving Student Engagement and Active Learning with Embedded Automated Self-assessment Quizzes: Case Study in Computer System Architecture Design (Ryan M. Gibson and Gordon Morison)
Cognitive Load of Ontology as a Means of Information Representation in the Educational Process (Maryna Popova and Rina Novogrudska)
The Influence of Computer Games on High School Students (Adolescents) (Dalibor Peić and Andrija Bernik)
Pedagogical Approaches in Computational Thinking-Integrated STEAM Learning Settings: A Literature Review (Ashok Kumar Veerasamy, Peter Larsson, Mikko-Ville Apiola, Daryl D'Souza, and Mikko-Jussi Laakso)
Development of Anaglyph 3D Functionality for Cost-Effective Virtual Reality Anatomical Education (Alex J. Deakyne, Thomas Valenzuela, and Paul A. Iaizzo)
Augmented Reality in Virtual Classroom for Higher Education During COVID-19 Pandemic (Monica Maiti, M. Priyaadharshini, and B. Vinayaga Sundaram)
The Teaching and Learning of the Untyped Lambda Calculus Through Web-Based e-Learning Tools (Levis Zerpa)
Development and Design of a Library Information System Intended for Automation of Processes in Higher Education Institution (Askar Boranbayev, Ruslan Baidyussenov, and Mikhail Mazhitov)
Informally Teaching Black Youth STEM Concepts Virtually Using Artificial Intelligence and Machine Learning (Darron Lamkin, Robin Ghosh, Tutaleni I. Asino, and Tor A. Kwembe)
Accrediting Artificial Intelligence Programs from the Omani and the International ABET Perspectives (Osama A. Marzouk)
Embodied Learning: Capitalizing on Predictive Processing (Susana Sanchez)
Exploring the Darkness of Gamification: You Want It Darker? (Tobias Nyström)
The Future of Information and Communication Technology Courses (Dalize van Heerden and Leila Goosen)
The Impact of Computer Games on Preschool Children's Cognitive Skills (Dolores Kamenar Čokor and Andrija Bernik)
Using STEM to Improve Computer Programming Teaching (Manuel Rodrigues)
Tips for Using Open Educational Resources to Reduce Textbook Costs in Computing and IT Courses (Richard Halstead-Nussloch and Rebecca Rutherfoord)
Cross-Cultural Design of Simplified Facial Expressions (Meina Tawaki, Keiko Yamamoto, and Ichi Kanaya)
Misinformation and Its Stakeholders in Europe: A Web-Based Analysis (Emmanouil Koulas, Marios Anthopoulos, Sotiria Grammenou, Christos Kaimakamis, Konstantinos Kousaris, Fotini-Rafailia Panavou, Orestis Piskioulis, Syed Iftikhar Hussain Shah, and Vassilios Peristeras)
Artificial Life: Investigations About a Universal Osmotic Paradigm (UOP) (Bianca Tonino-Heiden, Bernhard Heiden, and Volodymyr Alieksieiev)
Latency and Throughput Advantage of Leaf-Enforced Quality of Service in Software-Defined Networking for Large Traffic Flows (Josiah Eleazar T. Regencia and William Emmanuel S. Yu)
The Impact of Online Social Networking (Social Media) on Interpersonal Communication and Relationships (Mui Joo Tang and Eang Teng Chan)
Embedded Piezoelectric Array for Measuring Relative Distributed Forces on Snow Skis (Andres Rico, Carson Smuts, Jason Nawyn, and Kent Larson)
Experimental Study of a Tethered Balloon Using 5G Antenna to Enhance Internet Connectivity (Samirah A. Alhusayni, Shuruq K. Alsuwat, Shahd H. Altalhi, Faris A. Almalki, and Hawwaa S. Alzahrani)
Incremental Dendritic Cell Algorithm for Intrusion Detection in Cyber-Physical Production Systems (Rui Pinto, Gil Gonçalves, Jerker Delsing, and Eduardo Tovar)
Performance Comparison Between Deep Learning-Based and Conventional Cryptographic Distinguishers (Emanuele Bellini and Matteo Rossi)
TFHE Parameter Setup for Effective and Error-Free Neural Network Prediction on Encrypted Data (Jakub Klemsa)
Security in Distributed Ledger Technology: An Analysis of Vulnerabilities and Attack Vectors (Efthimios-Enias Gojka, Niclas Kannengießer, Benjamin Sturm, Jan Bartsch, and Ali Sunyaev)
An Analysis of Twitter Security and Privacy Using Memory Forensics (Ahmad Ghafarian and Darius Fiallo)
A Security Requirement Engineering Case Study: Challenges and Lessons Learned (Souhaïl El Ghazi El Houssaïni, Ilham Maskani, and Jaouad Boutahar)
Towards Automated Surveillance: A Review of Intelligent Video Surveillance (Romas Vijeikis, Vidas Raudonis, and Gintaras Dervinis)
Homicide Prediction Using Sequential Features from Graph Signal Processing (Juan Moreno, Sebastian Quintero, Alvaro Riascos, Luis Gustavo Nonato, and Cristian Sanchez)
Effectiveness and Adoption of NIST Managerial Practices for Cyber Resilience in Italy (Alessandro Annarelli, Serena Clemente, Fabio Nonino, and Giulia Palombi)
Steganographic Method to Data Hiding in RGB Images Using Pixels Redistribution Combining Fractal-Based Coding and DWT (Hector Caballero-Hernandez, Vianney Muñoz-Jimenez, and Marco A. Ramos)
Improving the Imperceptibility of Pixel Value Difference and LSB Substitution Based Steganography Using Modulo Encoding (Ahmad Whafa Azka Al Azkiyai, Ari Moesriami Barmawi, and Bambang Ari Wahyudi)
Securing Mobile Systems GPS and Camera Functions Using TrustZone Framework (Ammar S. Salman and Wenliang (Kevin) Du)
Application of Machine Learning in Cryptanalysis Concerning Algorithms from Symmetric Cryptography (Milena Gjorgjievska Perusheska, Vesna Dimitrova, Aleksandra Popovska-Mitrovikj, and Stefan Andonov)
A Response-Based Cryptography Engine in Distributed-Memory (Christopher Philabaum, Christopher Coffey, Bertrand Cambou, and Michael Gowanlock)
Elliptic Curves of Nearly Prime Order (Daniele Di Tullio and Manoj Gyawali)
Assessing Small Institutions' Cyber Security Awareness Using Human Aspects of Information Security Questionnaire (HAIS-Q) (Glenn Papp and Petter Lovaas)
Construction of Differentially Private Empirical Distributions from a Low-Order Marginals Set Through Solving Linear Equations with l2 Regularization (Evercita C. Eugenio and Fang Liu)
Secondary Use Prevention in Large-Scale Data Lakes (Shizra Sultan and Christian D. Jensen)
Cybercrime in the Context of COVID-19 (Mohamed Chawki)
Beware of Unknown Areas to Notify Adversaries: Detecting Dynamic Binary Instrumentation Runtimes with Low-Level Memory Scanning (Federico Palmaro and Luisa Franchina)
Tamper Sensitive Ternary ReRAM-Based PUFs (Bertrand Cambou and Ying-Chen Chen)
Locating the Perpetrator: Industry Perspectives of Cellebrite Education and Roles of GIS Data in Cybersecurity and Digital Forensics (Denise Dragos and Suzanna Schmeelk)
Authentication Mechanisms and Classification: A Literature Survey (Ivaylo Chenchev, Adelina Aleksieva-Petrova, and Milen Petrov)
ARCSECURE: Centralized Hub for Securing a Network of IoT Devices (Kavinga Yapa Abeywardena, A. M. I. S. Abeykoon, A. M. S. P. B. Atapattu, H. N. Jayawardhane, and C. N. Samarasekara)
Entropy Based Feature Pooling in Speech Command Classification (Christoforos Nalmpantis, Lazaros Vrysis, Danai Vlachava, Lefteris Papageorgiou, and Dimitris Vrakas)
Author Index
Feature Analysis for Aphasic or Abnormal Language Caused by Injury

Marisol Roldán-Palacios and Aurelio López-López(B)

Instituto Nacional de Astrofísica, Óptica y Electrónica, Luis Enrique Erro No. 1, Sta. María Tonantzintla, 72840 Puebla, México
{marppalacios,allopez}@inaoep.mx
https://ccc.inaoep.mx/
Abstract. The interest in researching alterations in aphasic or abnormal language has been maintained and keeps growing, following a diversity of approaches. In particular, the period corresponding to the post-traumatic recovery stage in impaired language caused by a traumatic brain injury has been scarcely explored. The findings reported here specify the steps followed during the inspection of a lexical feature set, with the objective of determining which features contribute most to describing TBI language singularities throughout the first two stages of recovery (after three and six months). For this purpose, we employ language technologies, statistical analysis, and machine learning. Starting with a 25-feature set, after conducting an analytical process, we attained three selected sets with fewer attributes, having comparable efficacy measures to those obtained from the complete feature set taken as baseline.

Keywords: TBI-abnormal language · Feature selection · Language technologies · Statistical learning · Machine learning

1 Introduction
Language impairment arises from different illnesses, having a variety of causes. In this work, we focus on the analysis of abnormal language caused by a traumatic brain injury (TBI). According to [15], this language disorder cannot be explained with the characteristics, or at least part of them, describing aphasic language cases from other causes [1,7]. Furthermore, we concentrate on the post-traumatic recovery period, which has been barely studied. With a computational linguistic approach, a more complex analysis can be achieved, which allows supporting or discarding a hypothesis in less time, as well as extending the studies regarding the characteristics describing TBI-abnormal language. Efforts have addressed the identification of proper metrics solidly defining the deficiencies in the language. These aim to open new possibilities for application in therapy, to improve recovery.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
K. Arai (Ed.): Intelligent Computing, LNNS 285, pp. 1–16, 2021. https://doi.org/10.1007/978-3-030-80129-8_1
Among the contributions of the study reported here are the following. Some core TBI-language feature sets were identified, leading to efficacy values of the learning algorithms similar to those achieved when evaluating with the complete initial TBI-attribute set under a variety of schemes. An analytical exploration based on correlation in the starting feature set under analysis is presented. Moreover, a possible role of the impurity measure as a convenient evaluator, grading features directly or together with learning algorithms based on it, was found. The paper is organized as follows. Section 2 briefly discusses some of the relevant results reached in previous investigations, the methodologies they tried and introduced, as well as potential language features to describe the subject of study. Section 3 sums up the steps carried out to complete the present analysis. Section 4 is subdivided to include specific information about the corpus (4.1), implementation (4.2), and preliminary results (4.3), ending the section, and the work, with a discussion of the results presented (4.4) and some additional observations (4.5).
2 Related Work
After linguistics researchers started extensive investigations of this problem decades ago, mainly doing their analysis and computation by hand, computational techniques have been gradually incorporated. For instance, the application of natural language processing (NLP) in [12] examines disabilities in the language of a heterogeneous group of participants with different diagnostics, having a variety of causes. A language analysis was completed based on transcripts created by automatic speech recognition (ASR), from which three sets of characteristics were extracted: from a Linguistic Content Analysis (LCA), Part-of-Speech (POS) tags, and Linguistic Inquiry and Word Count (LIWC) processing. Each of the above mentioned techniques provides a variety of characteristics. The first set contributes a quantitative view of word pattern usage. From the next group, for instance, noun-by-verb frequencies are extracted. Among the categories that LIWC provides are psychological processes, linguistic dimensions (articles, negations, and so on), and those relative to time and space, from which frequencies are also obtained. After ranking characteristics through an analysis of variance (ANOVA), they were explored with varied learning algorithms in WEKA, such as SimpleLogistic, J48 (decision tree), and Multilayered Perceptron (neural networks) [12]. Along the same line, further investigations were conducted [4–6,13]. Such works incorporated statistical criteria to qualify features, combined with ablation experiments [5,6]. Also, some efforts [13] were made to avoid the overfitting occurring in earlier investigations (given the limited size of datasets), and tried some other learning algorithms, such as Naïve Bayes and Support Vector Machines, besides Logistic Regression. Observing that data entry quality is a cornerstone in the complexity of these types of studies, a knowledge model [11] was introduced to work on the feature filtering, based on an ACT-R [10] architecture. Furthermore, language
models (LMs) were employed in [3,14], which were trained on Part-of-Speech tag sequences. The selected resulting feature sets were tested with SVM [3,11], Naïve Bayes [3,11], Decision Trees [11], Neural Networks [3,11], and Boosting with Decision Stumps [3]. These last two works focused on discriminating specific language impairment (SLI) among infants or teenagers.
3 Methodology
The methodology behind our findings is illustrated in Fig. 1 and consists of the following phases. After extracting the initial set of characteristics proper to the language samples, a set of evaluators is incorporated to analyze the features, leading to a ranking. In parallel with this analysis, root nodes are generated and selected employing a tree-based algorithm, so that both criteria can be compared to identify promising candidate feature sets. To evaluate these candidate feature sets, a wrapping phase follows, which involves setting up the learning algorithms and the training and testing steps. According to the assessment of the results, we can go back to evaluator selection or consider a new feature set.
Fig. 1. Methodology scheme.
4 Experiments
Next, we report details of the analysis conducted in our work: the corpus source, the considered tasks, the employed tools, the feature set, the steps followed, and the achieved results.

4.1 Data Set
Our samples were extracted from information collected as part of a long-term research project [15] called "Communication recovery after TBI", led by Prof. Leanne Togher. For the current analysis, the first two phases of recovery are considered, and our data set consists of more than 40 instances per phase. To complement our study, an additional set of non-aphasic samples was taken from a different study [2], led by Coelho, also aimed at analyzing language in TBI cases. Table 1 summarizes details of the data set.

Table 1. Demographics
Group  | Male | Female | Age Range | Task
TBI-03 | 35   | 10     | 17–66     | Retelling Cinderella story
TBI-06 | 36   | 8      | 17–66     | Retelling Cinderella story
NTBI   | 31   | 16     | 16–63     | A story based on The Runaway picture
The first two sets described in Table 1 refer to subjects in recovery after suffering a brain injury that caused some kind of language impairment (e.g., aphasia, dysarthria, or apraxia), registered after three and six months, respectively. Also, notice that the transcriptions analyzed in our TBI-study group are retellings of the Cinderella story, while our contrastive (negative) set consists of narrations of a story based on The Runaway picture, a piece authored by Norman Rockwell.

Transcripts are files created in a special format that synthesizes all the information that CLAN [8] requires to generate the different indices available. They are organized into tiers; for example, the gem tier (@G), among others, supports the identification of the task referred to by the lines that follow it. The tag (*PAR) indicates the participant tier, and that of the investigator is marked by (*INV). There are also the morpheme (%mor) and grammar (%gra) tiers. A short excerpt of such a transcript follows.

@G: Cinderella
*PAR: &-um the two wicked stepsisters &-um up to these old tricks.
%mor: det:art|the det:num|two adj|wicked step#n|sister-PL adv|up prep|to det:dem|these adj|old n|trick-PL .
%gra: 1|2|DET 2|4|QUANT 3|4|MOD 4|0|INCROOT 5|4|JCT 6|4|JCT 7|9|DET 8|9|MOD 9|6|POBJ 10|4|PUNCT
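The excerpt above is in the CHAT format used throughout Talkbank. Purely as a rough illustration of the tier structure (the actual feature extraction in this work relies on CLAN, described in Sect. 4.2), the tiers can be grouped with a few lines of Python; the file name below is a hypothetical placeholder.

```python
# Toy reader that groups CHAT-style transcript lines by their tier label.
# Only for illustration: real analyses should use CLAN, which handles the full format.
from collections import defaultdict

def read_tiers(path):
    tiers = defaultdict(list)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or ":" not in line:
                continue
            label, _, content = line.partition(":")
            if label.startswith(("@", "*", "%")):   # gems, speakers, dependent tiers
                tiers[label].append(content.strip())
    return tiers

tiers = read_tiers("participant_tbi03.cha")         # hypothetical transcript file
print(len(tiers["*PAR"]), "participant utterances")
print(tiers["%mor"][:1])                            # first morphological tier line
```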
4.2 Further Experimental Details
From CHILDES to Talkbank. The Child Language Data Exchange System (CHILDES) project was first conceived in 1981 and materialized in 1984. Under continuous growth and development, the CHILDES project started a transformation in 2001, eventually becoming the Talkbank project, with a long list of contributors and representing an investment of thousands of hours in collecting, transcribing, and reviewing data. Among the different corpora included in Talkbank [8], we focus on the Togher corpus [15].

CLAN. This acronym stands for Computerized Language ANalysis. It is a free resource available on the Talkbank platform. Among several other functions, the CLAN tool automatically computes a variety of indices oriented to the study of language, i.e., in this context, language disorders [8]. The set of features we examine in the present work was extracted employing CLAN, with the command c-nnla, which automatically computes the Northwestern Narrative Language Analysis profile according to the rules established in the NNLA manual [16]. This has proved to achieve an accuracy comparable to that of calculations done by hand by experts [8]. Nowadays, c-nnla provides more than forty features, of which we are revising a subset of only twenty-five. With this set of attributes, we basically cover a wide lexical study. Table 2 shows the detailed list of attributes on which we focus our analysis.¹

For a few of the features included in Table 2, we provide some clarification. In particular, for Utterance there is no universal linguistic definition; however, "phonetically an utterance is a unit of speech bounded by silence".² The open-class notion includes nouns, verbs, adjectives, and adverbs [16], and the close-class category encloses auxiliaries, complementizers, conjunctions, determiners, modals, the negation particle, prepositions, pronouns, quantifiers, the infinitival marker, and wh-words [16].

Selecting the More Discriminative Feature Subset. There are two main approaches [17] regarding an attribute filtering process. When filtering is done prior to the learning phase, it is called a filter method. On the contrary, when the learning algorithm itself encompasses the selection stage, this is called a wrapper method. Usually, they are combined in a cycle of feature nomination, closely related to the subject under analysis.

Ranking Features. As stated by [17], "there is no universally accepted measure of relevance, although several different ones have been proposed". So, we start by inspecting which attribute or subset evaluator could provide more information for our purpose.
¹ Originally, the ratio Utts/Min was not part of the c-nnla profile, but we added it to this set of features.
² Source: https://glossary.sil.org/term/utterance
Table 2. C-NNLA feature subset
FN | Feature                  | Description
1  | DurationSec              | Duration in seconds
2  | WordsPerMin              | Words/Min
3  | TotalUtts                | # of Utterances
4  | TotalWords               | # of Words
5  | MLUWords                 | Mean Length Utterance
6  | UttsPerMin               | Utterances/Min
7  | openClass                | Open-class
8  | propOpenClassPerWords    | %open-class/Words
9  | closedClass              | Close-class
10 | propClosedClassPerWords  | %close-class/Words
11 | PropOpenPerClosed        | %open-class/close-class
12 | Nouns                    | # of Nouns
13 | propNounsPerWords        | %Nouns/Words
14 | Verbs                    | # of Verbs
15 | propVerbsPerWords        | %Verbs/Words
16 | nounPerverb              | Nouns/Verbs
17 | adj                      | # of Adjectives
18 | adv                      | # of Adverbs
19 | det                      | # of Determinants
20 | pro                      | # of Pronouns & possessive determiners*
21 | aux                      | # of auxiliaries
22 | conj                     | # of conjunctions
23 | complementizers          | # of complementizers
24 | modals                   | # of modals
25 | prep                     | # of prepositions
* Excluding "wh" interrogative pronouns and "wh" relative pronouns.
4.3 Preliminary Results
Starting with one of the most extensively used criteria, i.e., information gain, the statistics reveal that the features propVerbsPerWords and nounPerVerb, corresponding to feature numbers (FN) 15 and 16, are the only important components in the group, with average merit values of 0.013 ± 0.039 and 0.013 ± 0.038, respectively. However, we observed that they appear in the last places in the ranking column. Additionally, several preliminary data explorations revealed openClass and closedClass as noisy or non-discriminative attributes in this 25-feature set, yet placed on top of the corresponding ranking. In consequence, we opted to try
a series of different ranking criteria. Most of these rankings had openClass and closedClass in common on top, with the exception of ImpurityEuclid, which guides a part of our next experiments (a code sketch of this ranking idea is given after Table 3 below). We will explain a pair of rounds of the experiments conducted throughout the analysis of the 25-feature CNNLA set.

Outlier Case. Regarding the TBI language samples, the group of interest, it is not possible to work with an outlier notion, because we are interested in recognizing and modeling each alteration that the language presents. However, regarding the negative sample information, the notion is plausible, and the continuation of the data exploration led us to identify an outlier instance in it, given that this sample presented some indices doubling the next minimum in the set.

We tried a set of learning algorithms to test the discriminative capability of the selected feature group. The viability of Naïve Bayes (NB) in a variety of language studies led us to consider it as one of the first options for the analysis. Given that the impurity indicator seems to be applicable in our context, it was natural to consider the SPAARC and Random Forest (RF) algorithms. Although RF is itself an ensemble method, we included AdaBoostM1 too, and we also considered a learning algorithm based on regression, i.e., ClassificationViaRegression.

Based on the observation that the impurity metric was applicable to reveal some discriminating features, and considering that the basis on which a tree learning algorithm branches is to reduce the impurity grade in the newly created nodes, we proceeded as follows. We trained a tree learning algorithm over a random sample split of 60% from the 25-CNNLA feature set, to then analyze the root node clusters generated. Recognizing potential prospects based on impurity grading, along with their Pearson correlation, we determined a part of the selected feature set. Table 3 enumerates the selected feature group.

Table 3. Selected feature set.
FN | Feature           | Description
3  | TotalUtts         | # of Utterances
4  | TotalWords        | # of Words
5  | MLUWords          | Mean Length Utterance
13 | propNounsPerWords | %Nouns/Words
16 | nounPerVerb       | Nouns/Verbs
21 | aux               | # of auxiliaries
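The grading described above was done with WEKA evaluators; as a hedged sketch of the same idea (not the exact InfoGain and ImpurityEuclid implementations used), features can be ranked in Python with mutual information and with the impurity reduction of single-feature decision stumps. The file name, column layout, and binary label encoding are assumptions.

```python
# Sketch: grade the 25 C-NNLA features by information gain (mutual information)
# and by the Gini-impurity reduction of a one-split decision stump per feature.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv("tbi03_cnnla.csv")            # hypothetical: 25 features + binary "label"
X, y = data.drop(columns="label"), data["label"]

info_gain = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)

def stump_impurity_drop(column):
    # Rough impurity-gain proxy: root impurity minus mean child impurity after one split.
    stump = DecisionTreeClassifier(max_depth=1, criterion="gini").fit(column.to_frame(), y)
    impurity = stump.tree_.impurity
    return impurity[0] - impurity[1:].mean() if stump.tree_.node_count > 1 else 0.0

impurity_drop = X.apply(stump_impurity_drop)

ranking = pd.DataFrame({"info_gain": info_gain, "impurity_drop": impurity_drop})
print(ranking.sort_values("impurity_drop", ascending=False))
```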
Tables 4 and 5 summarize the results for the samples of the first two recovery stages once the outlier was removed. Every learning algorithm was applied with a leave-one-out cross validation, given the reduced number of instances. In order to
compare our findings, we evaluate the full 25-CNNLA feature set with the whole proposed group of learning algorithms, and we set as baseline the highest value of the overall evaluation for each recovery data set (a sketch of this evaluation protocol is given after Table 5 below).

Table 4. Testing selected feature group once outlier was excluded - TBI03.
Learning Algorithm                       | Precision | Recall | F1 Measure | Accuracy
Baseline (Sequential Min. Optimization)  | 0.921     | 0.778  | 0.843      | 85.227
Naive Bayes                              | 0.867     | 0.578  | 0.693      | 73.864
Sequential Min. Optimization             | 0.963     | 0.578  | 0.722      | 77.273
SPAARC                                   | 0.829     | 0.756  | 0.791      | 79.545
Random Forest                            | 0.804     | 0.822  | 0.813      | 80.682
AdaBoostM1                               | 0.872     | 0.756  | 0.810      | 81.818
Classification Via Regression            | 0.889     | 0.889  | 0.889      | 88.636
Table 5. Testing selected feature group - TBI06.
Learning Algorithm                       | Precision | Recall | F1 Measure | Accuracy
Baseline (Classification Via Regression) | 0.909     | 0.952  | 0.930      | 92.941
Naive Bayes                              | 0.909     | 0.952  | 0.930      | 92.941
Sequential Min. Optimization             | 0.833     | 0.714  | 0.769      | 78.823
SPAARC                                   | 0.833     | 0.833  | 0.833      | 83.529
Random Forest                            | 0.878     | 0.857  | 0.867      | 87.059
AdaBoostM1                               | 0.902     | 0.881  | 0.892      | 89.412
Classification Via Regression            | 0.929     | 0.929  | 0.929      | 92.941
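For reference, the evaluation protocol behind Tables 4 and 5 can be sketched as follows with scikit-learn stand-ins: Naïve Bayes and Random Forest have direct analogues, a linear SVM approximates SMO, and AdaBoost is available, while SPAARC and ClassificationViaRegression are WEKA-specific schemes omitted here. The data file and the binary 0/1 label column are assumptions.

```python
# Sketch: leave-one-out evaluation of several classifiers on a selected feature subset.
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

data = pd.read_csv("tbi03_selected.csv")          # hypothetical: selected features + "label"
X, y = data.drop(columns="label"), data["label"]  # y assumed encoded as 0 (non-TBI) / 1 (TBI)

models = {
    "Naive Bayes": GaussianNB(),
    "Linear SVM (SMO analogue)": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=LeaveOneOut())   # one held-out instance per fold
    print(f"{name:26s} P={precision_score(y, pred):.3f} R={recall_score(y, pred):.3f} "
          f"F1={f1_score(y, pred):.3f} Acc={accuracy_score(y, pred) * 100:.2f}")
```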
Applying Normalization. Among the measurement units in the 25-CNNLA feature set, we find, for instance, words, utterances, utterances per minute, and percentages, i.e., they lie in different ranges. In order to analyze this factor, we carried out a normalization of the CNNLA feature set, to have them in a unified
interval. The applied transformation, similar to that implemented in [9], is given by formula (1),

\hat{v}_f = \frac{v_f \times 100}{\sum_{i=1}^{n} v_{f_i}}    (1)

where v_f is the vector corresponding to each feature, n is the number of instances, and f ∈ [1, 25]. This transformation aims to compute the proportion of each instance value within its own range, using a factor of one hundred to avoid introducing zeros for small values (a code sketch of this transformation is given after Table 7 below). Under these conditions, a pair of candidate attribute subsets was obtained, where Tables 6 and 7 specify the promising selected feature sets for the first and second recovery stages, respectively. Tables 8 and 9 include the results achieved when tested on the collection, corresponding to the first and second recovery stages, respectively. Once more, for each learning algorithm tested, a leave-one-out cross validation was applied.

Table 6. Selected feature set from normalized data - TBI03.
FN | Feature                 | Description
6  | UttsPerMin              | Utterances/Min
7  | openClass               | Open-class
8  | propOpenClassPerWords   | %open-class/Words
10 | propClosedClassPerWords | %close-class/Words

Table 7. Selected feature set from normalized data - TBI06.
FN | Feature                 | Description
8  | propOpenClassPerWords   | %open-class/Words
10 | propClosedClassPerWords | %close-class/Words
14 | Verbs                   | # of Verbs
16 | nounPerVerb             | Nouns/Verbs
17 | adj                     | # of Adjectives
24 | modals                  | # of Modals
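A minimal sketch of the transformation in formula (1), applied column-wise to the instance-by-feature matrix (the variable names are ours, not from the original implementation):

```python
# Sketch of formula (1): each value becomes its proportion of the feature's column total, x100.
import numpy as np

def normalize_features(X):
    """X: array of shape (n_instances, n_features); returns the formula (1) transform."""
    X = np.asarray(X, dtype=float)
    column_totals = X.sum(axis=0, keepdims=True)   # sum over the n instances, per feature
    return X * 100.0 / column_totals               # assumes no all-zero feature column

demo = np.array([[2.0, 50.0],
                 [3.0, 150.0],
                 [5.0, 200.0]])
print(normalize_features(demo))                    # each column now sums to 100
```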
Table 8. Testing selected group from normalized feature set - TBI03.
Learning Algorithm              | Precision | Recall | F1 Measure | Accuracy
Baseline (Random Forest)        | 0.813     | 0.867  | 0.839      | 82.954
Naive Bayes                     | 0.714     | 0.778  | 0.745      | 72.727
Sequential Min. Optimization    | 0.652     | 0.956  | 0.775      | 71.591
SPAARC                          | 0.702     | 0.733  | 0.717      | 70.454
Random Forest                   | 0.875     | 0.778  | 0.824      | 82.954
AdaBoostM1                      | 0.860     | 0.822  | 0.841      | 84.091
Classification Via Regression   | 0.894     | 0.933  | 0.913      | 90.909
Table 9. Testing selected group from normalized feature set - TBI06.
Learning Algorithm              | Precision | Recall | F1 Measure | Accuracy
Baseline (Random Forest)        | 0.837     | 0.857  | 0.847      | 84.706
Naive Bayes                     | 0.600     | 0.643  | 0.621      | 61.176
Sequential Min. Optimization    | 0.357     | 0.238  | 0.286      | 41.176
SPAARC                          | 0.614     | 0.643  | 0.628      | 62.353
Random Forest                   | 0.854     | 0.833  | 0.843      | 84.705
AdaBoostM1                      | 0.659     | 0.690  | 0.674      | 67.059
Classification Via Regression   | 0.947     | 0.857  | 0.900      | 90.588
Table 10. Confusion matrix - Classification Via Regression - TBI03 (outlier case excluded).
40 (True-Positive) | 5 (False-Positive)
5 (False-Negative) | 38 (True-Negative)
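The efficacy values in the tables follow directly from such counts; for instance, the Table 10 cells reproduce the 88.64% accuracy and the 0.889 precision/recall reported for Classification Via Regression in Table 4:

```python
# Recomputing the Table 4 CVR metrics from the Table 10 confusion-matrix counts.
tp, fp, fn, tn = 40, 5, 5, 38

precision = tp / (tp + fp)                         # 40/45 = 0.889
recall = tp / (tp + fn)                            # 40/45 = 0.889
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)         # 78/88 = 0.88636

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} Acc={accuracy * 100:.3f}")
```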
4.4 Discussion of Results
Regarding the first recovery stage, after removing the uniquely recognized outlier, CVR works well with an accuracy of 84.09, similar to that achieved by RF, both showing a balance in instances correctly discriminated with the full 25-feature set. However, they were surpassed by SMO with an accuracy of 85.23, determining in this way the baseline for this particular experiment. We identify a selected group of seven features, listed in Table 3, based on the 25-CNNLA feature set, achieving with the CVR scheme a slightly better efficacy of 88.64 in comparison with that calculated over the full attribute group. Besides, a balance among instances correctly discriminated is maintained, as Table 10 shows. We have to remark that three of the attributes contained in this cluster, aux, nounPerVerb, and proNounsPerWords, were positioned in last places when they were graded according to their InfoGain property.

For the data corresponding to the second recovery stage (i.e., six months), a high level of accuracy of 92.94 was reached for the basic 25-attribute group with the CVR approach. With the same 7-feature set (Table 3), it was possible to have an efficacy as good as that obtained for the established baseline, once again with the CVR learning algorithm, and, in both settings, with a steady balance in cases correctly discriminated.

Moving on to review the results when a transformation is applied to the data, with the aim of bringing the attribute values to the same scale, the first point to notice is the drastic reorganization in the bias that each element has within the set under inspection. That is, proportion indices that were null or noisy in previous evaluations are now positively weighted. Again, repeating the process on the reshaped 25-CNNLA feature set for each recovery stage, we select not one but two feature subsets, as listed in Tables 6 and 7.

For the initial stage of recovery (i.e., three months), we have a reduced 4-feature set in which three of the four elements are noisy or non-discriminative for the unscaled set, with UttsPerMin being the only one not in that role. The overall accuracy was computed and the baseline was established at 82.95, this time with Random Forest, which steadily achieves the same assessment on the 4-feature set, however with a decrement in the balance of instances accurately discriminated. The highest value of 90.91 is obtained again with CVR, and with a slightly wider margin than in earlier cases, managing a good composition of samples rightly discriminated, as the confusion matrix (2) exhibits.

\begin{pmatrix} 42 & 3 \\ 5 & 38 \end{pmatrix}    (2)

Given that testing this last 4-feature set with the second recovery phase data sample produced poor results, a new representative feature set (Table 7) was explored. This time, it was necessary to incorporate a new grading test, weighting instances in a neighborhood by their inverse square distance with respect to the evaluated instance; this, along with the impurity evaluation, guided the selection of the feature set (Table 7). Notice that this is a new composition, but it shares a couple of elements with the 4-feature set (Table 6) and one with the first group presented (Table 3).
Efficacy also decreases in these conditions: the accuracy baseline was determined to be 84.71, obtained with RF, which was surpassed only by CVR, achieving a value of 90.59, but showing a drop in the balance of correctly discriminated samples (3), charged to the negative sample.

\begin{pmatrix} 36 & 6 \\ 2 & 41 \end{pmatrix}    (3)

Regarding the whole feature set F as the union of component subsets (4), where 0 < n ≤ m and m ≤ size(F), the idea behind finding the more descriptive feature set F_W (5) consists of uncovering the most positively weighted feature f_w of each subset f_j. Notice that f_k ∩ f_l = ∅ for k ≠ l.

F = \bigcup_{j=n}^{m} f_j    (4)

F_W = \bigcup_{w=1}^{m} f_w    (5)
Based on that, we extend the discussion for the selected 6-feature set (Table 7) corresponding to the second recovery stage. Figure 2 exhibits the strong inverse relation existing between the attributes propOpenClassPerWord and propClosedClassPerWords, with a diagonal instance distribution having a negative slope, which indicates in our context that they are quite descriptive features. The charts in Fig. 3 show the null or partial relation existing between propOpenClassPerWord and the remaining attributes in the selected 6-feature set.
Fig. 2. Correlation propOpenClassPerWord vs. propClosedClassPerWords - TBI06 normalized
Fig. 3. Correlation propOpenClassPerWord vs. remaining components - TBI06 normalized
Fig. 4. PCA graph - selected normalized 6-feature set TBI06
The relations previously described become quite evident in a principal component analysis (PCA) graph, such as that of Fig. 4, where the corresponding vectors of the 6-feature subset reveal the relationships among them; for instance, the propOpenClassPerWord and propClosedClassPerWords features are clearly inversely related. Also, a certain independence among the remaining attributes is noticeable, adding evidence to the efficacy of the six-feature subset. It should be noted that some of the features are redistributed in comparison with the position they have in the starting 25-CNNLA feature set, i.e., it is not possible to identify them at a direct glance in the PCA graph corresponding to the original attribute group.
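A projection like the one in Fig. 4 can be reproduced along the following lines; this is a hedged sketch with scikit-learn and matplotlib, and the file name and plotting details are assumptions rather than the code behind the original figure.

```python
# Sketch: PCA of the selected TBI06 feature subset, with feature loading vectors overlaid.
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

features = ["propOpenClassPerWords", "propClosedClassPerWords",
            "Verbs", "nounPerVerb", "adj", "modals"]
data = pd.read_csv("tbi06_normalized.csv")        # hypothetical normalized data file
X = data[features]

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

plt.scatter(scores[:, 0], scores[:, 1], s=12, alpha=0.6)
for name, (dx, dy) in zip(features, pca.components_.T):
    plt.arrow(0, 0, dx, dy, head_width=0.02, color="tab:red")   # loading vector per feature
    plt.annotate(name, (dx, dy))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```

Inversely related features, such as the two proportion attributes above, appear as arrows pointing in opposite directions in this kind of plot, which is the pattern discussed for Fig. 4.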
5 Conclusions
The inherent relationships among features in the sets are complex and strongly related to the subject of study, and the larger the feature sets, the more complex the way they behave. In addition, we frequently have to handle heterogeneous groups of features that convey still more elaborate connections.

The presented work summarizes the results of a micro-level analysis, that is, an inspection bounded to the lexical level, a piece of a larger work in the direction of understanding how TBI-abnormal language evolves during the post-traumatic recovery stages. Having attained reasonable efficacy measures with a quarter of the 25-feature set, we have to keep in mind the stability of the CVR algorithm as a wrapping method in feature selection and its steadiness, along with that of RF, throughout the whole analysis, in addition to the support that the tree-based algorithm and the ImpurityEuclid analysis, as filter methods, provide to this methodology in guiding further examinations.

Having acceptable efficacy, higher than or equal to the baseline, does not allow us to assert that the reported results are conclusive. It is evident that there are intrinsic relationships in the interior of the feature set still unrevealed. Further research basically aims for extension in three directions: (i) we will cover a wider range of features than that considered here; (ii) we will expand the analysis to a macro-level, i.e., incorporating syntactic components; and (iii) additionally, we plan to work on further stages of recovery to understand its evolution. Besides this, each step unavoidably implies the revision of techniques, methods, and methodology to better understand the results and the studied impaired language.

Acknowledgments. The first author was supported by CONACyT, through scholarship 1008734. The second author was partially supported by SNI.
References

1. Linnik, A., Bastiaanse, R., Höhle, B.: Discourse production in aphasia: a current review of theoretical and methodological challenges. Aphasiology 30(7), 765–800 (2015). https://doi.org/10.1080/02687038.2015.1113489
2. Coelho, C.A., Grela, B., Corso, M., Gamble, A., Feinn, R.: Microlinguistic deficits in the narrative discourse of adults with traumatic brain injury. Brain Inj. 19(13), 1139–1145 (2005)
3. Gabani, K., Sherman, M., Solorio, T., Liu, Y., Bedore, L.M., Peña, E.D.: A corpus-based approach for the prediction of language impairment in monolingual English and Spanish-English bilingual children. In: Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL, Boulder, Colorado, June 2009, pp. 46–55. Association for Computational Linguistics (2009)
4. Fraser, K.C., et al.: Automated classification of primary progressive aphasia subtypes from narrative speech transcripts. Cortex 55, 43–60 (2013). https://doi.org/10.1016/j.cortex.2012.12.006
5. Fraser, K.C., Hirst, G., Graham, N.L., Meltzer, J.A., Black, S.E., Rochon, E.: Comparison of different feature sets for identification of variants in progressive aphasia. In: Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, Baltimore, Maryland, USA, 27 June 2014, pp. 17–26. Association for Computational Linguistics (2014)
6. Fraser, K.C., Hirst, G., Meltzer, J.A., Mack, J.E., Thompson, C.K.: Using statistical parsing to detect agrammatic aphasia. In: Proceedings of the 2014 Workshop on Biomedical Natural Language Processing (BioNLP 2014), Baltimore, Maryland, USA, 26–27 June 2014, pp. 134–142. Association for Computational Linguistics (2014)
7. Bryant, L., Ferguson, A., Spencer, E.: Linguistic analysis of discourse in aphasia: a review of the literature. Clin. Linguist. Phonetics (2016). https://doi.org/10.3109/02699206.2016.1145740
8. MacWhinney, B.: The CHILDES Project: Tools for Analyzing Talk, 3rd edn. Lawrence Erlbaum Associates, Mahwah (2000)
9. Marini, A., Andreetta, S., del Tin, S., Carlomagno, S.: A multi-level approach to the analysis of narrative language in aphasia. Aphasiology 25(11), 1372–1392 (2011). https://doi.org/10.1080/02687038.2011.584690
10. Oliva, J., Serrano, J.I., Del Castillo, M.D., Iglesias, A.: Cognitive modeling of the acquisition of a highly inflected verbal system. In: Salvucci, D., Gunzelmann, G. (eds.) Proceedings of the 10th International Conference on Cognitive Modeling, pp. 181–186. Drexel University, Philadelphia (2010)
11. Oliva, J., Serrano, J.I., del-Castillo, M.D., Iglesias, A.: A methodology for the characterization and diagnosis of cognitive impairment - application to specific language impairment. Artif. Intell. Med. 61, 89–96 (2014)
12. Peintner, B., Jarrold, W., Vergyri, D., Richey, C., Gorno-Tempini, M.A., Ogar, J.: Learning diagnostic models using speech and language measures. In: 30th Annual International IEEE EMBS Conference, Vancouver, British Columbia, Canada, 20–24 August 2008 (2008)
13. Rentoumi, V., et al.: Automatic detection of linguistic indicators as a means of early detection of Alzheimer's disease and of related dementias: a computational linguistics analysis. In: 8th IEEE International Conference on Cognitive Infocommunications (CogInfoCom 2017), 11–14 September 2017 (2017)
14. Solorio, T., Yang, L.: Using language models to identify language impairment in Spanish-English bilingual children. In: BioNLP 2008: Current Trends in Biomedical Natural Language Processing, Columbus, Ohio, USA, June 2008, pp. 116–117. Association for Computational Linguistics (2008)
15. Stubbs, E., et al.: Procedural discourse performance in adults with severe traumatic brain injury at 3 and 6 months post injury. Brain Inj. 32(2), 167–181 (2018). https://doi.org/10.1080/02699052.2017.1291989
16. Thompson, C.K.: Northwestern Narrative Language Analysis (NNLA) Theory and Methodology. Aphasia and Neurolinguistics Research Laboratory, Northwestern University, Evanston, IL (2013)
17. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, Burlington (2005)
Automatic Detection and Segmentation of Liver Tumors in Computed Tomography Images: Methods and Limitations

Odai S. Salman1(B) and Ran Klein1,2

1 Department of Systems and Computer Engineering, Carleton University, Ottawa, ON, Canada
[email protected]
2 Department of Nuclear Medicine, The Ottawa Hospital (TOH), Ottawa, ON, Canada
[email protected]
Abstract. Liver tumor segmentation in computed tomography images is considered a difficult task, especially for highly diverse datasets. This is demonstrated by top-ranking results of the Liver Tumor Segmentation Challenge (LiTS) achieving ~70% dice. To improve upon these results, it is important to identify sources of limitations. In this work, we developed a tumor segmentation method following automatic liver segmentation and conducted a detailed limitation analysis study. Using LiTS dataset, tumor segmentation results performed comparable to stateof-the-art literature, achieving overall dice of 71%. Tumor detection accuracy reached ~83%. We have found that segmentation’s upper limit dice can reach ~77% if all false-positives were removed. Grouping by tumor sizes, larger tumors tend to have better segmentation, reaching a maximum approximated dice limit of 82.29% for tumors greater than 20,000 voxels. Medium and small tumor groups had an upper dice limit of 78.75% and 63.52% respectively. The tumor dice for true-positives was comparable for ideal (manual) vs. automatically segmented liver, reflecting a well-trained organ segmentation. We conclude that the segmentation of very small tumors with size values < 100 voxels is especially challenging where the system can be hyper-sensitive to consider local noise artifacts as possible tumors. The results of this work provide better insight about segmentation system limitations to enable for better false-positive removal development strategies. Removing suspected tumor regions less than 100 voxels eliminates ~80% of the total false-positives and therefore, may be an important step for clinical application of automated liver tumor detection and segmentation. Keywords: Tumor detection · Tumor segmentation · Organ segmentation · Deep learning · Image standardization
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2021
K. Arai (Ed.): Intelligent Computing, LNNS 285, pp. 17–35, 2021. https://doi.org/10.1007/978-3-030-80129-8_2

1 Introduction

The main challenge for tumor detection and segmentation is the wide variation of tumors with respect to size, shape, intensity, contrast, and location. This creates a complex diversity for an observer, human or machine, to handle. In addition, there are variations in the global image presentation such as image acquisition and reconstruction methods,
voxel size, anatomical region coverage and many other parameters. Therefore, minimizing these image variations helps focus on tumor-specific variations. In a previous work [1], we showed that image standardization significantly improved organ (liver) segmentation in computed tomography (CT) scans by reducing the image-specific variations. Our method consisted of a series of steps to localize the abdominal region, then set a fixed size and coverage for that region, and finally conduct the organ segmentation. The segmented region is then inverse-standardized, so it corresponds to the original image coordinates. Similarly, a system that standardizes/inverse-standardizes both imaging and tumor variations may boost tumor detection and segmentation performance. In this work we build on the liver organ segmentation framework to detect and segment liver tumors.

An intuitive approach is to design two modules: the first eliminates imaging-related variations, and the second handles tumor variations. However, since tumor and imaging-related variations are interdependent, their separation is not trivial. For instance, standardizing slice thickness may cause some tumors to disappear due to loss of axial resolution when down-sampling. On the other hand, up-sampling may stretch some tumors in a way that the image interpolation algorithms make some tumors' masks lose shape information. Moreover, interpolation algorithms introduce blur, compromising potentially useful texture information. Using nearest-value interpolation also does not help if specifically intended for 3D processing, where tumors can take pixelated shapes along the axial dimension. In other words, applying any morphing standardization may cause loss of information regarding tumor specifications, critically needed for their detection or segmentation, especially for 3D processing.

Considering the above factors, it is important to avoid applying these morphing operations and to limit focus to standardizing only the common information the images naturally share. Fortunately, crop/pad operations can be used to standardize image representation while preserving the image texture. Following such an option necessitates using 2D processing instead of 3D as an initial step. Amongst CT images, the transversal view is the most consistent. Even though the field-of-view (FOV) may vary, its impact is not harmful compared to slice thickness variations and can be handled by cropping/padding as shown by [1]. With slice thickness being the most harmful factor requiring morphing operations to be standardized, it is avoided by eliminating the axial dimension through processing one transversal slice at a time. Of course, this means losing contextual information along that dimension. Therefore, analytical post-processing context-based correction can be useful [2].

In the context of liver tumor detection and segmentation, a good starting point would be to focus on the segmented organ rather than including other regions in the image field-of-view. This way, the model may learn to distinguish tumors from normal tissue specific to the organ of interest, with the simplified task of producing only two major classes, normal tissue vs. tumor tissue. This work implements such a strategy, where the preprocessing stage replaces all regions outside the liver with voxel values imitating normal tissue on a standardized image intensity scale, hence reducing the impact of intensity variation.
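As a hedged illustration of the two preprocessing ideas just described, crop/pad standardization of the transversal plane without interpolation, and suppression of non-liver voxels with a value imitating normal tissue, consider the sketch below. The target size, the fill values (air at -1000 HU, liver-like tissue near 100 HU), and the array names are assumptions, not the paper's exact settings.

```python
# Sketch: standardize one transversal CT slice by center crop/pad (no resampling),
# after replacing everything outside the liver mask with a liver-like intensity.
import numpy as np

def crop_or_pad(slice_2d, target=(512, 512), fill=-1000.0):
    """Center-crop or pad a 2D slice to `target` without interpolation."""
    out = np.full(target, fill, dtype=slice_2d.dtype)
    src_h, src_w = slice_2d.shape
    dst_h, dst_w = target
    h, w = min(src_h, dst_h), min(src_w, dst_w)
    sy, sx = (src_h - h) // 2, (src_w - w) // 2    # source crop offsets
    dy, dx = (dst_h - h) // 2, (dst_w - w) // 2    # destination pad offsets
    out[dy:dy + h, dx:dx + w] = slice_2d[sy:sy + h, sx:sx + w]
    return out

def suppress_non_liver(slice_2d, liver_mask, normal_hu=100.0):
    """Replace voxels outside the liver mask with a normal-tissue-like value."""
    return np.where(liver_mask, slice_2d, normal_hu)

# Toy usage with a random slice and a rectangular "liver" mask.
ct_slice = np.random.randint(-1000, 400, size=(480, 620)).astype(np.float32)
liver = np.zeros(ct_slice.shape, dtype=bool)
liver[200:300, 250:400] = True
standardized = crop_or_pad(suppress_non_liver(ct_slice, liver), fill=100.0)
```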
In this work we present a novel approach for standardizing CT images, including anatomical and liver-organ segmentation, removal of regions outside the liver, and image intensity normalization. Following standardization, liver tumors are then segmented. We
compare our performance metrics to previously published results, investigate points of failure, explore possible solutions and discuss fundamental limitations to further enhancing segmentation performance. The rest of this article is structured as follows: Sect. 2 presents related work for liver tumor segmentation. Section 3 presents the methods. Section 4 presents the results along with their analysis. Section 5 provides detailed discussion. Finally, Sect. 6 presents conclusions.
2 Related Work
L. Chen et al. [3] proposed a cascaded approach that first segments the liver by fusing segmentation results from three networks (coronal, sagittal and transversal views) and then segments tumors within the fused liver region. For tumor segmentation, they used adversarial neural networks to discriminate tumors from normal tissue. On the Liver Tumor Segmentation dataset (LiTS) [4], they achieved a dice of 68% for liver tumors. G. Chlebus et al. [5] proposed using U-Net [6] to segment tumors in 2D slices. Following tumor segmentation, they developed an object-based false-positive reduction algorithm to achieve an overall dice score of 69% on the LiTS dataset. L. Bi et al. [7] proposed an approach that utilizes 2D slice-based segmentation using two ResNets [8]. The first network is fed CT image slices to produce two probability heatmaps (normal and tumor tissues) for each slice. Then, a CT slice is stacked with its segmentation results and fed to the second network; that is, the second network takes segmentation estimates for further refinement. They report two dice scores. The first, which is relatively low (~50%), was obtained on a development set taken from the training set. The score they obtained on the test set, on the other hand, reached 64%, ranking 4th in the LiTS challenge. This might indicate that the training set contains more difficult images than the testing set. Unfortunately, the challenge coordinators do not offer ground-truth labels for the testing set; hence, developers who conduct research using this dataset tend to dedicate a portion of the training set for testing. W. Li et al. [9] used 2D patch-based voxel segmentation, classifying voxels based on their neighborhood contents. From the tumor ground-truth, they selected a patch centered at each voxel. They trained AdaBoost [10], random forest (RF) [11] and support vector machines (SVMs) [12] on handcrafted patch features, and a convolutional neural network (CNN) [13] on the patch itself, to classify tumor vs. normal. They showed that the CNN achieved the highest dice, 80%, on a locally acquired dataset. One major drawback of such an approach is that it requires classifying many patches to segment a slice (one patch per voxel), which can be computationally very expensive for high-resolution images (e.g., small voxel size and/or thin slices). In their paper, they state that their method has limited capability in segmenting heterogeneous-texture tumors and tumors with fuzzy boundaries. Given their results, we speculate that their dataset may have contained such cases at a lower frequency than encountered in the LiTS dataset. L. Meng et al. [14] proposed a 3D dual-path convolutional neural network, where segmentation results at the end of the architecture are fused together. They also used
conditional random fields (CRF) [15] to reduce false-positive segmentations, resulting in an overall dice of 68% on the LiTS dataset. S. Zheng et al. [16] proposed an approach that takes 2D CT slices, denoises them using block-matching and 3D filtering (BM3D) [17], and feeds them to a fully convolutional network (FCN) [18] for segmentation. The 2D stacked segmentation results are then refined using 3D deformable models [19]. They trained their model on the LiTS dataset and tested on the 3D-IRCAD [20] dataset, under the implicit assumption that both datasets are similar in nature, which may not be the case: the best tumor dice achieved in the LiTS challenge was ~70%, whereas for 3D-IRCAD the worst reported performance was 73% [14]. Regardless, Meng et al. reported 84% performance on the 3D-IRCAD dataset. P. Christ et al. [21] proposed a cascaded approach for liver tumor segmentation in abdominal images. They preprocessed CT images by thresholding to the Hounsfield window [–100, 400], followed by histogram equalization. Limitations of using histogram equalization with no standardized input were discussed in [1]. They used cascaded fully convolutional neural networks (CFCN) followed by a 3D CRF. For testing, they used 3D-IRCAD and achieved a tumor segmentation dice of 82%. As the above literature demonstrates, performance measurements must be compared on the same datasets. Furthermore, for the LiTS challenge, despite different approaches, the upper limit of performance approaches a 70% dice.
3 Method
Our overall strategy consists of two main components. Module 1 is a segmentation module that processes 2D transversal slices one at a time. This module provides initial tumor segmentation, and it is expected to include false-positive regions. Module 2 is a patch-based approach designed to classify tumor regions identified in the first step into true-positive and false-positive regions. This article is solely dedicated to covering Module 1, with Module 2 to be covered in future work. We explore the limitations of Module 1 to identify the range of improvements that may be expected from Module 2. The workflow diagram is illustrated in Fig. 1.
3.1 Dataset
For this study, the LiTS [4] segmentation dataset was used, consisting of 131 CT scans. Of these, 81 were randomly selected for training, and 50 for testing. Generally, this dataset consists of images with a tight FOV around the patient. Slice thicknesses varied as 1.51 ± 1.18 [0.70, 5.00] mm (Mean ± Std [min, max]), as did the number of slices per image, 447 ± 275 [74, 987]. With respect to anatomical region coverage in the FOV, of the 131 images, 109 contain pelvis, 52 contain chest, 1 contains legs, and 2 contain head-neck.
3.2 Preprocessing
Each image was automatically preprocessed to include the liver and mask out all other regions, as previously described.
Fig. 1. CT preprocessing.
Regions outside the liver ROI were subsequently erased by replacing them with voxel intensities resembling normal liver tissue, as detailed below. Two volumes were generated. The first was a contrast-improved CT (CICT), obtained by windowing to Hounsfield values in the range [–100, 400]. In the second volume (CICT-Hist), histogram equalization was applied exclusively to the liver voxels of the first volume, maximizing tumor-to-liver contrast. However, this has the undesired effect of also increasing contrast between normal liver voxels in the absence of tumors. The two volumes were then combined at the slice level, forming multi-channel 2D slice stacks. This configuration maintains actual intensity and emphasizes contrast for the subsequent tumor segmentation stages.
Reducing Dataset Intensity Diversity: For this purpose, intensity standardization was applied. However, rather than subtracting the mean and dividing by the standard deviation over all images of the training set, the mean and standard deviation used for this adjustment were those of the liver's normal tissue. Each image's voxels were rescaled to a standard intensity based on the intensity distribution derived from normal-tissue voxel intensities in the training images. Since the training set contains diverse tumor contrast, relying on the absolute liver intensity without standardization can create instability due to the absence of an intensity reference. Thus, a standard intensity was obtained for each voxel by subtracting the normal-tissue mean (μ), dividing by its standard deviation (σ), and finally scaling the range [μ − σ, μ + σ] to [0, 1]. Hence, the normal-tissue voxel intensity distribution was taken as the reference to standardize upon. This helps
to map tumor intensity as an abnormality relative to the statistical distribution of normal tissue. Two means and two standard deviations were used (for CICT and CICT-Hist, respectively).
Neutralizing Impact of Contributions from Outside the Liver ROI: Given the convolutional processing of the image and the fact that the system distinguishes two classes (tumor vs. background), most voxels outside the liver region will contribute significantly to the background class, so the values within this region must be set carefully. In general, tumor intensity is lower than that of normal tissue, and setting the background to black can harm the learning because black is closer to tumor intensity. To avoid texture bias in the learning, it is therefore essential that the region outside the liver's ROI shares similar texture properties with normal tissue (both belong to the "background" class). Accordingly, the background of CICT was set to a solid intensity of 0.5 (after image intensity scaling), while that of CICT-Hist was set to uniform noise with intensity values 0, 0.5 and 1; Fig. 1 shows the scheme. Once both volumes were prepared, their corresponding slices were stacked, and the 2D slices were passed to U-Net to segment tumor regions.
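A minimal sketch of this per-slice preparation is given below, assuming NumPy arrays, a precomputed liver mask and normal-tissue statistics (μ, σ) estimated from the training set. The function name, the rank-based histogram equalization and the mapping of equalized values back to the window range are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def standardize_slice(hu_slice, liver_mask, mu, sigma, mu_h, sigma_h, rng=None):
    """Build the two-channel input described in Sect. 3.2 for one transversal slice.

    hu_slice      : 2D array of Hounsfield units
    liver_mask    : boolean 2D array, True inside the segmented liver
    mu, sigma     : normal-tissue statistics of the windowed volume (CICT)
    mu_h, sigma_h : the same statistics for the equalized volume (CICT-Hist)
    """
    rng = rng or np.random.default_rng()

    # Contrast-improved CT: window to [-100, 400] HU.
    cict = np.clip(hu_slice, -100, 400).astype(np.float32)

    # CICT-Hist: histogram equalization applied to liver voxels only.
    cict_hist = cict.copy()
    liver_vals = cict[liver_mask]
    if liver_vals.size:
        ranks = np.searchsorted(np.sort(liver_vals), liver_vals, side="right")
        cict_hist[liver_mask] = ranks / liver_vals.size * 500.0 - 100.0  # back into window range

    def to_standard(channel, m, s):
        # Map the normal-tissue range [m - s, m + s] onto [0, 1].
        return (channel - (m - s)) / (2.0 * s)

    cict = to_standard(cict, mu, sigma)
    cict_hist = to_standard(cict_hist, mu_h, sigma_h)

    # Neutralize contributions from outside the liver ROI.
    outside = ~liver_mask
    cict[outside] = 0.5                                          # solid mid-gray
    cict_hist[outside] = rng.choice([0.0, 0.5, 1.0],             # uniform 3-level noise
                                    size=outside.sum())

    return np.stack([cict, cict_hist], axis=-1)                  # 2-channel slice for U-Net
```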
3.3 Tumor Segmentation
This module takes 2D stacked transversal slices one at a time and segments them using the U-Net architecture. A weighted dice coefficient was used to overcome class imbalance. The initial learning rate was set to 10^−4, with a drop rate of 80% every two epochs for a maximum of 10 epochs. Training used a mini-batch size of 4 stacked images and was applied exclusively to slices containing tumors. Since low intensities may be encountered around organ boundaries (due to inaccurate segmentation), an optional step applying binary erosion (stripping a thin layer from the outside surface of the liver segmentation) was implemented. The eroded images may exclude false-positives but possibly also small true-positive tumors residing on the liver boundaries; hence, erosion can reduce the detection accuracy. In the results section, we present both configurations, with and without erosion.
3.4 Performance Metrics
For this study, two performance metrics were used. The first metric is the dice coefficient, defined as twice the intersection volume between segmentation and ground-truth, divided by the sum of their volumes. Since images differ in slice thickness and hence in axial pixel height, a normalized average is needed so that two tumors with the same physical volume and segmentation contribute equally even when taken from images with different resolutions. For instance, a tumor from an image with a fine slice thickness of 0.7 mm per height pixel can contribute ten times as much as a tumor of the same size from an image with a slice thickness of 7 mm per height pixel. The dice depends on the voxel counts of the volume. In all images, the x-y resolution is usually 512 × 512 pixels, with the liver covering approximately 250 × 250. However, the height (z) resolution varies between images, and that changes the proportional aspect ratio. This results in inaccurate dice values even if the tumors' actual contribution is the
same. To minimize and stabilize the volatility of this effect, the post-segmentation height was normalized to a common scale that reflects an approximate average amongst the dataset images. Height was normalized using the nearest-pixel method. Hence, the dice is more representative of true segmentation performance, with no bias caused by the image resolution. The dice score was calculated for the whole test set following this definition.
To study the dependence of dice performance on tumor size, the two were plotted against one another. Segmented tumors were grouped into three arbitrarily selected size groups: small, medium, and large, corresponding to (x ≤ 5000), (5000 < x ≤ 20,000) and (x > 20,000), respectively, where x is the number of voxels in the segmented tumor region. For each size range, the plot included a scattered distribution with each point representing the dice for a single segmented region. In addition, an "approximate" dice was calculated as a size-based weighted average of all singular regions' dices within the group. We consider this average approximate in relation to the overall dice defined above. The range dice can indicate the behavior of the group in relation to the overall dice.
The other metric was the tumor detection accuracy, which is the ratio of detected ground-truth positives to total ground-truth positives for the test set. A segment that intersected a ground-truth region was considered a true-positive. A false-positive tumor is a segmented region that has no intersection with the ground-truth. The number of true-positive segmented regions was less than the actual positives due to the network's detection capacity and segmentation efficiency for this problem, where in the latter case multiple ground-truth regions may intersect with a single true-positive segmented region.
3.5 Region Analysis and Performance Conditions
To study the limitations of the segmentation module, for each tumor size group we measured the dice using four consecutive stages of tumor exclusion conditions:
1) Base value – includes all detected tumors within the liver region, representing results without any false-positive removal.
2) False-positive exclusion – excludes all false-positive detected tumors, representing the best possible dice if all false-positives were removed.
3) Speckle exclusion – excludes all detected tumors less than 500 voxels in size, a simple criterion by which tumors that are most likely to be noise are removed.
4) Low-overlap exclusion – excludes detected tumors with a relatively low dice (poor overlap with the ground-truth), representing detected tumors that disagreed with the ground-truth and hence may be harder to detect (e.g., low contrast). The aim of this exclusion is to isolate regions with poor performance in order to study their properties, which may include contrast, intensity, and location. This can be useful for designing a system specialized in processing similar instances.
We report the number of removed and remaining regions after applying each exclusion condition.
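A sketch of how the two metrics described above could be computed is shown below. The nearest-neighbour height normalization and the intersection-based detection rule follow the text; the 1.5 mm reference thickness mirrors the dataset's mean slice thickness (the actual reference scale is not stated), and the (z, y, x) array layout and function names are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom, label

def normalized_dice(seg, gt, slice_thickness_mm, target_thickness_mm=1.5):
    """Dice after resampling the axial (z) dimension to a common slice thickness
    with nearest-neighbour interpolation (Sect. 3.4). Arrays are (z, y, x) binary masks."""
    factor = slice_thickness_mm / target_thickness_mm
    seg = zoom(seg.astype(np.uint8), (factor, 1, 1), order=0)   # order=0 -> nearest pixel
    gt = zoom(gt.astype(np.uint8), (factor, 1, 1), order=0)
    inter = np.logical_and(seg, gt).sum()
    denom = seg.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

def detection_accuracy(seg, gt):
    """Fraction of ground-truth tumor regions touched by at least one segmented region."""
    gt_lbl, n_gt = label(gt)
    if n_gt == 0:
        return 1.0
    detected = sum(1 for i in range(1, n_gt + 1)
                   if np.logical_and(gt_lbl == i, seg).any())
    return detected / n_gt
```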
4 Results and Analysis
Table 1 shows the results based on different conditions. Following automatic liver segmentation, the tumor segmentation dice achieved a base value (without false-positive removal) of 71.11%. Manual (or ideal) organ segmentation starts with the ground-truth liver, while automatic segmentation is based on the liver segmentation produced by a trained network. The tumor segmentation in both cases was performed using an identical procedure; the only difference is the starting liver segmentation. The automatic results are lower than the ideal ones but comparable. Erosion strips some amount from the surfaces for the sake of reducing false-positives, and it hurts the accuracy of the automatic segmentation. The difference between manual and automatic for both dice and detection accuracy is due to the variations in the liver segmentation. These points are discussed later.

Table 1. Tumor dice and detection accuracy values following manual and automatic (trained network) liver segmentation methods. For both methods, results are presented with and without binary erosion. (n = 50 test images)

Liver segmentation | Dice (%), all positives | Dice limit (%), FPs removed | Detection accuracy (%)
Manual liver segmentation | 76.17 | 81.83 | 87.31
Manual liver segmentation – erosion | 77.50 | 81.34 | 87.31
Automatic liver segmentation | 69.20 | 76.53 | 83.40
Automatic liver segmentation – erosion | 71.11 | 76.44 | 79.25
In some sense we have dealt with the false-positives as a separate, independent problem. With all false-positives included, the result gives a lower limit to the performance, and when all are removed, the dice gives an upper limit to the segmentation performance of the segmentation module, independent of how a false-positive removal module operates. In other words, the test reflects the segmentation learning capacity for true tumors. Table 2 shows the number of detected tumors and false-positives based on size and segmented-region exclusion conditions. The table shows the amounts removed, and the amounts remaining, for each exclusion applied, as explained in the following figures. Table 3 shows the dice under various conditions. The dice improved from 25.67% to 60.61% for the small tumors when false-positives were removed, reflecting their large number in that range. By contrast, removing the 157 extremely small regions (≤500 voxels) from the base segmentation results yielded only a modest improvement (Table 3).

Table 2. Number of detected tumors and false-positives based on size and segmented-region exclusion conditions. (n = 50 test images; entries marked "–" were not recovered)

Size group | Detected | False-positives | True-positives | TPs removed (≤500 vx) | TPs remaining | TPs removed (low overlap) | TPs remaining
Automatic liver segmentation
Very small (≤500) | 1336 | – | – | – | – | – | –
Large (>20K) | 19 | 1 | 18 | 0 | 18 | 4 | 14
Total | 1507 | 1258 | 249 | 157 | 92 | 18 | 74
Manual liver segmentation
Very small (≤500) | 1547 | 1388 | 159 | 159 | 0 | 0 | 0
Small (500 < x ≤ 5K) | 200 | 113 | 87 | 0 | 87 | 16 | 71
Medium (5K < x ≤ 20K) | 45 | 6 | 39 | 0 | 39 | 9 | 30
Large (>20K) | 32 | 0 | 32 | 0 | 32 | 5 | 27
Total | 1824 | 1507 | 317 | 159 | 158 | 30 | 128
For medium and large sizes, removing false-positives improved the dice by ~9 and ~5 percentage points, respectively. Removing regions with dice less than the moving average minus sigma further improved the dice by ~4 points for the medium and ~4 points for the large sizes. Considering only false-positive exclusion, the total approximate dice improved from ~68% to ~80%. With all exclusions applied (regions ≤500 voxels and regions with dice below the moving average minus sigma), the overall approximate dice reached ~85% for automatic and ~87% for manual liver segmentation. The false-positives were heavily concentrated in the very small size range, with over 92% of them having fewer than 500 voxels regardless of the liver segmentation method. Without false-positive exclusion, removing these very small regions improved the mean dice for the small size category by ~8% and ~6% for automatic and manual liver segmentation, respectively (Table 3).
Table 3. Average dice for different tumor sizes under certain exclusions. (n = 50 test images)

Size group | Base segmentation result (no exclusion) | False-positives excluded | + Speckles excluded (regions of size ≤ 500) | + Low-overlap regions excluded (dice < mov avg − mov σ)
Automatic liver segmentation
Small (≤5K) | 25.67 | 60.61 | 63.52 | 73.85
Medium (5K < x ≤ 20K) | 69.54 | 78.75 | 78.75 | 83.08
Large (>20K) | 76.86 | 82.29 | 82.29 | 86.26
Overall | 68.15 | 80.14 | 80.58 | 84.96
Excluding regions ≤ 500 vx following the base segmentation results improved the dice from 25.67% to 33.28%.
Manual liver segmentation
Small (≤5K) | 21.74 | 54.40 | 57.17 | 68.37
Medium (5K < x ≤ 20K) | 61.92 | 71.91 | 71.91 | 84.95
Large (>20K) | 85.43 | 85.43 | 85.43 | 88.64
Overall | 75.85 | 82.49 | 82.78 | 87.50
Excluding regions ≤ 500 vx following the base segmentation results improved the dice from 21.74% to 27.49%.
The behavior of the tumor dice following manual and automatic liver segmentation is summarized in Tables 1, 2 and 3, and compared with respect to the moving-average exclusion condition in detail in Figs. 3, 4, 5, 6, 7, 8, 9 and 10. The numbers of detected tumors differ between the manual and automatic liver segmentation, because the tumor segmentation may combine two regions from the ground-truth, or vice versa. With a different liver ROI, the tumor segmentation algorithm is expected to act differently. Another, more likely reason is that the automatic segmentation could have missed portions of the liver, which was reflected in the lower tumor detection accuracies (Table 1).
(a) Automatic segmentation: ~94% have size ≤ 500
(b) Manual liver segmentation: ~92% have size ≤ 500
Fig. 2. False-positive distribution, automatic and manual liver segmentations.
Figures 3, 4, 5 and 6 show scatter plots of dice value relative to segmented tumor size for each tumor, for the four tumor exclusion conditions, using the manual liver segmentation. Figure 3 shows all data points including the false-positives. A total of about 300 true tumors were detected and ~1500 false-positives. Of all false-positives in the tumor range (0–350,000 voxels), ~92% were ≤500 vx (small).
and the values for the automatic segmentation and the manual (ideal) segmentation underscore the same behavioral point. It is clear that the medium and large size tumors are stable and most of the true-positives are within sigma. The tumor segmentation is generally good for most of the sizes, and false-positives are more correlated to the small tumor size.
Fig. 6. Dice vs. segmentation size, manual organ segmentation, true-positives > 500vx and > moving average–sigma.
5 Discussion
In this work we developed a novel approach for automatically detecting liver tumors in CT scans. Our approach achieved a 71% dice coefficient without any means of removing false-positive detected tumors. This performance is similar to the state-of-the-art metrics reported in the previous literature using the same LiTS dataset, where little information is given on how the false-positives were dealt with or on the detection accuracy achieved. With all false-positives manually removed, the dice improved to ~77%, representing the upper limit of performance. With manual liver organ segmentation, the dice improved to ~78%, with a maximum possible dice of ~81%. The results clearly demonstrate that accurate organ segmentation can aid tumor detection within the organ. Furthermore, false-positive removal can improve performance. The false-positive problem is not simple for the liver, because of the texture and intensity similarities that normal and tumor tissues share.
Fig. 7. Dice vs. segmentation size, automatic organ segmentation, all-positives.
Since an ideal 100% false-positive removal is practically not feasible, the bounds within which a false-positive removal module can perform lie somewhere between the base case of segmentation with all positives included and the upper limit of all false-positives removed, depending on its efficiency. Preliminary results for such a module were reached, but are not presented in this work. The main observation about false-positives is that they are highly localized in the small size range (≤500 voxels), with very few registered in the medium and large ranges. As shown in Fig. 2, over 95% of false-positives have size ≤500 vx (small).
Another solution that may address the problem without exclusions would be to train a dedicated system to handle these small regions, making it focus on very fine details to distinguish true from false positives, though that may reduce accuracy. Again, it is worth noting that the impact of these small false-positives on the overall dice is negligible because of their minuscule volume.
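As a concrete illustration of the speckle exclusion discussed above, the following sketch removes connected regions of at most 500 voxels from a binary tumor mask. The use of scipy connected-component labelling and the function name are assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.ndimage import label

def remove_speckles(tumor_mask, min_voxels=500):
    """Drop segmented regions of at most `min_voxels` voxels (speckle exclusion);
    larger regions are kept unchanged. `tumor_mask` is a binary 3D array."""
    lbl, n = label(tumor_mask)
    if n == 0:
        return tumor_mask
    sizes = np.bincount(lbl.ravel())   # sizes[0] is the background count
    keep = sizes > min_voxels
    keep[0] = False                    # never keep the background label
    return keep[lbl]
```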
Fig. 10. Dice vs. segmentation size, manual organ segmentation, true-positives > 500vx and > moving average – sigma.
Overall, we can say that the current system produces good segmentation for the liver as an organ and for tumors larger than 1000 voxels, with manageable false-positives. Segmentation of true-positive points below the moving-average dice minus sigma may also be handled separately, because they usually reflect noisy images or shape-related factors. Taking these points and understanding their properties may help improve the dice and accuracy beyond removing the false-positives. The CNN based on our methodology has shown good segmentation dice values for the tumor size range 500 to 5000 voxels, although a large fraction of false-positives remains there. For sizes greater than 5000 voxels, both segmentation and discrimination against false-positives were good. For sizes less than 500 voxels, the growth of false-positives dominates. This means that the system is robust against false-positives in the medium-large range. Given that, the current segmentation system can be deployed in a clinical environment to detect medium and large size tumors by excluding regions of size less than 5K voxels, leaving the range below that for further improvements. Although both the manual and the automatically trained organ segmentations follow the same behavior for tumors, the manual (ideal) segmentation produced higher numbers of true-positive and false-positive tumors, but also achieved better detection accuracy (87% vs. 83% automatic). It seems the true-positives are correlated with the false-positives. The true-positive detections from the manual segmentation being greater than those from the automatic (trained)
segmentation could be due to missing small parts in the auto segmentation of the liver. Similarly, the increase in the false-positives can be explained by the same reason.
6 Conclusion
When designing a tumor detection/segmentation system, it is highly recommended to separate three main integrated components: organ segmentation, tumor segmentation and false-positive elimination. With this separation in place, one can analyze the reasons for limitations using the presented methods. Limitations can be caused by the handling of different data groups, each of which requires solutions with certain properties in order to build an integrated system with better generalization capabilities. The work in this article represents the tumor segmentation module, which can be deployed clinically for medium and large size tumors. The identified limitations of this module can be used to develop a false-positive elimination module that allows better results for all sizes. The maximum performance that we calculated if all false-positives were removed allows an approximation of the improvement that such a module can offer.
References
1. Salman, O.S., Klein, R.: Developing an automatic cooperating neural networks and image standardization approach for segmentation of X-ray computed tomography images. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) FTC 2020. AISC, vol. 1288, pp. 390–401. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-63128-4_29
2. Salman, O.S., Klein, R.: Anatomical region identification in medical X-ray computed tomography (CT) scans: development and comparison of alternative data analysis and vision-based methods. Neural Comput. Appl. 32(23), 17519–17531 (2020). https://doi.org/10.1007/s00521-020-04923-6
3. Chen, L., et al.: Liver tumor segmentation in CT volumes using an adversarial densely connected network. BMC Bioinf. 20(16), 587 (2019). https://doi.org/10.1186/s12859-019-3069-x
4. Bilic, P., et al.: The liver tumor segmentation benchmark (LiTS). arXiv preprint arXiv:1901.04056 (2019)
5. Chlebus, G., Schenk, A., Moltz, J., van Ginneken, B., Hahn, H., Meine, H.: Automatic liver tumor segmentation in CT with fully convolutional neural networks and object-based postprocessing. Sci. Rep. 8(1), 1–7 (2018)
6. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28
7. Bi, L., Kim, J., Kumar, A., Feng, D.: Automatic liver lesion detection using cascaded deep residual networks. arXiv preprint arXiv:1704.02703 (2017)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
9. Li, W., Jia, F., Hu, Q.: Automatic segmentation of liver tumor in CT images with deep convolutional neural networks. J. Comput. Commun. 3(11), 146 (2015)
10. Schapire, R.E.: Explaining AdaBoost. In: Schölkopf, B., Luo, Z., Vovk, V. (eds.) Empirical Inference, pp. 37–52. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41136-6_5
11. Liaw, A., Wiener, M.: Classification and regression by RandomForest. R News 2(3), 18–22 (2002)
12. Noble, W.: What is a support vector machine? Nat. Biotechnol. 24(12), 1565–1567 (2006)
13. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)
14. Meng, L., Tian, Y., Bu, S.: Liver tumor segmentation based on 3D convolutional neural network with dual scale. J. Appl. Clin. Med. Phys. 21(1), 144–157 (2020)
15. Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data (2001)
16. Zheng, S., Fang, B., Li, L., Gao, M., Wang, Y., Peng, K.: Automatic liver tumour segmentation in CT combining FCN and NMF-based deformable model. Comput. Methods Biomech. Biomed. Eng. Imaging Vis. 8(5), 1–10 (2019)
17. Image and video denoising by sparse 3D transform-domain collaborative filtering: block-matching and 3D filtering (BM3D) algorithm and its extensions. Tampere University of Technology (2020). http://www.cs.tut.fi/~foi/GCF-BM3D/. Accessed 9 Aug 2020
18. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
19. McInerney, T., Terzopoulos, D.: Deformable models in medical image analysis: a survey. Med. Image Anal. 1(2), 91–108 (1996)
20. IRCAD France (2020). https://www.ircad.fr/research/3d-ircadb-01/. Accessed 9 Aug 2020
21. Christ, P.F., et al.: Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS, vol. 9901, pp. 415–423. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46723-8_48
HEMAYAH: A Proposed Mobile Application System for Tracking and Preventing COVID-19 Outbreaks in Saudi Arabia Reem Alwashmi, Nesreen Alharbi, Aliaa Alabdali, and Arwa Mashat(B) King Abdulaziz University, Rabigh, Saudi Arabia [email protected]
Abstract. Like many countries worldwide, numerous companies and businesses in the Kingdom of Saudi Arabia (KSA) have closed to stop the spread of COVID-19, resulting in losses of jobs and income. To allow reopening, the Saudi Ministry of Health is actively developing technologies to facilitate early detection. Here, a mobile application system for tracking COVID-19 is proposed that connects a scanner with databases from the Ministry of Health and hospitals to check the status of an individual before entering a location or socializing with others. The main goals of the system are to protect the economy of the country from the impact of COVID-19 by allowing companies and sectors to function continually; to allow cleared individuals to live normally by giving them the ability to access government facilities such as hospitals, schools, mosques, and universities without fear of infection; to preserve social communication between families; and to provide a set of detailed data to aid COVID-19 pandemic analyses, including daily situation updates, pandemic curves and statistics. Keywords: COVID-19 · Tracking · Mobile application · Saudi Arabia
1 Introduction
The new coronavirus disease (COVID-19) has spread rapidly globally since the first case was reported in Wuhan, China, in late December 2019 [1]. Globally, by 17 April 2020, more than 2,196,109 confirmed cases of COVID-19 and more than 149,024 deaths had been reported. Within the same period, the Kingdom of Saudi Arabia (KSA) recorded 7,142 confirmed cases of COVID-19 and more than 87 deaths [2]. The World Health Organization (WHO) officially declared pandemic status on 11 March 2020 and has suggested that countries prepare for early detection, isolation, contact tracing and active surveillance [3, 4]. The government of KSA implemented serious and strict measures to prevent and control the spread of the pandemic among citizens and residents in the KSA. These protocols include the implementation of a 14-day quarantine for returning travelers [3]; the suspension of international flights, sport events, and entertainment activities; the closure of all malls, supermarkets, schools, universities and mosques, including the two Holy Mosques in Makkah and Madinah; and the imposition of a curfew.
These measures in KSA have had a variety of economic and social impacts. Many Saudi companies and businesses shut their branches in KSA until further notice in solidarity with efforts in the Kingdom to stop the spread of COVID-19. The financial impact on companies continues and cannot yet be assessed [5]. Students are continuing their education through distance learning, which may not suit all students due to the need for computer devices and an Internet connection; in addition, students are missing face-to-face communication with teachers and may be interrupted in their education by technical problems [6]. Although medical services at hospitals are available, the persistent fear of the risk of infection with the virus may hinder patients from seeking treatment. Perhaps most importantly, people are losing their jobs and income. To ameliorate these impacts, the government has guaranteed the availability of immediate food resources, medicines, and health services, in addition to protecting government facilities and ensuring that their work continues, even remotely.
Early detection and isolation are the highest priorities in any disease outbreak [7] in order to reduce the number of individuals transmitting the disease. These efforts require monitoring tools, and many countries have taken advantage of existing tools or begun developing new applications. After experiencing the SARS outbreak in 2003, in 2004 the government of Taiwan began preparing to respond to the next potential crisis by establishing the National Health Command Center (NHCC), which focuses on large outbreaks and acts as the operational unit for direct communications with authorities [8]. When COVID-19 appeared in China, Taiwan acted quickly by integrating its health insurance database with customs and immigration databases to begin big data analytics. They also used QR code scanning and online reporting of travel histories. Those at high risk were monitored electronically, tracked and quarantined at home through their mobile devices [8]. In Japan, the company Bespoke Inc. launched an AI Chatbot that provides up-to-date information about the coronavirus outbreak, a symptom checker and preventative measures [4]. At Johns Hopkins University, the Center for Systems Science and Engineering (CSSE) developed an interactive online dashboard that visualizes and tracks coronavirus cases in real time [9]. This dashboard, which was shared publicly to help researchers, authorities and the public, reports cases in China, the USA, Canada and Australia and shows the location and number of confirmed cases, deaths and recoveries. During epidemics, decisions must be made quickly. The likelihood of future pandemics highlights the urgency of developing and investing in surveillance systems. Data sharing can aid the production of up-to-date epidemic forecasts [10], and new technologies to monitor, control and track the spread of viruses in real time are needed.
Despite the Kingdom's quick measures and decisions, COVID-19 is still spreading rapidly in KSA. Consequently, the Saudi Ministry of Health is applying technologies that can facilitate tracking of the spread of this virus and controlling infection among the population. An existing application in KSA that people are already familiar with is Mawid, an electronic service provided by the Saudi Ministry of Health to the public since 2018 [11]. This application allows patients to book appointments across primary health care centers and manage their appointments and referrals.
In 2020, a new, free feature was added for coronavirus self-assessment that asks users five questions. This feature will support the Saudi Ministry of Health by rapidly providing a huge amount of data that can be used to identify high-risk and low-risk patients. Additional applications introduced by the
government in collaboration with the Ministry of Health include Tawakkalna and Tabaud, which are designed to help protect individuals from becoming infected by COVID-19. Tawakkalna is the official application to facilitate the mobility of individuals, employees and families throughout the Kingdom during the COVID-19 pandemic. Tawakkalna contributes to reducing the spread of the virus by preventing the issuance of permits to enter infected places. The application also allows users to report suspected cases of COVID-19 and request necessary medical assistance for themselves or others. The application has been used widely, especially to request emergency exit permits during times of curfew. Tabaud contributes to protection against infection by sending alert notifications to users when they are too close to a person who has been infected by the virus in the past 14 days. The application uses Bluetooth technology and saves data on the device; consequently, an internet connection is needed only to receive updates and notifications. In cooperation with the Ministry of Health, the government of KSA has emphasized the importance of downloading both Tawakkalna and Tabaud due to their complementary roles in creating a healthy environment for individuals and families. Both applications can significantly support the government of KSA in containing the COVID-19 pandemic. However, neither application reveals whether a specific individual can be safely interacted with. It is important to know whether employees, customers, friends, and students are clear of COVID-19 in order to ensure safety for all in a room, office or facility. To address this need, this study proposes a COVID-19 detection system that connects a scanner application with the Ministry of Health database and hospitals. The aims of the system are fourfold: first, to protect the Kingdom's economy from the impact of COVID-19 by allowing companies and sectors to continue to function; second, to allow healthy, uninfected individuals to live normally by giving them the right to access government facilities such as hospitals, schools, mosques, and universities without fearing infection; third, to preserve social communication between families; and fourth, to create a new, detailed dataset to facilitate COVID-19 pandemic analyses, including daily situation updates, pandemic curves and statistics.
2 Proposed System
The proposed system is a mobile application called Hemayah (Fig. 1). "Hemayah" means "protection" in Arabic. The main goal of the system is to make life easier during a viral pandemic like COVID-19. The application focuses on tracking COVID-19 but can be modified to track any similar disease. The system classifies patient status into six categories: Clear, Healed, Patient, Contacted, AtRisk, and Deceased. "Clear" refers to a healthy individual (marked green). "Healed" refers to an individual who was previously infected and has recovered (marked blue). "Patient" refers to an individual currently sick with the virus (marked red). "Contacted" refers to an individual who has been in contact with an infected patient (marked orange). "AtRisk" refers to an individual who accompanied any Contacted or Patient individual at any checkpoint excluding hospital checkpoints (marked gray). "Deceased" refers to an individual who passed away after being infected by the virus. The application reveals the status of an individual by the corresponding color (Fig. 2).
Fig. 1. Hemayah logo
Fig. 2. Color status
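The six status categories and their display colours listed above can be captured in a small enumeration. The sketch below is illustrative only (the actual apps target iOS and Android); the colour of the Deceased status is not specified in the text and is therefore left undefined.

```python
from enum import Enum

class Status(Enum):
    CLEAR = "green"        # healthy individual
    HEALED = "blue"        # previously infected, now recovered
    PATIENT = "red"        # currently infected
    CONTACTED = "orange"   # has been in contact with an infected patient
    AT_RISK = "gray"       # accompanied a Contacted/Patient individual at a checkpoint
    DECEASED = None        # colour not specified in the text
```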
The development of the system comprises four major objectives: 1) building the infrastructure of the database; 2) building the system solution and planning all required steps to design the scanner application; 3) developing the system solution and programming and testing the scanner application; and 4) analyzing the results and handing over system documentation. To build the database, we will design entity relationship (ER) diagrams and then build a relational database system in the cloud. Hadoop technology will be used to develop and store the system database. To analyze the scanner application, we will design all UML diagrams. We will then build an iOS application using the Xcode IDE and an Android application using the NetBeans IDE. The last objective will be achieved after running the application and filling the database with the necessary data to aid subsequent COVID-19 pandemic analyses, including daily situation updates, pandemic curves and statistics. Fulfilling these four objectives in the development of the COVID-19 detection system will allow the four main goals outlined above, which help society overcome the economic and social effects of this global pandemic, to be realized. The first goal, to protect the economy by keeping businesses running after isolating any individual at risk of spreading the virus, and the second goal, to allow clear (healthy) individuals to live normally by giving them free access to all facilities in the city without fear of infection, will be achieved by creating a database system that contains the data of all individuals moving around the city as well as their names and pictures (National ID or Iqama). Then, several checkpoints will be equipped with scanners loaded with the mobile application connected to the database. Before entering any public place, each individual's ID card must be scanned by the checkpoint's authority to read the ID number. Once the ID number is read, the scanner will compare it with the ID holder's stored status in the database and generate one of four possible responses. "Allow Access" will be displayed if the returned status is either Clear or Healed, thus giving the individual permission to enter the facility with no risk of spreading the virus. The second message is "Deny Access". This message will be generated by the system if the status returned from the database is either Patient or Contacted and will prevent the individual from entering the facility. The checkpoint authority can contact the police to force this individual to stay in isolation since this individual is at high risk of spreading the virus. The third message is "No Data". This message means that data for this individual is not saved in the database. In this case,
the system will save this individual’s information in the system under the status Clear. Finally, if the returned status from the system is Patient or Contacted and the scanned individual has company, all people accompanying the scanned individual will be scanned for their data and given the status “AtRisk”. This feature will help authorities keep track of any possible risks in society by detecting the status of these AtRisk individuals and preventing them from entering public places. Businesses will also be able to use the application to check the statuses of customers, employees, staff and other individuals entering the facility. The third goal of this system is to preserve social communication between families. This goal will be achieved by building a mobile system that is available to the public. People within the same family will be alerted if someone in their surroundings is Contacted or even AtRisk, making it possible to avoid contact with people who may have COVID-19 and giving Clear individuals the opportunity to live a normal life. Scanned individuals could also include delivery employees with incoming packages or mail. The last goal of the system is to build a detailed dataset that includes all individuals moving within the city for subsequent COVID-19 pandemic analyses, including daily situation updates, pandemic curves and statistics. The data of every individual scanned at any checkpoint will be maintained within this dataset. The saved information will contain the times and locations entered by every individual. The database will be linked with hospitals, and all data will be available for study by researchers to determine accurate public outcomes for further study of the effects of the virus on society.
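To make the scanner responses described above concrete, the following sketch maps a scanned ID to one of the responses, registering unknown individuals as Clear. The `db` mapping and the function name are hypothetical, and the sketch reuses the Status enumeration shown earlier; the group rule for accompanying visitors is sketched separately further below.

```python
def scan_visitor(db, national_id):
    """Return the checkpoint response for one scanned National/Iqama ID.
    `db` is a hypothetical store mapping IDs to Status values."""
    status = db.get(national_id)
    if status is None:
        db[national_id] = Status.CLEAR   # "No Data": register the individual as Clear
        return "No Data"
    if status in (Status.CLEAR, Status.HEALED):
        return "Allow Access"
    if status in (Status.PATIENT, Status.CONTACTED):
        return "Deny Access"             # checkpoint authority may notify the police
    return "Deny Access"                 # AtRisk individuals are also kept out of public places
```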
3 System Implementation
3.1 Different Users of the System
In this system, any user who logs into the system and checks the statuses of other individuals is called a host; the individuals whose statuses are being scanned are called visitors. In the proposed system, there are four different types of hosts: individuals, commercial checkpoints, police checkpoints and hospitals (Fig. 3). In the following sections the scanning process of each host type is explained in detail.
3.1.1 Individuals
Using Hemayah, individuals can explore their own status or those of their visitors (Fig. 4). Individuals first log into the system by entering their National or Iqama ID. A verification code is then sent to the mobile number registered in the system. Once the correct verification code is entered in the system, the host will be able to start the scanning process. The scanning process begins by entering or scanning the visitor's National or Iqama ID. The host will choose between two options before the system reveals the status of the scanned visitor. The first option is to send a verification number to the visitor's mobile device, and the other is to skip sending any verification number. If the host chooses to send a verification code to the visitor's cellphone, then the host must enter the correct code before accessing the visitor's status. Otherwise, if the host chooses to skip sending any verification code, then the system will send a notification message to the visitor
Fig. 3. Overview of the proposed COVID-19 detection system
about this process. The system will also show a warning message to the host about the possible consequences of any illegal access to the information of others without their consent. Once the host agrees to the warning message, the visitor’s status will be shown. After the current status of the visitor is revealed to the host, the system will ask the host if this visitor has company or not. If the visitor has any company, their statuses will be scanned following the same process, and the scanning data will be saved in the same group for all visitors coming together. If the statuses of any of the individuals in the same group are found to be Contacted or Patient, the system will change the statuses of all other individuals in the same group from Clear or Healed to AtRisk. This change will inform the health authority that this group has been in contact with people infected with the virus, and local mandates on self-isolation can be applied. Figure 5 illustrates the flow chart of the individual scanning process for incoming visitors. 3.1.2 Checkpoints There are three types of checkpoints: commercial and police checkpoints and hospitals. At a commercial checkpoint, the host can be any employee working in the company. Before using the system, the owner or authorized person whose data is registered in the Ministry of Labor database can activate the facility within Hemayah. The owner must first enter their National or Iqama ID to validate their identity. The system will send a verification code to the registered mobile phone, and the host must re-enter the correct verification code. Once the identity of the owner is validated, the system will allow the
Fig. 4. Hemayah mobile application screens (Individual)
owner to register all stores belonging to this company. For each store, the owner must register the National or Iqama IDs of all employees who can scan visitors’ data in that store. To scan the statuses of visitors entering a store, the employees of that store first enter their National or Iqama ID to log in as hosts. After verifying their identities, they can start the scanning process for any incoming visitor to the store. After the host enters or scans the National or Iqama ID of the visitor, the system will retrieve the status of the scanned visitor. According to that status, the visitor will be allowed or denied entry to
Fig. 5. Flowchart for individuals in the Covid-19 detection system
the store. If the scanned visitor is part of a group, then the host will scan the National or Iqama IDs of all people accompanying the current visitor. At any time, if the status of any person in this group is Patient, Contacted or AtRisk, the system will change the status of Clear people within this group to AtRisk to alert the health authorities of possible new spreading of the virus. Once the scanning process of the current group is finished, the host can return to the main page to start a new scanning process. Figure 6 shows the flowchart of the commercial checkpoint scanning process. At a police checkpoint, the host can be any employee working in the Ministry of Interior, such as a police officer. The police officer at the checkpoint must first enter his National ID for validation to log into the system. Once his identity is verified, he must enter or scan the ID number of the checkpoint. If he is authorized to scan at this checkpoint, he will be able to start the scanning process of any visitor to this checkpoint. The scanning process for visitors to police checkpoints is similar to that at commercial checkpoints. The police officer must first enter or scan the visitor’s National or Iqama ID to reveal their status. If the visitor is part of a group, all individuals within that group will be scanned together. If the status of any is Patient, Contacted or AtRisk, the statuses of all Clear people within this group will be changed to AtRisk. Once the scanning process of the group is complete, the system will return to the main page to start a new scanning process.
Fig. 6. Flowchart for Commercial checkpoint scanning using the proposed COVID-19 detection system
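The group rule applied at commercial and police checkpoints, as described above, can be sketched as follows. The `db` mapping is the same hypothetical store used in the earlier sketches, and the comment on Healed members reflects the individual-host variant of the rule.

```python
def propagate_at_risk(db, group_ids):
    """If anyone in a scanned group is Patient, Contacted or AtRisk, re-label every
    Clear member of the group as AtRisk (for individual hosts, the text also
    downgrades Healed members). Returns True if the rule was triggered."""
    statuses = {gid: db.get(gid, Status.CLEAR) for gid in group_ids}
    trigger = any(s in (Status.PATIENT, Status.CONTACTED, Status.AT_RISK)
                  for s in statuses.values())
    if trigger:
        for gid, s in statuses.items():
            if s is Status.CLEAR:
                db[gid] = Status.AT_RISK
    return trigger
```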
Police checkpoints can be added, deleted, or modified in the proposed system. To add new police checkpoints, an authorized individual in the Ministry of Interior can log in using their National ID. After successfully validating their identity, the authorized person can add new checkpoints. Each checkpoint must be linked with the police officers who have the authority to start the scanning process at this checkpoint. The authorized person at the Ministry of Interior can also update the data of the checkpoints or police officers linked to any checkpoint. Figure 7 shows the flowchart of the police checkpoint scanning process.
Fig. 7. Flowchart for police checkpoint scanning using the proposed COVID-19 detection system
3.1.3 Hospitals A hospital can serve either as a commercial checkpoint or a health authority. As a health authority, a hospital employee can change the status of any individual based on lab results. For example, the status Contacted or AtRisk can be changed to Patient, Clear or Deceased according to lab results or if the individual has passed away. Similarly, Patient can be changed to Healed or Deceased. In addition, the system will automatically change AtRisk and Contacted statuses to Clear after 14 days of the last registered contact with an individual with the status of Patient if the health authorities do not change the status using the system. A hospital checkpoint can also be used to check the statuses of visitors following the same process as for the commercial checkpoint in Sect. 3.1.2.
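A minimal sketch of the health-authority transitions and the automatic 14-day reset described above is given below. The transition table mirrors the text; the function names, the `last_contact` store and the datetime handling are illustrative assumptions.

```python
from datetime import datetime, timedelta

# Status changes a health authority may apply, per Sect. 3.1.3.
ALLOWED_TRANSITIONS = {
    Status.CONTACTED: {Status.PATIENT, Status.CLEAR, Status.DECEASED},
    Status.AT_RISK:   {Status.PATIENT, Status.CLEAR, Status.DECEASED},
    Status.PATIENT:   {Status.HEALED, Status.DECEASED},
}

def hospital_update(db, national_id, new_status):
    """Apply a lab-result-based status change, rejecting transitions not listed above."""
    current = db.get(national_id, Status.CLEAR)
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"{current} -> {new_status} is not permitted")
    db[national_id] = new_status

def auto_clear(db, last_contact, now=None):
    """Automatically reset AtRisk/Contacted to Clear 14 days after the last registered
    contact with a Patient, if no authority has changed the status in the meantime."""
    now = now or datetime.utcnow()
    for nid, status in list(db.items()):
        if status in (Status.AT_RISK, Status.CONTACTED):
            contact = last_contact.get(nid)
            if contact and now - contact >= timedelta(days=14):
                db[nid] = Status.CLEAR
```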
3.2 The Database The data of this system will be migrated from two different sources. The personal information of individuals in the country and the information of policemen and police checkpoints will be saved in the Hemayah database from the Ministry of Interior or the Ahwal database. The data of commercial checkpoint authorities will be retrieved from the Ministry of Labor database. Figure 8 shows the proposed database of Hemayah.
Fig. 8. COVID-19 detection system database
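Figure 8's full entity-relationship design is not reproduced here; the sketch below only illustrates, with hypothetical field names, the core records implied by the text: individuals with a status, checkpoints with their authorized hosts, and scan events with time, place and group membership.

```python
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Individual:
    national_or_iqama_id: str
    name: str
    photo_ref: str              # National ID / Iqama picture
    status: Status              # one of the six categories above

@dataclass
class Checkpoint:
    checkpoint_id: str
    kind: str                   # "individual", "commercial", "police" or "hospital"
    authorized_host_ids: list[str]

@dataclass
class ScanEvent:
    visitor_id: str
    checkpoint_id: str
    scanned_at: datetime
    group_id: str | None        # visitors scanned together share a group id
    response: str               # "Allow Access", "Deny Access" or "No Data"
```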
4 Conclusion and Future Work The outbreak has already affected the economy of KSA [4], and the proposed application is designed to help reduce the consequences of the pandemic. Hemayah has great
potential as an application and can help authorities track and control the spread of the virus without affecting businesses and the economy. Hemayah can be modified for use in other outbreaks as well. Before completing the mobile application project, the system will be modeled using the Alloy Modeling Language to ensure the system's accuracy. Acknowledgment. The authors acknowledge the financial support provided by King Abdulaziz City for Science and Technology (General Directorate for Financing and Grants) to King Abdulaziz University – Rabigh to implement this work through the Fast Track Program for COVID-19 Research, Project No. 5-20-01-009-0092.
References
1. Zhu, N., et al.: A novel coronavirus from patients with pneumonia in China, 2019. New Engl. J. Med. (2020). https://doi.org/10.1056/NEJMoa2001017
2. Saudi Center for Disease Prevention and Control. https://covid19.cdc.gov.sa/ar/daily-updates-ar/. Accessed 5 Jul 2020
3. Ramesh, N., Siddaiah, A., Joseph, B.: Tackling corona virus disease 2019 (COVID 19) in workplaces. Indian J. Occup. Environ. Med. 24(1), 16–18 (2020)
4. Sohrabi, C., et al.: World health organization declares global emergency: a review of the 2019 novel coronavirus (COVID-19). Int. J. Surg. 76, 71–76 (2020)
5. The list of Saudi companies that shut down branches due to coronavirus outbreak. Argaam. https://www.argaam.com/en/article/articledetail/id/1357294. Accessed 7 Jul 2020
6. O'Lawrence, H.: A review of distance learning influences on adult learners: advantages and disadvantages. In: Proceedings of the 2005 Informing Science and IT Education Joint Conference. Citeseer (2005)
7. Al-Tawfiq, J., Memish, Z.: COVID-19 in the eastern Mediterranean region and Saudi Arabia: prevention and therapeutic strategies. Int. J. Antimicrobial Agents (2020). https://doi.org/10.1016/j.ijantimicag.2020.105968
8. Wang, C., Ng, C., Brook, R.: Response to COVID-19 in Taiwan: big data analytics, new technology, and proactive testing. Am. Med. Assoc. 323(14), 1341–1342 (2020)
9. Dong, E., Du, H., Gardner, L.: An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect. Dis. (2020). https://doi.org/10.1016/S1473-3099(20)30120-1
10. Buckee, C.: Improving epidemic surveillance and responses: big data is dead, long live big data. Lancet Digit. Health (2020). https://doi.org/10.1016/S2589-7500(20)30059-5
11. Ministry of Health. https://mawidstf.moh.gov.sa/. Accessed 12 Jul 2020
Data Security in Health Systems: Case of Cameroon Igor Godefroy Kouam Kamdem1(B) and Marcellin Julius Antonio Nkenlifack2 1
Dschang University, Research Unit in Fundamental Informatics, Engineering and Applications (URIFIA), Dschang, Cameroon [email protected] 2 Dschang University, Research Unit in Fundamental Informatics, Engineering and Applications (URIFIA), B.P. 96, Dschang, Cameroon [email protected]
Abstract. Computer security consists in protecting a computer system against any violation, intrusion, degradation or theft of data within an information system. In health systems, security aims to guarantee the confidentiality, integrity, availability and authenticity of data. Security problems can be observed at the local level, but also during the exchange of health data. The aim of this work is to ensure the security of health data during exchange and local storage. To achieve this objective, a first model is proposed for local data security. This paper distinguishes three data sensitivity classes (sensitive, non-sensitive and low-sensitive); a second model is proposed to ensure the security of medical file transfer, using a synchronous function to authenticate the user before sending the file.
Keywords: Security · E-health · Sensitivity
1 Introduction
Health is a state of well-being that consists not only of a physical state, but also a mental state. E-Health or electronic health is the use of informatics tools in the health field. The transformation of the health sector through digital technology has become over the years a top priority for health institutions. This transformation responds to two fundamental issues: improving the organization by facilitating the work of professionals, and meeting the needs of users and residents. This improvement involves facilitating data exchanges between healthcare institutions, patient management, improving the organization of medical data and many others. The problem of data security in healthcare institutions is observed at two levels: at the local level at a server, and during the exchange of medical records. The problems encountered are generally data theft, illegal access to patient records, falsification of data and others. The question that may be asked is how to ensure effective data security both at the local level and during exchanges. This work attempts to answer this question by proposing a
method of data security at both levels mentioned above. This work therefore focuses on the security of health data and aims to propose a data security model both at the local level and during transfer. It is organized in three sections: the first presents a state of the art on health data and some previous work on the security of health systems; the second presents a security model at two levels, firstly at the level of health data transfer, using synchronous biometric authentication to trigger the sending of data, and secondly at the local level, with partial encryption based on the work of Mohammed Miroud [1]; the third and last presents the results obtained, followed by a discussion.
2 State of the Art
According to the World Health Organization (WHO), “Health is a state of physical, mental and social well-being and not merely the absence of disease or infirmity” [2]. Medical informatics has two main areas: on the one hand, the field of medical practice, with diagnostic assistance, computerization of the electronic patient record, economic optimization and others; on the other hand, the field of medical science research, which takes into account the following three criteria: internal validity, external validity and clinical relevance [3]. The concept of information system security covers a set of methods, techniques and tools responsible for protecting the resources of an information system so as to ensure integrity, confidentiality, availability, non-repudiation, authentication and others. In order to guarantee good security, it is necessary to define a security perimeter. “The security perimeter within the physical universe delimits [...] The definition must also include (or not) the immaterial entities that populate computers and networks, essentially software and in particular operating systems” [4]. The legal aspect of health information is located at three levels [5]: personal data, defined by the French National Commission for Information Technology and Civil Liberties (CNIL) as “any information relating to a natural person who can be identified, directly or indirectly”; sensitive data, which are those that reveal, directly or indirectly, the racial or ethnic origins, political, philosophical or religious opinions or trade union membership of individuals, or relate to their health or sex life, as defined by the CNIL; and the processing of personal data, which covers all operations involving such data. The protection of health data involves the use of encryption algorithms, hash functions, authentication methods and many others. Several works have been carried out on health systems and some have been the subject of academic research. Katsikas, Spinellis et al. [8] presented in 1998 a security architecture based on a trusted authority. This trusted authority is responsible for certifying the interactions between the two participants and guaranteeing that each is who they claim to be. Akinyele, Lehmann et al. [6] developed in 2011 a medical records management application compatible with Apple smartphones. In this work, user data is encrypted and stored in the Cloud.
Basel Katt [7] also proposes a scheme based on encrypting data before storing it in the Cloud. The disadvantage of this model is the impossibility of searching the data, since even the server on which it is stored is not able to decrypt it.
3 Data Security Model
Problems related to the security of medical data can be observed at two levels: firstly, at the local level, to guarantee the availability of data, prevent illegal access and others; secondly, at the level of the transfer of medical records, which can be subject to numerous attacks jeopardizing the confidentiality of the data exchanged, their integrity, authenticity, etc.

3.1 Local Security in Healthcare Systems
Local security here consists in protecting the data server from intrusions of any kind. Several kinds of problems can be observed at this level. For example, a hacker could attack the medical data server directly and read diagnostic information, test results, X-rays and other information. At the same time, he could modify, create or delete patient records. He could also render the server data inaccessible and thus the system non-functional. A possible solution to prevent these intrusions would be to encrypt all the information before it is stored in the data server, or to encrypt the server itself. But this solution is very cumbersome, because each access to a medical file would require the complete decryption of the information, including information that is not needed, which takes more and more time as the volume of data increases. Since this solution is cumbersome, another possibility is to partially encrypt the data contained in the server. Partial encryption requires a prior categorization of data. The work presented by Mohammed Miroud [1] in 2016 can be summed up in three important points: access to data through the Google Authenticator application, the generation of one-time passwords, and the method of storing data in the Cloud, which consists of categorizing data into three levels of sensitivity: non-sensitive, low-sensitivity and high-sensitivity. This work focuses only on the data categorization method presented in Mohammed's work. This classification allowed Mohammed to optimize the data access time by encrypting the data using the Rijndael (AES) symmetric algorithm with keys of different sizes: a 128-bit key for non-sensitive data, 192 bits for low-sensitivity data and 256 bits for high-sensitivity data. This work exploits the classification model presented above (Fig. 1). Contrary to Mohammed, who encrypts all three classes of data, here only two classes are encrypted, namely the low-sensitivity class with a 192-bit key and the high-sensitivity class with a 256-bit key. Since non-sensitive data does not reveal any information about the patient's health status, only access control to the server is applied to it (Fig. 2).
Fig. 1. Diagram inspired by Mohammed’s model [10]
Fig. 2. Local security model [10]
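The following sketch illustrates the partial-encryption idea described above; it is not the authors' implementation. The paper specifies AES with 192- and 256-bit keys for the two encrypted classes but not the block-cipher mode or the record layout, so AES-GCM, the field contents and the helper names below are assumptions.

```python
# Minimal sketch of per-class partial encryption (illustrative, not the authors' code).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

KEY_BITS = {"low": 192, "high": 256}        # non-sensitive data is not encrypted
KEYS = {cls: AESGCM.generate_key(bit_length=bits) for cls, bits in KEY_BITS.items()}

def store_field(value: bytes, sensitivity: str) -> bytes:
    """Return the value as it would be written to the database."""
    if sensitivity == "non":
        return value                         # stored in clear, protected by access control only
    aesgcm = AESGCM(KEYS[sensitivity])
    nonce = os.urandom(12)                   # 96-bit nonce, unique per record
    return nonce + aesgcm.encrypt(nonce, value, None)

def read_field(stored: bytes, sensitivity: str) -> bytes:
    if sensitivity == "non":
        return stored
    aesgcm = AESGCM(KEYS[sensitivity])
    return aesgcm.decrypt(stored[:12], stored[12:], None)

# Example: only the high-sensitivity field is actually encrypted and decrypted on access.
record = store_field(b"diagnosis: hypertension", "high")
assert read_field(record, "high") == b"diagnosis: hypertension"
```

Only the requested field is decrypted, which reflects the access-time advantage of partial encryption reported in Sect. 4.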
3.2 Data Exchange Security
The medical records transfer sub-model consists of four steps.
Step 1: Opening a Secure Connection. Opening a secure connection consists of creating a connection to the server to which the data will be transmitted. This is done using OpenSSH, the free implementation of the SSH server, and provides a first level of security for the data transfer.
Step 2: Authorization to Send Files. This step consists of identifying the user who wishes to transmit the message, but also authenticating this user. As far as the authentication of the user (the doctor) is concerned, the transfer of the medical file is conditioned by the capture of biometric parameters. This biometric module is therefore the trigger for sending the patient file. Even when connected to the system, if the doctor does not have the right to transfer a file, he will not be able to do so. In (1), the physician chooses the medical record to be transferred and, once the record is found in the system (2), he or she performs an authentication
Fig. 3. User authentication before sending data [10]
(3) to the system. If the authentication fails, the process is interrupted (4); otherwise, the file is retrieved (5), encrypted (6) and transmitted to the destination server. At the recipient server, the integrity and authenticity (7) of the received data are checked in order to accept or reject it. Sending or transferring health data requires the physician to authenticate himself or herself, using the biometrics module. This makes it possible to control and trace all medical record transfer operations in the system. Capturing the biometric parameter is synchronized with the data sending module, which requires the healthcare staff to be present when the data is sent. Indeed, if the biometric module were asynchronous, the sending of a file could be scheduled for a later time, which poses a major problem because the data to be sent could be changed or modified by a malicious user before the sending time.
Step 3: Encryption and Data Hashing. This step consists of encrypting the data in order to guarantee confidentiality, and of preserving its integrity by using a hash function. For encryption, the algorithm used is the Rijndael symmetric AES algorithm with a 256-bit key; it was chosen because it remains to this day one of the safest symmetric algorithms. For data integrity, the function used is HMAC. This function makes use of a secret key and is very effective because, even if the hash function were broken, an attacker would still need to recover the key to deceive the recipient.
Step 4: Validating the Authenticity of the Message. This last step consists in the acceptance or rejection of the message by the recipient server. The HMAC function makes use of a secret key shared by both communicating parties. This key makes it possible to validate the source of the message after reception, and thus to ensure that the received message is indeed authentic (step 7 of Fig. 3). The HMAC function therefore allows not only checking the integrity of the received message, but also authenticating it (Fig. 4).
Fig. 4. Confidentiality and integrity [10]
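A minimal sketch of steps 3 and 4 follows; it is not the authors' code. The paper specifies AES-256 for confidentiality and HMAC with a shared secret key for integrity and authenticity, but not the cipher mode or the message layout, so CBC mode and the iv/ciphertext/tag framing below are assumptions.

```python
# Encrypt-then-MAC sketch for the medical record transfer (illustrative assumptions noted above).
import hashlib, hmac, os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

def protect(record: bytes, enc_key: bytes, mac_key: bytes) -> bytes:
    """Encrypt the medical record with AES-256, then append an HMAC-SHA256 tag."""
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(record) + padder.finalize()
    encryptor = Cipher(algorithms.AES(enc_key), modes.CBC(iv)).encryptor()
    ciphertext = encryptor.update(padded) + encryptor.finalize()
    tag = hmac.new(mac_key, iv + ciphertext, hashlib.sha256).digest()
    return iv + ciphertext + tag

def verify_and_decrypt(blob: bytes, enc_key: bytes, mac_key: bytes) -> bytes:
    """Step 7: the recipient checks integrity/authenticity before decrypting."""
    iv, ciphertext, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(mac_key, iv + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message rejected: HMAC check failed")
    decryptor = Cipher(algorithms.AES(enc_key), modes.CBC(iv)).decryptor()
    padded = decryptor.update(ciphertext) + decryptor.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()

enc_key, mac_key = os.urandom(32), os.urandom(32)   # AES-256 key and shared HMAC key
blob = protect(b"patient file ...", enc_key, mac_key)
assert verify_and_decrypt(blob, enc_key, mac_key) == b"patient file ..."
```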
4 Results, Analysis and Discussion
This work has been implemented on the OpenMRS health system, which is an open source system currently being deployed in health centers in Cameroon. The data used in the tests come from the OpenMRS database [9], which contains 5000 patients and nearly 500000 observations. To carry out the tests, data of different sizes are loaded and the model is applied in order to observe the evolution of the access time as a function of the volume of data. Table 1 summarizes the data access time as a function of the data volume and the chosen security method. The tests were performed in a loop of 100 iterations.

Table 1. Evolution of the execution time according to the method used and the volume of data [10]

Number of patients | Global encryption | Total encryption | Partial encryption
100                | 116.497           | 7.065            | 0.395
500                | 148.879           | 34.171           | 0.415
1000               | 181.317           | 68.757           | 0.424
2000               | 250.765           | 135.325          | 0.427
3000               | 314.756           | 199.272          | 0.476
4000               | 391.401           | 268.228          | 0.481
5000               | 468.772           | 332.001          | 0.553
– Global Encryption: treats the database as a file and encrypts it as a whole
– Total Encryption: treats the database as a file and encrypts only the useful information contained in it
– Partial Encryption: partitions the data before encryption
From this table, the grouped histogram and the curves below are obtained (Fig. 5 and 6):
Fig. 5. Grouped histogram of execution times [10]
Fig. 6. Representative curve of data access time as a function of data volume [10]
Real-time access can be observed for partial encryption. It can be concluded that this method is very effective compared to full data or database encryption. Compared to the work presented by Mohammed [1], the access time here is lower because the data is not stored in the Cloud, because this model uses a periodically renewed key rather than a single-use key (which saves key generation time), and because the size of the encryption keys is reduced. As far as the security of data transfer is concerned, the following images show both clear and encrypted data. A meaningless sequence of characters can be observed in the encrypted file as well as in the hash result (Fig. 7, 8 and 9).
Fig. 7. Hashed data [10]
Fig. 8. Clear data [10]
Fig. 9. Encrypted data [10]
Notes and Comments. The solution obtained for local security builds on only one part of Mohammed's model [1], namely the data categorization. Local security here is more efficient than in Mohammed's work because the time spent in [1] on key generation is saved: instead of single-use keys, periodically generated keys are used. In addition, the key size used for each category of data is reduced, which also optimizes data access time. The process of capturing the biometric parameter to validate the sending of the message is synchronous, which means that the physician must be present during the data transfer.
5 Conclusion
E-health continues to progress and to become more widely accessible throughout the world. The same is true of the threats to these systems, particularly to personal data: cybercriminals are increasingly inventive and are turning more and more to this type of system rather than, for example, banking systems. This work focused on the security of health systems, particularly in the case of Cameroon. The first model proposed ensures the security of data locally, by opting for partial data encryption. The second model secures data exchanges with remote servers. The biometric authentication that initiates the sending of messages is particularly interesting because it makes it possible to control and trace communications in health care centers.
Although local storage and exchanges are now secured, a question remains about the encryption keys: what about the process of generating and exchanging cryptographic keys? In forthcoming work, we will present the process of managing cryptographic keys.
6 Future Work
Future work will focus on:
– proposing a model for the periodic generation of cryptographic keys;
– proposing a model for classifying medical records according to their degree of sensitivity, taking into account the surrounding parameters.
Acknowledgment. The authors wish to thank:
– the Alexander von Humboldt Foundation, for its support, in particular the acquisition of part of the equipment and research tools, with the funding received from this foundation in 2019 for the project entitled “Multi-scale Analysis and Data Processing”;
– the AUF (Agence Universitaire de la Francophonie), for its support, in particular the strengthening of the technical platforms of our laboratories in 2019, with the funding received for the project entitled “Santé Numérique Sécurisée : Analyse et Sécurisation des données Big data pour les prédictions d'intérêt médical” (Secure Digital Health: Analysis and Securing of Big Data for Predictions of Medical Interest).
References
1. Mohammed Miroud, E.: La sécurité dans les systèmes de santé. Ph.D. thesis, Université des Sciences et de la Technologie d'Oran Mohamed Boudiaf, Algérie, Juin 2016
2. Safon, M.-O.: Articles littéraire grise, ouvrages, rapport: centre de documentation de l'Irdes (2018)
3. Olivier, M.: Conférence présentée aux journées d'étude de l'ANSI. Les résultats probants dans la recherche médicale, (96), janvier 2018
4. Quantin, C., Allaert, F., Auverlot, B., Rialle, V.: Sécurité: aspects juridiques et éthiques des données de santé informatisées. Springer-Verlag, Paris (2013)
5. Lenoir, N.: Élément pour un premier bilan de cinq années d'activité. La revue administrative 36(215), 1983 (1978). La loi 78-17 du 6 janvier et de la Commission nationale de l'informatique et des libertés
6. Akinyele, J., Pagano, M., Peterson, Z., Lehmann, C., Rubin, A.D.: Securing electronic medical records using attribute-based encryption on mobile devices, report 2010/565. In: Proceedings of the 1st ACM Workshop on Security and Privacy in Smartphones and Mobile Devices. ACM (2011)
7. Katt, B.: A comprehensive overview of security monitoring solutions for e-health systems. In: 2014 IEEE International Conference on Healthcare Informatics, pp. 364–364. IEEE (2014)
8. Katsikas, S.K., Spinellis, D.D., Iliadis, J., Blobel, B.: Using trusted third parties for secure telemedical applications over the WWW: the EUROMED-ETS approach. Int. J. Med. Inform. 49(1), 59–68 (1998)
9. Mamlin, B.: Demo data. https://wiki.openmrs.org/plugins/servlet/mobile?contentld=5047323#content/view/5047323. Accessed 02 Sep 2010
10. Igor Kouam, K.: Sécurité des données dans les systèmes de santé au Cameroun. Mémoire master 2, University of Dschang, Informatique, Cameroun, July 2019
Predicting Levels of Depression and Anxiety in People with Neurodegenerative Memory Complaints Presenting with Confounding Symptoms
Dalia Attas1(B), Bahman Mirheidari1, Daniel Blackburn2, Annalena Venneri2, Traci Walker3, Kirsty Harkness4, Markus Reuber4, Chris Blackmore5, and Heidi Christensen1,6
1 Department of Computer Science, University of Sheffield, Sheffield, UK [email protected]
2 Sheffield Institute for Translational Neuroscience (SITraN), University of Sheffield, Sheffield, UK
3 Department of Human Communication Sciences, University of Sheffield, Sheffield, UK
4 Department of Neuroscience, University of Sheffield, Sheffield, UK
5 School of Health and Related Research (ScHARR), University of Sheffield, Sheffield, UK
6 Centre for Assistive Technology and Connected Healthcare (CATCH), University of Sheffield, Sheffield, UK
Abstract. The early symptoms of neurodegenerative disorders, such as Alzheimer's dementia, frequently co-exist with symptoms of depression and anxiety. This phenomenon makes detecting dementia more difficult due to overlapping symptoms. Recent research has shown promising results on the automatic detection of depression and memory problems using features extracted from a person's speech and language. In this paper, we present the first study of how automatic methods for predicting standardised depression and anxiety scores (PHQ-9 and GAD-7) perform on people presenting with memory problems. We used several regressors and classifiers to predict the scores according to a defined set of score ranges. Feature extraction, feature elimination and standard k-fold training were used. The results show that Recursive Feature Elimination can improve the correlation coefficients and minimise regression errors. Furthermore, classifying the severity score levels for the lower bands achieved better results than for the higher ones.

Keywords: Depression score prediction · Detecting dementia · SVM · Clinical applications of speech technology

1 Introduction
Dementia is a collection of symptoms that relate to deterioration in cognitive functioning such as memory loss and processing speed sufficient to impair ability
to handle daily tasks. People with early signs of dementia frequently present with co-morbid depression and anxiety. In addition, depression can occur in up to 40% of people diagnosed with dementia [1]. Overlapping symptoms can make it difficult to distinguish signs of depression from early signs of neurodegenerative dementia (ND), making it hard to correctly identify early stage Alzheimer's disease, and hence delaying the start of treatment or recruitment to clinical trials [2]. The diagnostic term Depressive Pseudo-Dementia (DPD) is used to describe patients who suffer from symptoms similar to dementia, but that are mainly caused by depression. Alternative diagnoses include Functional Memory Disorder (FMD), in which memory problems are not due to dementia or depression but relate to excessive worry about one's memory function, without evidence of neurodegenerative pathology. Numerous patient-reported assessments have been developed to assess and screen for depression and anxiety in clinical practice. In many memory clinics, where people are referred for dementia diagnosis, questionnaires such as the PHQ-9 and the GAD-7 are used to screen for depression and anxiety (more details in Sect. 3). However, the process of conducting the questionnaire can be distressing for patients and it takes up time from the consultation between patient and doctor. This paper investigates the feasibility of automatically predicting depression and anxiety scores in people presenting with neurodegenerative memory complaints and confounding symptoms. The system is based on automatically extracting acoustic cues from a person's speech as they answer questions about their memory as part of a newly proposed automatic dementia assessment tool developed by the team [3]. This would allow for the detection and tracking of levels of depression and anxiety as well as early stage dementia. To the best of our knowledge, this is the first study looking at using acoustic cues to predict these scores in the context of cognitive impairment from dementia and related conditions. The rest of the paper is organised as follows. Section 2 reviews the related background methods, and Sect. 3 describes the data used in the experiment. Section 4 illustrates the depression score prediction and classification pipeline including pre-processing, feature extraction, feature selection, and classification/regression. Sections 5 and 6 present the results and conclusions, respectively.
2 Background
A person's speech and language are affected by their mental state and cognitive ability [4,5]. These changes are often part of the diagnostic procedure for clinicians (e.g., verbal fluency tests for diagnosing dementia), and research into how to detect and utilise these changes in speech patterns using automated tools has been the focus of an increasing number of studies. Research investigating the speech patterns of people with depression has found common characteristics that are prevalent in their speech, such as prosodic, acoustic, and formant features [6–8]. Different methods in the literature have aimed to classify and predict depression in speech, either to detect the
presence of depression [9,10], classify the severity of depression [8,11], or to predict depression scores [12,13]. In recent years, some consensus as to which automated system might be best suited for this domain has emerged, and most conventional pipelines include a feature extraction front-end followed by a classifier or regressor back-end, depending on the task: detection of the severity level or prediction of scores. In [6], a study is presented that aimed to investigate the speech characterisation and speech segment selection methods that affect the automatic classification of depressed speech. The proposed system starts with segment selection based on voice activity detection (VAD), followed by acoustic feature extraction, feature normalisation, and modelling of depressed speech. The extracted acoustic features consist of speech production, pitch, energy and formant features, and a Gaussian Mixture Model (GMM) was applied. Using speech rate features that were either manually or automatically selected, the results improved slightly. Another study investigated the importance of vocal tract formant frequency features in the presence of depression [14]. The authors assessed the performance of a GMM and a support vector machine (SVM) in classifying the depression state from the extracted features. They found that the first three formant frequencies could be used as primary features, while other features related to speech production can improve the classification accuracy. Other automatic systems were built to classify the severity of depression in speech based on identified classes, such as a study exploring the ability of an SVM to organise depressed speech into predefined classes. The authors were interested in the voice quality features of speech on a breathy-to-tense dimension and the relationship between these features and depression disorders. The severity of the depression was detected based on neutral, positive, or negative classes of depression [8]. Score level prediction assigns a continuous value (a depression or anxiety outcome score) to an unknown speech sample. Several systems automatically predict levels of depression, focusing on specific features or feature sets. Many of these systems have been developed in response to the leading challenges in detecting emotions and depression, namely the Audio/Visual Emotion Challenge (AVEC). Each year the challenge has different objectives but mainly focuses on different aspects of identifying emotion and depression. The 2013/2014 editions aimed to predict depression levels using Beck Depression Inventory (BDI) estimates for each experiment session. The challenge provides the data for the participants and the gold-standard depression levels. The challenge organisers also provide a suggested, initial set of acoustic features [12,13]. Numerous studies have successfully applied the AVEC 2013/2014 feature set and achieved high performance, and the set has gone on to become somewhat of a de facto benchmark choice [15]. This paper aims to investigate how the emerging state-of-the-art techniques described above fare in a co-morbidity cohort where conflicting and overlapping symptoms will invariably introduce some noise in the features. To the best of our knowledge, this is the first study exploring automatic methods in this domain,
and in general, few studies have looked at similarly complex tasks moving beyond the popular challenge setups. A recent study aimed to detect apathy in older people with cognitive disorders using automatic speech analysis [16], and investigated the paralinguistic features that correlate with levels of apathy. The study included 60 patients aged 65 years or older. The participants were asked to describe a situation that stimulated recent affective arousal. The speech features extracted directly from the speech were prosodic, formant, source, and temporal. The temporal features in the study represent measures of speech proportion, such as the length of pauses and the speech rate. For classification, the participants were divided using the Apathy Inventory into either a non-apathy or an apathy group. The authors used simple logistic regression implemented in the Scikit-learn framework. The model was trained and evaluated using leave-one-out cross-validation, and performance was measured with the area under the curve (AUC). They reported that the prosodic features were more related to apathy, as stated in previous findings. The classification results showed that features related to sound and pause segments yield higher AUC results. However, they did not report any results that might link the detection of apathy with the cognitive disorder diagnostics.
3 Data
The data used in this study were collected from patients referred to a memory clinic at a local hospital between 2012 and 2014 [17]. The data was originally collected for a study focusing on analysing interactions between a neurologist and patients with functional memory disorder (FMD) versus patients with neurodegenerative dementia (ND). The participants and their accompanying persons (if present) were interviewed by a neurologist who asked a few semi-structured questions prior to administering the standard Addenbrooke's Cognitive Examination Revised (ACE-R) test. A standard microphone and a camera were used for recording the conversations, but only the audio is analysed in this paper. The final diagnoses of the individual patients were based on a series of comprehensive neuropsychological tests, investigation of the participants' medical history and/or MRI scans. For more detail see [18]. For the purposes of this study, 39 recordings were investigated. The average length of the recordings is 50 min, and when extracting the patient-only turns the available audio ranges in duration from 2 to 30 min. In addition, the participants were requested to complete standard depression (PHQ-9) and anxiety (GAD-7) questionnaires. PHQ-9 stands for Patient Health Questionnaire; it consists of nine depression items, each scored from 0 to 3 (0 meaning “not at all”, 3 meaning “nearly every day”), giving a score range of 0–27. An item is added at the end of the questionnaire to assess the implications of the problems mentioned in the questions for the patients' daily life routine. The PHQ-9 has been validated as a reliable tool to identify the severity of depression, based on data from two studies with 6000 participants [19]. The GAD-7 (Generalised Anxiety Disorder-7) contains seven questions, each scored from 0 to 3 (as per the PHQ-9), with a range of 0–21 [20].
Table 1. Dataset information

Diagnostic category | # Speakers | Duration
DPD                 | 15         | 2 h 44 min 7 s
FMD                 | 16         | 2 h 41 min 38 s
ND                  | 6          | 0 h 30 min 59 s
Undiagnosed         | 2          | 0 h 12 min 3 s
Table 1 summarises the number of speakers and the duration of the speech samples (patient-only turns) per diagnostic category, including the undiagnosed speakers. No definite cognitive diagnosis could be reached for two participants, but we included them because they had PHQ-9 and GAD-7 scores and this study explores the accuracy of detecting depression and anxiety.
Fig. 1. Distribution of PHQ-9 score counts in the dataset
Fig. 2. Distribution of GAD-7 score counts in the dataset
Figure 1 shows the line graph distribution, based on histogram counts, of the PHQ-9 scores per diagnostic category. The highest observed PHQ-9 score was 22. Likewise, the GAD-7 scores range from 0 to 20 in the dataset. Figure 2
presents the line graph distribution of the GAD-7 score counts per diagnostic category. It is clear that the FMD group has a minimal range of PHQ-9 and GAD-7 scores, while the people with DPD have higher scores. By contrast, people with ND have either low or high PHQ-9 and GAD-7 scores. Furthermore, Fig. 1 and 2 show that there is a high number of scores that occur only a few times in the dataset, which is a likely challenge for the prediction model.
4 Depression Score Prediction and Classification
The system pipeline comprises pre-processing, feature extraction, and classifier/regressor. We aim to explore the performance of the score prediction and score level classification in relation to the diagnostic classes as well as the effect of using feature dimensionality reduction on the model performance.

4.1 Pre-processing and Feature Selection
To predict the PHQ-9 score for the patient only, the parts of the neurologist–patient conversation that contained only the patient talking were extracted. The extracted segments were then concatenated to form a single audio file. To predict the depression scores, the AVEC 2014 challenge baseline feature set developed for the depression sub-challenge was chosen [13]. A total of 2268 features (Table 2) were extracted, based on 32 energy and spectral related descriptors and 6 voicing related descriptors. The feature set was extracted using the openSMILE toolkit [13,21].

Table 2. AVEC 2014 descriptors [13]

Energy & spectral (32): loudness (auditory model based); zero crossing rate; energy in bands; 25%, 50%, 75%, and 90% spectral roll-off points; spectral flux; entropy; variance; skewness; kurtosis; psychoacoustic sharpness; harmonicity; flatness; MFCC 1–16
Voicing related (6): F0 (sub-harmonic summation, followed by Viterbi smoothing); probability of voicing; jitter, shimmer (local); jitter (delta: “jitter of jitter”); logarithmic Harmonics-to-Noise Ratio (logHNR)
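As an illustration of this extraction step, the snippet below uses the opensmile Python package. The AVEC 2014 configuration is shipped with the openSMILE distribution rather than with this package, so the ComParE 2016 functionals are used here as a stand-in, and the audio file name is an assumption.

```python
# Sketch of acoustic feature extraction with openSMILE (stand-in feature set, see note above).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("patient_only_turns.wav")   # one row of functionals per file
print(features.shape)
```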
We apply a standard feature reduction method, the Recursive Feature Elimination (RFE) algorithm, which selects the most significant features and has been proven successful for similar sparse data domains [3]. The aim of RFE is to compute the coefficient of each feature and to eliminate the features with the smallest coefficients in a recursive manner. The RFE strategy
depends on first building an estimator (either a regressor or a classifier in our case) that is trained on the training set feature vectors. The importance of each individual feature is then obtained from its retrieved coefficient, and the features with the smallest coefficients are eliminated from the feature set. This procedure is repeated recursively until the desired number of features is reached [22,23]. One of the difficulties when combining RFE with cross-validation is that the features selected on one fold would likely differ from those selected on another fold. For that reason, we used Recursive Feature Elimination with Cross-Validation (RFECV) as implemented in the Scikit-learn library in Python [22,23]. RFECV determines the number of features by fitting over the training folds and selecting the features that produce the smallest averaged error over all folds. Using the RFECV method, the number of features decreased from 2268 to 337.
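A minimal sketch of this RFECV step with a linear SVR is shown below (not the authors' exact code); the feature matrix and scores are synthetic placeholders standing in for the 39 recordings with 2268 AVEC 2014 features and their PHQ-9 scores.

```python
# RFECV over the AVEC feature matrix with a linear SVR estimator (synthetic data for illustration).
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X = np.random.default_rng(0).normal(size=(39, 2268))   # placeholder feature matrix
y = np.arange(39) % 23                                  # placeholder PHQ-9 scores (0-22)

estimator = SVR(kernel="linear", C=1.0, epsilon=1.0)    # a linear kernel exposes coef_ for ranking
selector = RFECV(estimator, step=50, cv=3, scoring="neg_mean_absolute_error")
selector.fit(X, y)

print(selector.n_features_)                             # 337 features were kept on the real data
X_reduced = selector.transform(X)
```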
4.2 Classification and Regression
A Support Vector Regression (SVR) model was used by the AVEC 2013/2014 challenges as a baseline for implementing a conventional regression pipeline based on their feature set [12,13]. SVR has been shown to deliver the best results in predicting continuous emotion parameters as well as depression diagnostics [24,25]. Furthermore, SVM has been used in several studies to classify the depression score level and has proven its ability to classify mental state [6,26,27]. In this study, we first compared several classifiers to determine an efficient model for classification: SVM, Decision Tree Classifier, Random Forest Classifier, Ada Boost Classifier, and Gradient Boosting Classifier. In addition, we compared the following regressors: SVR, Decision Tree Regressor, Gradient Boosting Regressor, and Ada Boost Regressor. SVM and SVR obtained higher recognition rates with an optimal number of features, and subsequent results are reported for those architectures only. These preliminary experiments led us to choose SVR and SVM as our models and to use RFE as the feature reduction method, given the relatively small size of the dataset (a very common challenge in the healthcare domain). We investigate both predicting the depression and anxiety scores (regression) and classifying the score levels of severity for PHQ-9 and GAD-7 (classification). Support Vector Regression (SVR) is a methodology based on Support Vector Machines (SVM) that fits a regression function within a margin of tolerance around the training data [28]. A linear kernel was used for predicting the depression scores, with C = 1.00 and epsilon = 1.00. Due to the small number of recordings in the dataset, we chose K-fold cross-validation, which splits the dataset into the requested number of folds. As is often the case in healthcare domains, the dataset is small and, in addition, the outcome measures are not evenly distributed (cf. Fig. 1 and 2). Stratified K-fold cross-validation was employed to mitigate this: it distributes the score occurrences fairly between the training and test splits, which improves the results
in a way appropriate for such small datasets. Three splits gave the best results. To classify the severity levels based on the scores, the PHQ-9 scores were grouped according to the following bands [19]: (0–4) Minimal, (5–9) Mild, (10–14) Moderate, (15–19) Moderately Severe, and (20–27) Severe. The GAD-7 scores were similarly grouped according to the following bands [20]: (0–4) Minimal, (5–9) Mild, (10–14) Moderate, and (15–21) Severe. Each score range is called a ‘band score’, starting from band score 1 for the Minimal range up to band score 5 for PHQ-9 and band score 4 for GAD-7, resulting in 5 and 4 bands for the PHQ-9 and GAD-7, respectively. An SVM classifier is used, as it has often proven its ability to classify mental states, as mentioned earlier.
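The banding and classification step can be sketched as follows; the data are again synthetic placeholders, and the linear kernel is an assumption, since the paper does not state the kernel used for the SVM classifier.

```python
# PHQ-9 banding and SVM classification with stratified 3-fold cross-validation (illustrative).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def phq9_band(score: int) -> int:
    """Map a PHQ-9 score (0-27) to severity bands 1-5 [19]."""
    upper = [4, 9, 14, 19, 27]   # Minimal, Mild, Moderate, Moderately Severe, Severe
    return int(np.digitize(score, upper, right=True)) + 1

X = np.random.default_rng(1).normal(size=(39, 337))   # placeholder reduced feature matrix
scores = np.arange(39) % 23                           # placeholder PHQ-9 scores
bands = np.array([phq9_band(s) for s in scores])

clf = SVC(kernel="linear")
acc = cross_val_score(clf, X, bands, cv=StratifiedKFold(n_splits=3))
print(acc.mean())
```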
5 Results
The stratified cross-validation method with three splits gave the best results in terms of Correlation Coefficient (R), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) when no feature elimination was applied. Due to the difficulty of obtaining consistent features across folds when using RFE, the optimal number of features used is the one obtained using RFECV (337). Also, we applied the SVR only on the representative fold that has a correlation coefficient close to the averaged correlation coefficient of the three folds. The results of the three strategies using the PHQ-9 scores are shown in Table 3.

Table 3. Regression results

Method             | # Feat. | MAE  | RMSE | R
StratKFold         | 2268    | 4.82 | 5.58 | 0.43
StratKFold + RFE   | 337     | 4.47 | 5.46 | 0.54
RFECV (StratKFold) | 337     | 1.77 | 2.09 | 0.97
It is clear that the use of RFE and RFECV improves the predicted scores. The results of the RFECV with respect to the actual and predicted PHQ-9 scores are shown in Fig. 3. The model predicts the scores of the FMD and DPD groups fairly well, while it finds it difficult to predict the higher scores of people with DPD; this is likely related to the few occurrences of these scores in the dataset. We used SVM to classify the depression severity level (PHQ-9) and the anxiety severity level (GAD-7). To investigate the effect of the dementia diagnostic category on depression and anxiety, Fig. 4 and 5 plot the confusion matrices of the depression and anxiety severity classification based on the banded scores mentioned above, for each of the diagnoses available in the dataset (FMD, ND, and DPD). The figures include the Accuracy (Acc), Precision (P), Recall (R), and F-score (F) rates for each class. The overall recognition accuracy was 80.73%, with a 0.74 F-score, 0.78 precision, and 0.76 recall.
Fig. 3. The results of RFECV and SVR on PHQ-9 scores showing the score band ranges
Fig. 4. The confusion matrix for classifying PHQ-9 band scores per each diagnostic category.
Fig. 5. The confusion matrix for classifying GAD-7 band scores per each diagnostic category.
Figures 4 and 5 show that the classifier is able to classify scores in bands 1 and 2 for people with FMD and ND with high accuracy. These are the bands in which most PHQ-9 and GAD-7 scores occur for the FMD and ND groups. In contrast, the classifier struggled to classify some of the scores in the higher bands, due to the few occurrences of scores of 11 and above. The higher PHQ-9 and GAD-7 scores mostly occur in the DPD group.
6 Conclusion
People with early signs of dementia often also show signs of depression and anxiety which, because of overlapping symptoms, confound diagnostic procedures. However, it is vital that anxiety and depression are monitored in patients attending memory clinics, because they are a treatable cause of memory complaints and, as a co-morbid feature, will influence management decisions. This paper investigated how state-of-the-art depression and anxiety score regression and classification perform when applied to people attending memory clinics. We found that the use of feature reduction and k-fold training can enhance the use of SVR in predicting depression scores. The problem of high dimensionality in the AVEC 2014 feature set was addressed using RFECV. Furthermore, the stratified K-fold cross-validation method helped to mitigate the uneven score distribution in the dataset. The results show that classifying depression and anxiety using severity score levels can assist in classifying the FMD and ND diagnostics, especially if the occurrences of these scores are fairly distributed. In future work we plan to investigate multi-class learning strategies to jointly detect the risk of dementia and depression.
References
1. Salary, S., Moghadam, M.: Relationship between depression and cognitive disorders in women affected with dementia disorder. Procedia Soc. Behav. Sci. 84, 1763–1769 (2013)
2. Wright, S., Persad, C.: Distinguishing between depression and dementia in older persons: neuropsychological and neuropathological correlates. J. Geriatr. Psychiatr. Neurol. 20(4), 189–198 (2007)
3. Mirheidari, B., Blackburn, D., O'Malley, R., Walker, T., Venneri, A., Reuber, M., Christensen, H.: Computational cognitive assessment: investigating the use of an intelligent virtual agent for the detection of early signs of dementia. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE (2019)
4. Hall, J., Harrigan, J., Rosenthal, R.: Nonverbal behavior in clinician–patient interaction. Appl. Prev. Psychol. 4(1), 21–37 (1995)
5. Sobin, C., Sackeim, H.: Psychomotor symptoms of depression. Am. J. Psychiatr. 154(1) (1997)
6. Cummins, N., Epps, J., Breakspear, M., Goecke, R.: An investigation of depressed speech detection: features and normalization. In: Twelfth Annual Conference of the International Speech Communication Association (2011)
7. Flint, A., Black, S., Campbell-Taylor, I., Gailey, G., Levinton, C.: Abnormal speech articulation, psychomotor retardation, and subcortical dysfunction in major depression. J. Psychiatr. Res. 27(3) (1993)
8. Scherer, S., Stratou, G., Gratch, J., Morency, L.: Investigating voice quality as a speaker-independent indicator of depression and PTSD. In: Interspeech (2013)
9. Moore II, E., Clements, M., Peifer, J., Weisser, L.: Critical analysis of the impact of glottal features in the classification of clinical depression in speech. IEEE Trans. Biomed. Eng. 55(1) (2007)
10. Low, L., Maddage, N., Lech, M., Sheeber, L., Allen, N.: Detection of clinical depression in adolescents' speech during family interactions. IEEE Trans. Biomed. Eng. 58(3), 574–586 (2010)
11. Scherer, S., Stratou, G., Morency, L.: Audiovisual behavior descriptors for depression assessment. In: Proceedings of the 15th ACM International Conference on Multimodal Interaction. ACM (2013)
12. Valstar, M., Schuller, B., Smith, K., Eyben, F., Jiang, B., Bilakhia, S., Schnieder, S., Cowie, R., Pantic, M.: AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge. ACM (2013)
13. Valstar, M., Schuller, B., Smith, K., Almaev, T., Eyben, F., Krajewski, J., Cowie, R., Pantic, M.: AVEC 2014: 3D dimensional affect and depression recognition challenge. In: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge. ACM (2014)
14. Helfer, B., Quatieri, T., Williamson, J., Mehta, D., Horwitz, R., Yu, B.: Classification of depression state based on articulatory precision. In: Interspeech (2013)
15. Cummins, N., Epps, J., Sethu, V., Krajewski, J.: Variability compensation in small data: oversampled extraction of i-vectors for the classification of depressed speech. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE (2014)
16. König, A., Linz, N., Zeghari, R., Klinge, X., Tröger, J., Alexandersson, J., Robert, P.: Detecting apathy in older adults with cognitive disorders using automatic speech analysis. J. Alzheimer's Dis. 69, 1183–1193 (2019)
17. Mirheidari, B., Blackburn, D., Harkness, K., Walker, T., Venneri, A., Reuber, M., Christensen, H.: Toward the automation of diagnostic conversation analysis in patients with memory complaints. J. Alzheimer's Dis. 58(2), 373–387 (2017)
18. Elsey, C., Drew, P., Jones, D., Blackburn, D., Wakefield, S., Harkness, K., Venneri, A., Reuber, M.: Towards diagnostic conversational profiles of patients presenting with dementia or functional memory disorders to memory clinics. Patient Educ. Counsel. 98, 1071–1077 (2015)
19. Kroenke, K., Spitzer, R., Williams, J.: The PHQ-9: validity of a brief depression severity measure. J. Gen. Int. Med. 16(9), 606–613 (2001)
20. Spitzer, R., Kroenke, K., Williams, J., Löwe, B.: A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch. Int. Med. 166(10), 1092–1097 (2006)
21. Eyben, F., Wöllmer, M., Schuller, B.: openSMILE: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia. ACM (2010)
22. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1–3), 389–422 (2002)
23. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
24. Grimm, M., Kroschel, K., Narayanan, S.: Support vector regression for automatic recognition of spontaneous emotions in speech. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), vol. 4. IEEE (2007)
25. Cummins, N., Joshi, J., Dhall, A., Sethu, V., Goecke, R., Epps, J.: Diagnosis of depression by behavioural signals: a multimodal approach. In: Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge. ACM (2013)
26. Cummins, N., Epps, J., Ambikairajah, E.: Spectro-temporal analysis of speech affected by depression and psychomotor retardation. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE (2013)
27. Cummins, N., Epps, J., Sethu, V., Breakspear, M., Goecke, R.: Modeling spectral variability for the classification of depressed speech. In: Interspeech (2013)
28. Drucker, H., Burges, C., Kaufman, L., Smola, A., Vapnik, V.: Support vector regression machines. In: Advances in Neural Information Processing Systems (1997)
Towards Collecting Big Data for Remote Photoplethysmography
Konstantin Kalinin, Yuriy Mironenko, Mikhail Kopeliovich(B), and Mikhail Petrushan
Center for Neurotechnologies, Southern Federal University, Rostov-on-Don, Russia [email protected]
Abstract. Remote photoplethysmography (rPPG) is a technique for non-contact estimation of human vital signs in video. It enriches knowledge about the human state and makes interpretations of actions in human-computer interaction more accurate. An approach to the distributed collection of an rPPG dataset is proposed, along with a central hub where data is accumulated as links to local storages hosted by participating organizations. An instrument for rPPG data collection is developed and described. It is an Android application, which captures dual camera video from the front and rear cameras simultaneously. The front camera captures facial video, while the rear camera, with the flash turned on, captures a contact finger video. The facial videos constitute the dataset, while ground truth blood volume pulse (BVP) characteristics can be obtained by analysing the corresponding finger videos. Such an approach allows overcoming organizational and technical limitations of biometric data collection and hosting.
Keywords: Crowd science · Photoplethysmography · rPPG dataset · rPPG in the wild

1 Introduction
Human-computer interaction (HCI) constitutes a cyclic process where both parties perceive signals from the counter-party, perform actions, and analyse feedback. Recognition of short-term activities such as facial expressions, body movements, gaze trajectories, and gestures is typical for video-based HCI [21]. Knowledge about long-term processes and characteristics, such as human health state and vital signs, is also essential for building intelligent interaction interfaces. Such additional knowledge enables better “understanding” of actual human needs and goals to provide optimal assistance by a machine. Remote photoplethysmography (rPPG) is a technique to retrieve vital signs and their characteristics from video signals. Photoplethysmography originally meant the measurement of volume changes in tissue by photometric sensors [19]. The term “remote” means that there is no contact between the sensor and the tissue. Later, instead of direct measurement of volume dynamics, different correlates
Fig. 1. Proposed scheme for distributed collection of remote photoplethysmography dataset by various research groups.
were considered for measurement, such as micro-motions and tiny changes in the colour of the tissue. The main sources of normal organ volume dynamics are heartbeats and respiration [20,31]. Therefore, photoplethysmography mainly aims at retrieving characteristics of these physiological processes such as heart rate (HR) [29], heart rate variability (HRV) [26], and respiratory rate [31]. Recently, arterial blood oxygen saturation has also been estimated remotely in visible light [34]. The above characteristics determine the health and functional state of the person under observation. Therefore, photoplethysmography is used for health monitoring, while remote techniques (rPPG) allow one to make it continuous and non-disturbing for the person. One primary applied goal of rPPG studies is making this technology a convenient and non-disturbing instrument for health state monitoring for any person with a web camera. We consider two main difficulties in reaching the stated goal:
1. insufficient accuracy of HR estimation in the wild;
2. the difficulty of comparing different rPPG approaches in standardized conditions to choose the optimal one for a particular purpose.
In this work, we focus on the first problem: the accuracy of rPPG methods in the wild is insufficient to achieve the stated applied goal. A variety of research areas involving signal processing or pattern recognition had significant breakthroughs when deep learning approaches were applied to big data. For example, in image classification tasks, researchers typically use networks pre-trained on big data such as the ImageNet dataset [43], which produce embeddings that are highly effective in the image recognition domain. Thus, for most image classification tasks, transfer learning is applied, either by training a few-layered net on top of an embeddings generator or by retraining the whole net, including the embeddings generator, on
a specific small dataset for a particular task. Such retraining starts from weights obtained on a related task with big data available (not from random weights, as in learning from scratch). In the rPPG area, a number of datasets have appeared which cover different ages, genders, subject activities, etc. [3,38]. However, such datasets are still biased [38], having a strong disproportion of different genders, skin tones, ages, or capturing conditions. Furthermore, the entry threshold for research groups to involve the available data in training their models is still high, due to different standards of data representation in datasets, the necessity to develop or adjust parsers for each dataset, to equalize data from different datasets, etc. Thus, there is a need for a framework handling such routine procedures and providing an API for the available datasets. Further, the available datasets still do not provide “big” data compared to datasets in other computer vision problems such as image classification or semantic segmentation. For example, the “Moments in time” dataset [35] for event recognition in video contains about 50 times more minutes of video than VIPL-HR, currently the largest dataset in rPPG [36]. The following barriers can be noted for collecting big data for rPPG:
– technical barriers, which relate to capturing video and ground truth signals (or values) with appropriate synchronization;
– organizational barriers, which relate to keeping and distributing biometric data, which are regulated by local and international laws regarding personal information handling and copyright.
Both kinds of barriers require certain efforts from research teams, making the collection of big data by any particular group difficult. We propose a simple instrument in the form of an Android or iOS application to overcome both barriers by distributed collection of an rPPG dataset (Fig. 1).

1.1 rPPG in the Wild
The main rPPG problem is the retrieval of blood volume pulse (BVP) signal from a video. In previous works, factors were investigated that prevent the retrieval: body motions and mimics [55], dark lighting [37], dynamic lighting and shadows [28], extremely dark or fair skin tone [38], low-cost camera [42], video compression algorithms [66], heterogeneous and dynamic background [55], long distance from camera [5]. The factor of body motions appears to be the most significant one [3,22,45,55]. State-of-the-Art (SOTA) rPPG methods are capable of retrieving pulse and respiratory characteristics in stationary people using even cheap webcams in daylight or lamplight conditions. Several approaches were declared as capable of measuring HR in a moving person [55,62]. Nevertheless, in the presence of body motion, accuracy of SOTA rPPG methods raises the following concerns. A number of rPPG methods were declared to be accurate in estimating HR on particular datasets [3,38,47,63]. However, their results are often difficult to reproduce because some of the testing datasets are private (see Appendix 4), and techniques of methods comparison are not standardized [3,33]. There is still
a lack of details on the range of motion conditions in which rPPG works appropriately. RePSS [29], a recent rPPG contest, was held at the beginning of 2020. More than 100 teams registered, while only 25 teams published their results on the testing set. The training and testing datasets (composed of OBF [30] and VIPL-HR-V2 [29]) contained videos of moving persons. The mean absolute error (MAE) of heart rate (HR) estimation of the top-3 pipelines was in the range of ≈7–10 BPM (beats per minute). Such results are comparable to the MAE of an “average guess” approach (13–14 BPM), which always returns a constant HR computed as the average HR over the training set. According to the ANSI/AAMI EC13-2002 standard for Cardiac Monitors, Heart Rate Meters, and Alarms, the demonstrated accuracy is insufficient for HR evaluation in healthcare applications (allowable error of no greater than ±10% of the HR or ±5 BPM, whichever is greater) [1]. Some of the rPPG pipelines in the top 25% of the RePSS leaderboard were based on indirect assessment of HR values [24]. Such methods mostly exploit prior knowledge of the HR distribution over a subject's appearance rather than extracting the BVP signal or its derivatives. For example, according to [4], the approach of the top-ranked participant involved recognizing the person in a video (by analyzing the background, since the background was the same for the same person and differed between persons) and predicting the HR value mostly as the median HR for that person. Such exploitation of prior information on the HR distribution over persons, age, average motion, etc. leads to higher positions in the challenge. However, this approach will likely fail to predict HR deviations from the values typical of the recognized subject's appearance. Such unrecognizable deviations are of prime importance in medical applications, as they often indicate an abnormal health state and the necessity of medical assistance. End-to-end deep models have become popular in rPPG. In order to determine the practical value of such models, they should not be considered as black boxes generating good or bad predictions of HR values. Analysis should be done to determine what kind of features end-to-end nets infer: are they BVP-related features or appearance-related features? The latter indicates that the method performs indirect estimation of HR, which has minor value in medical applications.

1.2 Existing rPPG Datasets: Related Works
The known public datasets suitable for rPPG are presented in Table 1. A dataset is considered suitable for the task if it contains videos of living skin and corresponding blood volume pulse, blood pressure or electrocardiography signals. Typically, blood volume pulse is measured by a pulse oximeter, blood pressure by a sphygmomanometer, and ECG by an electrocardiograph. The largest public dataset in terms of total video duration is VIPL-HR (1212 min) [36]. It is also one of the three datasets with varying video backgrounds. We consider a varying background an important factor for deep learning approaches, because a constant background could lead to overfitting of the network components responsible for skin detection. In addition to the public datasets in Table 1, there are a number of possibly public datasets that we have failed to fully download for various reasons. They
Table 1. Publicly Available Datasets on rPPG With Provided RGB Data. Only those Data Sources from which HR can be Calculated are Taken into Account. Abbreviations: P—Pose; I—Illumination; B—Background; ECG—Electrocardiogram; PPG— Photoplethysmogram; BP—Blood Pressure signal. Title
Subjects Total Trials Video duration, records
Cameras Varying factors Ground Camera truth FPS
min
Total size, GB
source P I
B
DEAP [23]
22
874
874
874
1
PPG
50
ECG-Fitness [49]
17
204
102
204
2
ECG
30
207
Hoffman2019 [17]
3
21
21
1
ECG
25
18 35
66
25
Mahnob-HCIa [46]
27
654
559
559
1
ECG
61
MMSE-HR [65]
40
≈70
102
102
1
BP
25
79
8
8
1
PPG
30
151
MR NIRP [39]
8
24
9
9
17
17
1
BP
30
1.5
PURE [50]
10
≈59
59
59
1
PPG
30
39
UBFC-RPPG [8]
42
47
42
42
1
PPG
29
70
VIPL-HR [36] 107 1212 965 2378 a Recordings of Less than 1 min were Discarded.
3
PPG
30
51
Pose-Variant HR [58]
are presented in Table 2. Known private datasets mentioned in rPPG papers are listed in the Appendix 4. 1.3
Contributions
Key Idea of the Proposed Application for Dataset Collection is Synchronous Recording of Two Videos by Front and Rear Cameras (Fig. 2) of a mobile device (phone or tablet), which supports such simultaneous capturing. According to several researches, BVP signal could be relatively easily extracted from a video of a finger closely attached to a camera surface, especially with flash light turned on [11,32,53]. Therefore, while front camera is capturing a facial area, rear camera is capturing a finger. The latter video stream could be used to calculate the “ground truth” finger BVP or its derivatives such as HR or HRV. The finger BVP is not pure ground truth of face BVP (extracted from facial video) due to different appearance of pulse wave in finger and in facial tissue (for example, phase shift is expected between these two signals). However, it still can be used as the approximation of ground truth for training of machine learning models or for evaluation of vital sign estimation algorithms. We contribute to distributed collecting of rPPG dataset by sharing the codebase of the dual camera recorder and by proposing a central hub for hosting links to dataset fragments collected by research groups. We see the process of dataset collection by a research group as follows (see Fig. 1): 1. downloading the mobile application (data collector) from the public repository;
Towards Collecting Big Data for Remote Photoplethysmography
75
Fig. 2. Synchronous Recording of the Face Video (Front Camera) and Contact BVP Signal (Rear Camera, Index Finger, Flash is on) of the Subject.
2. gathering dataset fragment locally and hosting it on the group’s environment according to the proposed recommendation (see Sect. 2) and local laws on biometric data hosting and sharing; 3. publishing a link to the dataset fragment on the central hub. The proposed application lowers the entry threshold for participating in distributed collecting of rPPG dataset. Research groups and individual enthusiasts are capable of such participating having only devices supporting dual camera recording and basic IT skills. Furthermore, it can be collected “in the wild”—in conditions close to those where rPPG is expected to work for health monitoring. Such a “crowd science” approach looks especially reasonable in the current pandemic situation caused by COVID-19 since it makes it possible to collect big data “at home”. To the best of our knowledge, the proposed methodology for collecting big data for rPPG using a single mobile device has not been previously presented.
2
Distributed Dataset
The distributed dataset is a collection of sub-datasets, provided by independent institutions. Information about sub-datasets is stored in the public registry, available at https://osf.io/8zfhm/wiki/Datasets, including the contact information necessary to obtain access to a particular sub-dataset.
76
K. Kalinin et al.
(a)
(b)
Fig. 3. Example of estimated finger BVP signal (a) and its power spectral density (b). The data were obtained from a 10-s fragment of finger video in DCC-SFEDU.
Each sub-dataset consists of records, and each record is represented by two video-files and one “.json” file with the metadata. First video-file is the record of the subject’s face, and second one—of the subject’s finger. To reduce the file size, a finger video is recorded in a low resolution (depending on a hardware, e.g. 160 × 120). Merging of sub-datasets is designed to be simple—one can mix the files from multiple sub-datasets into a single folder. Metadata includes subject’s gender and year of birth (not date), timestamps of the start and finish of the video records, their resolutions and technical info about particular devices used to make these records. The surrogate user identifier is used to recognize records of the same person. This identifier should not be the name or initials of the person. While a face of a subject is clearly visible on video record, a metadata does not store any names or addresses. 2.1
DCC-Client—“Dual-Cam Collector” Mobile App
DCC-Client, an open-source Android application, was implemented to record dual-camera videos for the DCC datasets. DCC-Client is able to record the video of the subject’s face by front camera of the smartphone and, simultaneously, video of the subject’s finger, placed on top of the rear camera of the same device. Flash is constantly lit during the process of recording, directing enough light into the subject’s finger to form an improvised contact photometric sensor with strong (visible by naked eye) BVP signal. Technical details of the DCCClient are described in the Appendix 4. 2.2
DCC-SFEDU—Example Dataset
The DCC-SFEDU dataset was compiled as an example. At the moment of the paper submission, it includes 25 records (the size of the dataset is 2.5 GB) of
Towards Collecting Big Data for Remote Photoplethysmography
77
9 individuals (6 males and 3 females), made by two different mobile devices. There are at least two records of every person: one record in rest and other— after physical exercises. Every record is about 1 min long. Experimental protocol was approved by the bioethics committee of the Southern Federal University (protocol number 6 20, May 13, 2020). A finger BVP signal can be obtained and analyzed by basic signal processing techniques. For instance, in order to calculate HR value, one can apply spatial averaging of color components intensity within each video frame resulting in a three-dimensional (RGB) color signal. Further, average HR within a temporal interval can be calculated using frequency or peak analyses [54]. Figure 3 illustrates frequency analysis of the finger BVP signal obtained from a 10-s fragment of a dataset video record from a finger. The BVP signal (Fig. 3(a)) was calculated as the negative normalized green component of the color signal corresponding to the fragment.
3
Concerns and Limitations
We propose the distributed dataset as a contribution to data collection process by the rPPG community. Such collecting makes datasets more representative which remains essential due to the discovered biases in currently available data [29, 38]. Higher variety of datasets also makes possible the approach of datasets mixing [27,47] as an effective solution for the problem of Zero-shot Cross-dataset transfer (using multiple datasets for training increases model generalization). In this section, we describe limitations of our approach. 3.1
Video Synchronization
For each device, which is used (or will be used) for dataset collecting, comprehensive study of synchronization between two videos captured by front and rear cameras should be done. However, we do not consider such study as a blocking factor for starting collecting DCC rPPG dataset. The sources of desynchronization are: – constant latency between two videos; – variative latency caused by irregular frame rate in each video. We performed preliminary experiments on synchronization problems between two simultaneously obtained videos with one of the experimental devices (Asus ZB602KL). External light source was turned on and off several times in such a
78
K. Kalinin et al.
way that both cameras with 29 FPS captured each event (turning the light on and off); the cameras were started simultaneously. Timestamps of the recorded frames reveal that the rear camera was 0.12 s behind the front one; standard deviation of the latency variation was 0.02 s. Such latency would not significantly affect the HR assessment problem; however, it seems large enough to disturb the detection of BVP signal features. Therefore, we consider further experiments to determine latency properties. 3.2
Video Capturing Procedure
As we described in the Distributed Dataset section, a mobile device is held in an outstretched hand, which makes tiny movements caused by pulse wave passing through arteries. Thus, variations in front camera video laying in feasible heart rate frequencies range may be caused by hand motion mostly. If so, models trained on this data could overfit to such conditions and perform worse on video obtained from a stationary camera (fixed instead of being held in a hand). Although this phenomenon needs to be studied, even if it appears to be significant, the dataset is still of interest as it can be used for pre-training rPPG embedding extractor to be fine-tuned for end tasks via transfer learning.
4
Conclusion
A new approach to distribute dataset collection for remote photoplethysmography is introduced. It is proposed to collect data simultaneously using two cameras of a mobile device. Particularly, rear camera with flash on captures “ground truth” BVP signal, while front camera records video of facial area for training rPPG deep-learning models and evaluating rPPG algorithms. Therefore, no external sensors are required, which makes data collection standardized and easy. Open source Android application, which implements this approach, is provided to rPPG community. After being captured by rPPG-involved research team, such recordings constitute a local dataset, which is supposed to be hosted and managed by the team, while they could add the dataset link to the provided registry (https://osf.io/8zfhm/wiki/Datasets). Summarizing, the main contributions of the work are: 1. Android application with open codebase for easy collection of rPPG samples with ground truth. 2. Instruction on local dataset collection using this application. 3. Registry, where links to local datasets are collected and shared among rPPG community.
Towards Collecting Big Data for Remote Photoplethysmography
79
Future work is currently under way to investigate on concerns and limitations mentioned in the Discussion section. The video synchronization should be analyzed in experiments with a blinking light source visible by both cameras. The video capturing procedure is planned to be verified (to examine changes in facial appearance related to hand movement) by experimenting with stationary cameras instead of holding device in a hand. Funding The project is supported by The Russian Ministry of Science and Higher Education in the framework of Decree No. 218, project No. 2019-218-11-8185 “Creating a software complex for human capital management based on Neurotechnologies for enterprises of the high-tech sector of the Russian Federation” (Internal number HD/19–22-NY).
Appendix 1. Non-public rPPG Datasets Table 2 lists rPPG-suitable datasets declared as public but have not been accessed in part or in full due to various reasons. The estimated total video duration in a dataset is based on its description in the corresponding paper. In addition to the above, there are dozens of private datasets mentioned in one or several papers on rPPG: HNU [64], HR-D [44], BSIPL-RPPG [47], and untitled ones [2,6,9,10,12–15,48,51,52,55–57,59–61]. The both training and testing datasets of the RePSS competition [29] are also considered private because they have been publicly available only during the competition.
2. Dual-Camera Collector Dual-Camera Collector or DCC-Client is an open source (https://github.com/ Assargadon/dcc-client) Android mobile application that records video from both cameras of a mobile device simultaneously (Fig. 2). It has, though, several key differences from a generic video recorder, even if it is able to capture the video from two cameras simultaneously.
80
K. Kalinin et al.
Table 2. Datasets on rPPG that were declared publicly available but have not been fully accesseda . Title
Total duration, min
Reason of absence of full access
COHFACE [16]
160
The authors didn’t respond to requests
LGI [41]
100
Only small part of the dataset is publicly available
MoLi-ppg-1 [40]
480
According to the authors, the dataset is to be released, no deadlines were specified
MoLi-ppg-2 [40]
210
According to the authors, the dataset is to be released, no deadlines were specified
OBF [30]
636
The authors didn’t respond to requests
PFF [18]
N/A
The authors didn’t respond to requests
rPPG [25]
54
Only extracted color signals and ground truth HR data are available
VIPL-HR V2 [29] 6000+
According to the authors, the dataset is to be released, no deadlines were specified
VitalCamSet [7]
According to the authors, the dataset was to be released within 2020
a
520
Valid as of January 12, 2021.
App Features First, data records are designed to be anonymous, keeping track of a metadata related to experiment conditions. Users have the ability to provide a desired metadata. Other metadata is determined automatically. Second, there is no need to have a high-resolution image for a camera capturing a finger, but a flash constantly turned on would be of help. Despite of rear camera usually offers higher capture resolution than a front one, it was considered to use front camera to capture face video and rear camera to capture finger due to the following reasons: – Usage of a face-oriented (front) camera to record a face video makes the user able to operate the UI of the application during the recording. Therefore, a user is able to capture a record of themselves without an assistant, on their own. – A front camera usually has enough resolution to capture the user face at a regular distance from the device—and this is its purpose. – A rear camera almost always has a flashlight, which is useful to record contact BVP signals. Front cameras sometimes have flash too—but less often. Third, the application provides two video data streams obtained simultaneously from front and rear cameras. While heart rate value is only defined in a sliding window and there is no strict requirement for per-frame synchronization, some methods and studies could be very sensitive to a time offset, for example: – Training of deep learning models to extract BVP signal in facial video based on asynchronous ground truth contact BVP signal obtained by rear camera.
Towards Collecting Big Data for Remote Photoplethysmography
81
– Studies based on measurement of the phase shift between BVP signals from finger and face. – Methods of Heart Rate Variability estimation. Fourth, the application collects the standardized data to the permanent dataset storage. User Experience The application determines if a device has the capability to capture videos from two cameras simultaneously—unfortunately, not all Android devices are able to do it. If no, an alert is presented to the user, which means that the device cannot be used to collect the dataset.
Fig. 4. Screenshots of the DCC-Client UI: User Metadata form (a); Process of Video Capture (b).
The application has a simple form containing fields of “User ID”, “Year of Birth” and “Gender” (Fig. 4(a)). “User ID” refers to a surrogate identifier that has the purpose of tracking the same persons, while keeping the anonymity of their personalities. For example, it may be “1” for the first subject, “2” for the second subject and so on. If another record of the first subject is performed (maybe even on a different device), User ID “1” should be used again.
82
K. Kalinin et al.
After the user enters the required information, a live video preview is shown in Fig. 4(b). User can start recording by tapping the respective button in the UI. When a recording is to be finished, the user taps the button again. The dataset entry is represented in a form of three files that are named with a unique prefix -. This ensures that the file group of the single recording is easily recognized. Two video files contain the streams for the subject’s face and finger. Third one, “.json”, contains metadata about both subject (gender and year of birth, as explained above), device (model, anonymous device unique identifier), and record itself (start and final timestamps of the recording, resolutions of face and finger videos). After performing one or several recordings, one can use a data cable or another file transfer mechanism to extract recordings from the device and put it to the permanent dataset storage. There is no need to avoid naming collisions because file names are generated to be unique.
References 1. American National Standards Institute and Association for the Advancement of Medical Instrumentation. Cardiac monitors, heart rate meters, and alarms. Association for the Advancement of Medical Instrumentation, Arlington, Va (2002) 2. Antink, C.H., Gao, H., Br, C., Leonhardt, S.: Beat-to-beat heart rate estimation fusing multimodal video and sensor data. Biomed. Opt. Express 6(8), 2895–2907 (2015) 3. Antink, C.H., Lyra, S., Paul, M., Yu, X., Leonhardt, S.: A broader look: camerabased vital sign estimation across the spectrum. Yearb. Med. Inform. 28(1), 102– 114 (2019) 4. Artemyev, M., Churikova, M., Grinenko, M., Perepelkina, O.: Neurodata lab’s approach to the challenge on computer vision for physiological measurement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020) 5. Blackford, E.B., Estepp, J.R.: Measurements of pulse rate using long-range imaging photoplethysmography and sunlight illumination outdoors. In: Cot´e, G.L. (ed.) Optical Diagnostics and Sensing XVII: Toward Point-of-Care Diagnostics, vol. 10072, pp. 122–134. SPIE, San Francisco, California, United States (2017) 6. Blackford, E.B., Estepp, J.R., Piasecki, A.M., Bowers, M.A., Samantha, L.: Longrange non-contact imaging photoplethysmography: cardiac pulse wave sensing at a distance. In: Optical Diagnostics and Sensing XVI: Toward Point-of-Care Diagnostics, vol. 9715, pp. 176–192 (2016) 7. Bl¨ ocher, T., Krause, S., Zhou, K., Zeilfelder, J., Stork, W.: VitalCamSet - a dataset for Photoplethysmography Imaging. In: 2019 IEEE Sensors Applications Symposium (SAS), pp. 1–6 (2019) 8. Bobbia, S., Macwan, R., Benezeth, Y., Mansouri, A., Dubois, J.: Unsupervised skin tissue segmentation for remote photoplethysmography. Pattern Recogn. Lett. 124, 82–90 (2019) 9. Chen, W., Hernandez, J., Picard, R.W.: Estimating carotid pulse and breathing rate from near-infrared video of the neck. Physiol. Meas. 39(10), 10NT01 (2018) 10. Chen, W., Picard, R.W.: Eliminating physiological information from facial videos. In: 2017 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), pp. 48–55 (2017)
Towards Collecting Big Data for Remote Photoplethysmography
83
11. Coppetti, T.: Accuracy of smartphone apps for heart rate measurement. Eur. J. Prev. Cardiol. 24(12), 1287–1293 (2017) 12. Estepp, J.R., Blackford, E.B., Meier, C.M.: Recovering pulse rate during motion artifact with a multi-imager array for non-contact imaging photoplethysmography. In: 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1462–1469 (2014) 13. Ghanadian, H., Al Osman, H.: Non-contact heart rate monitoring using multiple RGB cameras. In: Vento, M., Percannella, G. (eds.) Computer Analysis of Images and Patterns CAIP 2019. Lecture Notes in Computer Science, vol. 11679, pp. 85– 95. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-29891-3 8 14. De Haan, G., Jeanne, V.: Robust pulse-rate from chrominance-based rPPG. IEEE Trans. Biomed. Eng. 6600(10), 1–9 (2013) 15. Han, B., Ivanov, K., Wang, L., Yan, Y.: Exploration of the optimal skin-camera distance for facial photoplethysmographic imaging measurement using cameras of different types. In: Proceedings of the 5th EAI International Conference on Wireless Mobile Communication and Healthcare, pp. 186–189. ICST: Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, London, Great Britain (2015) 16. Heusch, G., Anjos, A., Marcel, S.: A Reproducible Study on Remote Heart Rate Measurement. arXiv preprint, arXiv:1709 (2017) 17. Hoffman, W.F.C., Lakens, D.: Public Benchmark Dataset for Testing rPPG Algorithm Performance. Technical report (2019) 18. Hsu, G.-S., Chen, M.-S.: Deep learning with time-frequency representation for pulse estimation from facial videos. In: 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 383–389 (2017) 19. Jang, D.-G., Park, S., Hahn, M., Park, S.-H.: A real-time pulse peak detection algorithm for the photoplethysmogram. Int. J. Electron. Electr. Eng. 2(1), 45–49 (2014) 20. Kamshilin, A.A., et al.: A new look at the essence of the imaging photoplethysmography. Sci. Rep. 10494:1–10494:9 (2015) 21. Karray, F., Alemzadeh, M., Saleh, J., Arab, M.N.: Human-computer interaction: overview on state of the art. Int. J. Smart Sens. Intell. Syst. 1, 137–159 (2008) 22. Khanam, F.-T.-Z., Al-naji, A.A., Chahl, J.: Remote monitoring of vital signs in diverse non-clinical and clinical scenarios using computer vision systems: a review. Appl. Sci. 9(20), 4474 (2019) 23. Koelstra, S., et al.: DEAP: a database for emotion analysis; using physiological signals. IEEE Trans. Affect. Comput. 3(1), 18–31 (2012) 24. Kopeliovich, M., Kalinin, K., Mironenko, Y., Petrushan, M.: On indirect assessment of heart rate in video. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, pp. 1260–1264 (2020) 25. Kopeliovich, M., Petrushan, M.: Color signal processing methods for webcam-based heart rate evaluation. In: Bi, Y., Bhatia, R., Kapoor, S. (eds.) Intelligent Systems and Applications. IntelliSys 2019. Advances in Intelligent Systems and Computing, pp. 1038:703–723 (2019) 26. Kranjec, J., Begus, S., Gersak, G., Drnovsek, J.: Non-contact heart rate and heart rate variability measurements: a review. Biomed. Signal Process. Control 13(July), 102–112 (2014) 27. Lasinger, K., Ranftl, R., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. CoRR, ArXiv:1907.01341 (2019)
84
K. Kalinin et al.
28. Lee, D., Kim, J., Kwon, S., Park, K.: Heart rate estimation from facial photoplethysmography during dynamic illuminance changes. IEEE Eng. Med. Biol. Soc. 2758–2761 (2015). https://doi.org/10.1109/EMBC.2015.7318963 29. Li, P., et al.: Video-based pulse rate variability measurement using periodic variance maximization and adaptive two-window peak detection. Sensors (Switzerland) 20(10) (2020) 30. Li, X., et al.: The OBF database: a large face video database for remote physiological signal measurement and atrial fibrillation detection. In: 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018), pp. 242–249 (2018) 31. Luguern, D., et al.: An assessment of algorithms to estimate respiratory rate from the electrocardiogram and photoplethysmogram. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020) 32. Matsumura, K., Rolfe, P., Yamakoshi, T.: iPhysioMeter: a smartphone photoplethysmograph for measuring various physiological indices. In: Rasooly, A., Herold, K.E. (eds.) Mobile Health Technologies. Methods and Protocols, vol. 1256, pp. 305–326. Humana Press, New York, New York, NY (2015). https://doi.org/10. 1007/978-1-4939-2172-0 21 33. Mironenko, Y., Kalinin, K., Kopeliovich, M., Petrushan, M.: Remote photoplethysmography: rarely considered factors. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, pp. 1197–1206 (2020) 34. Mo¸co, A., Verkruysse, W.: Pulse oximetry based on photoplethysmography imaging with red and green light: calibratability and challenges. J. Clin. Monit. Comput. 35(1), 123–133 (2020). https://doi.org/10.1007/s10877-019-00449-y 35. Monfort, A., et al.: Moments in Time Dataset: one million videos for event understanding. IEEE Trans. Pattern Anal. Mach. Intell. 1–8 (2019) 36. Niu, X., Han, H., Shan, S., Chen, X.: VIPL-HR: a multi-modal database for pulse estimation from less-constrained face video. Jawahar, C., Li, H., Mori, G., Schindler, K. (eds.) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol. 11365, pp. 1–16 (2018). doi:https://doi.org/10.1007/9783-030-20873-8 36 37. Niu, X., Shan, S., Han, H., Chen, H.: RhythmNet: end-to-end heart rate estimation from face via spatial-temporal representation. IEEE Trans. Image Process. 29, 2409–2423 (2020) 38. Nowara, E.M., Mcduff, D., Veeraraghavan, A.: A Meta-Analysis of the Impact of Skin Type and Gender on Non-contact Photoplethysmography Measurements. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020) 39. Nowara, E.M., Marks, T.K., Mansour, H., Veeraraghavan, A.: SparsePPG: towards driver monitoring using camera-based vital signs estimation in near-infrared. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1353–135309 (2018) 40. Perepelkina, O., Artemyev, M., Churikova, M., Grinenko, M.: HeartTrack: convolutional neural network for remote video-based heart rate monitoring. In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020) 41. Pilz, C.S., Zaunseder, S., Blazek, V.: Local group invariance for heart rate estimation from face videos in the wild. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1335–13358 (2018)
Towards Collecting Big Data for Remote Photoplethysmography
85
42. Rouast, P.V., Adam, M.T.P., Chiong, R., et al.: Remote heart rate measurement using low-cost RGB face video: a technical literature review. Front. Comput. Sci. 12, 858–872 (2018). https://doi.org/10.1007/s11704-016-6243-6 43. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263015-0816-y 44. Sabokrou, M., Pourreza, M., Li, X., Fathy, M., Zhao, G.: Deep-HR: Fast Heart Rate Estimation from Face Video Under Realistic Conditions. ArXiv:abs/2002.04821 (2020) 45. Sinhal, R., Singh, K., Raghuwanshi, M.M.: An overview of remote photoplethysmography methods for vital sign monitoring. In: Gupta, M., Konar, D., Bhattacharyya, S., Biswas, S. (eds.) Computer Vision and Machine Intelligence in Medical Image Analysis. Advances in Intelligent Systems and Computing, vol. 992, pp. 21–31. Springer, Singapore (2020). https://doi.org/10.1007/978-981-13-8798-2 3 46. Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M.: A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3(1), 42–55 (2012) 47. Song, R., Chen, H., Cheng, J., Li, Y., Liu, C., Chen, X.: PulseGAN: learning to generate realistic pulse waveforms in remote photoplethysmography. Image Video Process. 25(5), 1373–1384 (2020) 48. Song, R., Zhang, S., Cheng, J., Li, C., Chen, X.: New insights on super-high resolution for video-based heart rate estimation with a semi-blind source separation method. Comput. Biol. Med. 116, 103535 (2020). https://doi.org/10.1016/ j.compbiomed.2019.103535 ˇ 49. Spetl´ ık, R., Cech, J.: Visual heart rate estimation with convolutional neural network. In: British Machine Vision Conference (2018) 50. Stricker, R., Steffen, M., Gross, H.: Non-contact video-based pulse rate measurement on a mobile service robot. In: The 23rd IEEE International Symposium on Robot and Human Interactive Communication, pp. 1056–1062 (2014) 51. Sun, Y., Papin, C., Azorin-Peris, V., Kalawsky, R., Greenwald, S., Sijung, H.: Use of ambient light in remote photoplethysmographic systems: comparison between a high-performance camera and a low-cost webcam. J. Biomed. Opt. 17(3), 037005 (2012) 52. Tang, C., Lu, J., Liu, J.: Non-contact heart rate monitoring by combining convolutional neural network skin detection and remote photoplethysmography via a low-cost camera. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, pp. 1390–13906 (2018) 53. Tayfur, I., Afacan, M.A.: Reliability of smartphone measurements of vital parameters: a prospective study using a reference method. Am. J. Emerg. Med. 37(8), 1527–1530 (2019) 54. Tsai, Y.C., Lai, P.W., Huang, P.W., Lin, T.M., Wu, B.F.: Vision-based instant measurement system for driver fatigue monitoring. IEEE Access 8, 67342–67353 (2020) 55. Wang, W., Den Brinker, B., Stuijk, S., De Haan, G.: Algorithmic principles of remote-PPG. IEEE Trans. Biomed. Eng. 64(7), 1479–1491 (2017) 56. Wang, W., Shan, C.: Impact of makeup on remote-PPG monitoring. Biomed. Phys. Eng. Express 6, 035004 (2020) 57. Wang, W., Stuijk, S., De Haan, G.: A novel algorithm for remote photoplethysmography?: spatial subspace rotation. IEEE Trans. Biomed. Eng. 63(9), 1974–1984 (2016)
86
K. Kalinin et al.
58. Wang, Z., Yang, X., Cheng, K.-T.: Accurate face alignment and adaptive patch selection for heart rate estimation from videos under realistic scenarios. PLOS ONE 13, 1–25 (2018) 59. Wedekind, D., et al.: Assessment of blind source separation techniques for videobased cardiac pulse extraction. J. Biomed. Opt. 22(3), 035002 (2017). https://doi. org/10.1117/1.JBO.22.3.035002 60. Wei, B., He, X., Zhang, C., Wu, X.: Non-contact, synchronous dynamic measurement of respiratory rate and heart rate based on dual sensitive regions. BioMed. Eng. Online 16(17), 1–21 (2017). https://doi.org/10.1186/s12938-016-0300-0 61. Woyczyk, A., Rasche, S., Zaunseder, S.: Impact of sympathetic activation in imaging photoplethysmography. In: 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 1697–1705 (2019) 62. Woyczyk, S., Fleischhauer, A., Zaunseder, V.: Skin segmentation using active contours and gaussian mixture models for heart rate detection in videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (2020) 63. Zaunseder, S., Trumpp, A., Wedekind, D., Malberg, H.: Cardiovascular assessment by imaging photoplethysmography - a review. Biomed. Eng. (Biomed. Tech.) 63(06), 617–634 (2018) 64. Zhan, Q., Wang, W., de Haan, G.: Analysis of CNN-based remote-PPG to understand limitations and sensitivities. Biomed. Opt. Express 11(3), 1268–1283 (2020) 65. Zhang, Z., et al.: Multimodal spontaneous emotion corpus for human behavior analysis. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3438–3446 (2016) 66. Zhao, C., Lin, C.-L., Chen, W., Li, Z.: A novel framework for remote photoplethysmography pulse extraction on compressed videos. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1380– 138009 (2018)
Towards Digital Twins Driven Breast Cancer Detection Safa Meraghni1,2(B) , Khaled Benaggoune1,3 , Zeina Al Masry1 , Labib Sadek Terrissa2 , Christine Devalland4 , and Noureddine Zerhouni1
4
1 FEMTO-ST Institute, ENSMM, Besan¸con, France {safa.meraghni,khaled.benaggoune,zeina.almasry, noureddine.zerhouni}@femto-st.fr 2 LINFI Laboratory, University of Biskra, Biskra, Algeria 3 LAP, Batna 2 University, Batna, Algeria [email protected] Service D’anatomie Et Cytologie Pathologiques, Hˆ opital Nord Franche-Comt´e, Trevenans, France [email protected]
Abstract. Digital twins have transformed the industrial world by changing the development phase of a product or the use of equipment. With the digital twin, the object’s evolution data allows us to anticipate and optimize its performance. Healthcare is in the midst of a digital transition towards personalized, predictive, preventive, and participatory medicine. The digital twin is one of the key tools of this change. In this work, DT is proposed for the diagnosis of breast cancer based on breast skin temperature. Research has focused on thermography as a non-invasive scanning solution for breast cancer diagnosis. However, body temperature is influenced by many factors, such as breast anatomy, physiological functions, blood pressure, etc. The proposed DT updates the bio-heat model’s temperature using the data collected by temperature sensors and complementary data from smart devices. Consequently, the proposed DT is personalized using the collected data to reflect the person’s behavior with whom it is connected. Keywords: Digital twin
1
· Breast cancer detection · Thermography
Introduction
Breast cancer has become one of the most terrible experiences in women’s health nowadays. It is the first cause of mortality and the most commonly diagnosed form of cancer among women [21]. Diagnosis is the first and principal step in the treatment of any disease, and so far for cancer. However, despite the amount of research provided to this growing disease, the survival rate is still dependent on detecting the tumor in the earlier stages. Screening is a strategy adopted to identify women at the initial stages of this breast disease. However, all kinds of tests, including mammography, recognized as the gold standard for cancer c The Author(s), under exclusive license to Springer Nature Switzerland AG 2021 K. Arai (Ed.): Intelligent Computing, LNNS 285, pp. 87–99, 2021. https://doi.org/10.1007/978-3-030-80129-8_7
88
S. Meraghni et al.
detection, have limitations. A patient has to undergo, many times, burdensome procedures associated with these techniques, such as radiation side-effects, long duration of the diagnostic procedure, high diagnosis cost [6]. Therefore, research is required to develop a simple and less expensive diagnostic procedure without any side effect. The present work concentrates on the possible application of one such procedure. Due to the limits of mammography and with the aim of increasing the life expectancy of patients, thermography is used to detect early breast cancer [17]. Because of changes in temperature have been identified by thermal imaging for a breast lesion. In addition, the advantages of thermography compared to mammography are a radiation-free technology and a better sensitivity to detect lesions in dense breasts [16]. Nevertheless, this new technique’s application is in the early stages and faces many challenges, especially the generalization of the solution on different patients with heterogeneous anatomies. In medicine, the effectiveness of the medication is not always ensured for all cases. Personalized medicine is the principle of turning health care adapted to the patient-specific physiology; it consists of adapting treatments according to patients’ characteristics and their diseases. It involves anticipating, through a diagnostic test on patients. Personalized medicine aims to better monitor each patient by collecting, measuring, and analyzing their health data to adapt their care cycle to their needs. This innovative approach is based on the digital revolution. The concept of digital twins (DTs) is at the origin of this principle. It originated in the engineering industry but has found a place in medicine. The concept of digital twin (DT) was proposed in 2011 [25]. DTs act as a bridge between the physical and digital worlds by using sensors to collect real-time data on physical objects. Collected data is used to create a digital duplicate of the monitored object [20]. This digital duplicate, combined with other technologies such as cloud computing, artificial intelligence, and machine learning, will understand, analyze, manipulate data, and make decisions. Currently, the digital twin is mainly used in product design and service management, manufacturing, product life forecasting, and real-time monitoring of industrial equipment. Based on the successful applications of DT in the industrial field, it is believed that DT can also play a significant role in the medical field [2]. DTs can accelerate the adoption of connected devices such as smart sensors, smartphones, and smartwatches in the smart healthcare ecosystem to improve data collection capacities and promote artificial intelligence applications for predictive healthcare. DTs help to gather these technologies to provide personalized, proactive, and preventive care in real-time. This work, and to the authors knowledge, presents the first initiative in the field that overlaps DTs with thermal data for breast cancer detection. The Detection based on thermal data is not always accurate; therefore, this overlap is proposed to integrate different data sources from smart devices in the eco-healthcare system. The DT processes and analyzes patient data in real-time and provide an adapted decisions.
DT Driven Breast Cancer
89
The main contributions of this paper are as follows: – DT is proposed for early breast cancer detection, using collected data from smart devices. – Thermal data is combined with other information to provide an accurate detection. – Personalized bio-heat model is integrated with digital twins. The rest of this paper is organized as follows: Sect. 2 discusses the research background and some critical challenges. Section 3 provides the details of our proposed approach, its general architecture, and design. In Sect. 4, the numerical simulation of the bioheat model is explained. Our results are then depicted in Sect. 5, and we discuss and conclude in Sect. 6.
2
Research Background
In this section the research related to DT and breast cancer detection are discussed. 2.1
Digital Twin in Healthcare
In the hospital world, digital twins are increasingly being used, sometimes called “medical twins”. The potential application areas of digital twins in the healthcare sector are numerous, from diagnosis to therapeutics [8]. DT is increasingly being studied for preventive healthcare to raise awareness of people’s health through bio-feedback functions and help them take the right actions using personalized recommendations. Angulo et al. [2] have proposed a DT to monitor the behavior of lung cancer in patients to personalize healthcare about the behavior of this disease on patients. Bagaria et al. [3] have proposed a DT for heart rate and galvanic response monitoring to prevent heart diseases. Based on data collected from different smart devices as accelerometer, GPS, and smartwatch, an AI model estimates the real twin’s physical activity level and recommends different activities to ensure a healthy physical condition. In health treatments, The digital twins can test new experimenting decisions in a simulated “real” environment to see how the treatment results [5]. DT enables simulation with real-world scenarios, which can reduce medical risks and costs and improve diagnosis, treatment, and disease prediction quality. 2.2
Breast Detection Using Thermography
Over the last few decades, the need for effective and cheap diagnostic techniques to screen and diagnose breast cancer has led to the development of various new technologies. Accordingly, studies show that there is a link between breast skin temperature and breast cancer since cancerous tissue temperature is commonly higher than healthy encircling tissues [12]. Thus, thermography has been considered as a promising screening method for breast cancer detection by generating
90
S. Meraghni et al.
images that reveal the heat distribution on the skin surface [28]. The first attempt to use thermography for detecting breast cancer was by Lawson [18]. He observed an increase from 21 ◦ C to 31 ◦ C in the local temperature of the skin surface over the place where the tumor is located when compared to healthy breast tissue. In [26], the authors concluded that thermography shows to be a useful technique for breast cancer detection in young women with a higher breast density level. In fact, numerous studies showed the effect of tumor state on the thermal image [1,23]. In [24], the authors investigated the ability to estimate tumor sizes by analyzing the thermal images. In [9], an estimation methodology is performed to define the breast tumor parameters using the surface temperature profile that may be captured by infrared thermography (see Fig. 1). In [10], the author estimates the size and position of a spherical tumor in a human breast utilizing the temperatures taken on the surface of the breast. Thermography is non-invasive and painless without any exposure to dangerous ionizing radiation. This imaging technique shows to be a potential adjunction tool. Nevertheless, procedures need to be done in the hospital or thermography center by trained medical personnel, and it is still impractical to use it as a personal healthcare device.
Fig. 1. Breast thermography using infrared camera [19]
2.3
Factors Influencing Skin Temperature
The skin’s properties and conditions can be influenced by numerous factors such as the location, skin type, ethnicity, gender, or even lifestyle and body mass index. The blood flow of the skin is highly vital for thermal balance represented by skin temperature. However, this blood flow can be influenced by various parameters, from humans to another. Female hormones such as estradiol and progesterone have essential effects on regulating body temperature and blood pressure [7]. Estradiol generally favors
DT Driven Breast Cancer
91
vasodilatation, heat dissipation, and lower body temperature. In contrast, progesterone promotes less vasodilatation, heat retention, and raising body temperature. During the mid-luteal phase of the menstrual cycle or with the exogenous hormones of oral contraceptives when estradiol and progesterone are simultaneously high, the temperature threshold for sweating and skin cutaneous vasodilation is changed to more elevated temperatures [7]. Besides, changes in thermoregulation during the menstrual cycle and the hot flashes of menopause are influenced by hormones on the neural control of skin blood flow and sweating. During pregnancy, blood pressure levels can decrease, increase, or remain the same from one woman to another. This depends on several parameters, such as sympathetic nerve activity, various combinations of blood volume interactions, vascular stiffness, and additional variables [14]. Thermoregulatory reactions to the rise and fall of body temperature are affected by healthy aging and various age-related diseases such as diabetes and hypertension [11]. Therefore, healthy aging will induce lower heat dissipation responses due to reflex neural, local vascular, and sweat gland mechanisms changes. Abnormal body temperature is a natural indicator of disease. Thermal imaging cameras have been used to obtain correlations between thermal physiology and skin temperature to diagnose breast cancer, diabetic neuropathy, and peripheral vascular disorders. However, the human body’s different physiologies make this diagnosis difficult to be universal due to the variation in the healthy temperature threshold from one human to another. Therefore, customized thermographic detection technology is required to adapt the detection threshold to each person. This solution is part of personalized medicine with continuous monitoring of the person using new information technologies. 2.4
Focus of the Paper
Many works have proposed potential solutions for monitoring body temperature using sensors. These new solutions have opened up a new horizon for diagnostics using portable devices to monitor abnormal temperature changes. Also, intelligent devices used to monitor human performance, and health conditions can collect valuable information from individual patients. This paper proposes a DT for breast cancer using temperature sensor information collected by portable intelligent devices. The DT is first based on a simulated heat transfer model in the human body, based on the anatomy and thermal characteristics of different human breast tissues. Once the DT is connected to its physical user, data will be collected from all temperature sensors to detect abnormalities. As breast anatomy and the tissues’ thermal characteristics may vary from one woman to another, the proposed twins will be updated with the information collected by the temperature sensors and other intelligent devices for blood pressure, age, diseases, etc. The use of complementary data helps the model to adapt quickly to the patient properties and provide an accurate detection.
92
3
S. Meraghni et al.
Proposed Methodology
Decision layer
Update model
Simulated data Trained model
Physical space
Detection
Data feature engineering: cleaning, integration, transformation, reduction.
Complementary data
Physical layer
Data processing layer
Real time data
Personalized module
This section describes our proposed methodology based on the interconnection between the human and its digital twin via several technologies and services. This connection ensures real-time monitoring and control. Our solution plays the role of middleware [4] layer to fill the gap between heterogeneous services and abstract the technical details inside the digital twin and its relevant functionalities. Figure 2 shows the layered architecture of our proposition which consists of four layers described as follow:
Temperature data
Communication protocols: MQTT, HTTPs, AMQP, XMPP
Smart devices
Temperature sensors
Fig. 2. DT for breast cancer detection: physical space is connected with the virtual space to perform detection in real-time. The bio-heat model is used for tumor detection based on temperature data and other information such as blood pressure, external temperature, etc.
DT Driven Breast Cancer
3.1
93
Physical Space
The Internet of Things (IoT), a new Internet revolution, is rapidly gaining area as a priority multidisciplinary research theme in the healthcare sector. With the advent of multiple portable devices and smartphones, the various IoT-based devices are changing and evolving the typical old healthcare system towards a smarter, more personalized system. This is why the current health care system is also referred to as the Personalized Health Care System. Accordingly, technologies such as smartphones, smartwatches, and smart tissues, and others are used in the physical space to collect relevant data about the patient’s health state. 3.2
Physical Layer
In this layer, a set of interconnection functionalities must be provided to ensure data collection from various small devices connected to the human body. The communication technologies generally used in this type of application refer to the M2M stack [15], which allows the use of lightweight protocols and technologies with less influence on human health. Using the green M2M, a multitude of protocols are established in this work, such as HTTPs, HTTP, CoAP, AMQP, XMPP, MQTT, MQTT-sn, etc. We have also included other tools to support the interconnection of legacy or non-Internet devices (Zigbee, BLE, etc.) to our platform through gateways and proxies. 3.3
Data Processing Layer
For early detection of breast cancer, data is essential. Raw data collected from different connected devices are heterogeneous, and several actions are required to transform these data into usable information, ready to be used in algorithms and statistical analysis. Although the data is collected and sorted, there may be significant gaps or erroneous data. The cleaning phase eliminates duplicate, and irrelevant values, applies imputation methods and converts it to the acceptable format. 3.4
Decision Layer
Once the data has been processed into a useful format, a trained model will use it to detect tumors in the breast. There are two phases in this decisionmaking layer: (i) the offline phase: the model is trained on simulated temperature data. More details on the offline phase are presented in Sect. 4. (ii) the online phase: which uses the trained model and real-time data for tumor detection. A personalized module is proposed to match new data with historical data. Also, the predicted decisions made previously with the actual states from the physical space are linked to the model’s update and converge to the actual and specific patient parameters.
94
4
S. Meraghni et al.
Numerical Simulations of Dynamic Thermography
Multiple mechanisms are involved in living tissue’s heat transfer, such as metabolism, heat conduction, and heat convection by blood perfusion. The blood flow in the biological systems provides thermal stability and homogenous temperature distribution of the whole body. Hence, any change in the temperature distribution of the body can be intrinsically connected to an abnormality of the body process, which can be the first sign of tumoral tissue. Therefore, tumors are clumps of cells that multiply in an uncontrolled manner, and New blood vessels known as angiogenesis are formed to transport nutrients to these cells. Therefore, the blood perfusion rate and metabolic heat generation rate of the tumor are higher than healthy tissues. The increased heat generation at the tumor is diffused to the encircling tissue and can be seen as a temperature spike at the surface of the breast [13]. The first step in this work consists of simulating the bioheat of the heat transfer in healthy human breast and breast with abnormality using the Pennes bioheat equation, one of the most used transfer heat model in a biological tissue. The Pennes bioheat equation Eq. (1) was proposed by Pennes in 1948 [22] to describes the effect of the metabolism and blood perfusion on the energy balance within the tissue. Therefore, the Bio-heat transfer processes in human tissues are affected by the rate of blood perfusion through the vascular network. The Pennes equation is easy to implement because it does not need information about the vasculature within tissues. What makes it widely used [10]. The dynamic bioheat transfer process presented by Pennes is described in Eq. (1) as follows: ∂T (1) = (k T ) + ρb ωb cb (Ta − T ) + qm ρc ∂t where k, ρ, c, qm correspond to thermal conductivity, density, specific heat of tissue and metabolic heat generation rate, respectively. ρb , cb , ωb stand for blood density, blood specific heat and he blood perfusion rate, respectively. Ta and T are respectively the arterial blood temperature and the tissue temperature. The values of these properties are listed in Table 1. A simplified breast model is studied to facilitate the heat transfer understanding in the breast. Therefore, the human breast is modeled as a 3D hemispherical domain with 6.5 cm of radius (Fig. 3), and the entire breast is assumed to be fat. A 1 cm diameter spherical tumor has been embedded in the breast model to study the effect of the presence of a tumour on heat distribution. Table 1. The Properties of Breast Tissue [27] Tissue Density (kg/m)
Specific heat C(kg/m)
Perfusion rate C(kg/m)
Metabolic heat generation Q
Thermal conductivity C(kg/m)
Fat
930
2674
0.00008
400
0.21
Tumor 1050
3852
0.0063
5000
0.48
DT Driven Breast Cancer
95
Fig. 3. Human breast model modeled as a 3D hemispherical domain, with a spherical tumour of 1 cm of diameter
In order to enhance and maximize the thermal contrast between tumor and healthy tissue, a cooling thermo-stimulation is applied to the skin. Thus, many works supposed that the boundary condition on the breast skin surface is at 0 ◦ C or 10 ◦ C; however, this temperature can not be used in real cases. The 17 ◦ C is the minimum temperature that the skin can expose to. Therefore, in our study, we suppose the temperature of the air around the skin is 17 ◦ C. The bottom of the model is assumed to be at 37 ◦ C, which is constant core body temperature.
5
Results and Discussion
Figure 4 shows the impact of blood flow on temperature regulation. The breast temperature is 33 ◦ C and the outside temperature is 17 ◦ C. The blood perfusion rate (ωb ) has been modified to study the impact of blood perfusion on the perceived temperature on the skin. The higher the perfusion rate, the faster the temperature stabilization, and the lower the skin temperature. The temperature difference between the three configurations is between 3 ◦ C and 8 ◦ C. This difference indicates that no-thermal factors can affect the skin’s temperature as the blood. The heat propagation in the human body is influenced by many factors, not just its temperature and outside temperature. The anatomy of the human body nevertheless influences it. The external temperature has been varied to study the impact of the place and season on a healthy breast’s skin temperature. Figure 5 shows the temperature of a healthy breast at different outdoor temperatures. The skin temperature has been adjusted differently according to the outdoor temperature for each studied body anatomy.
96
S. Meraghni et al.
Fig. 4. Healthy breast with different blood perfusion rate
Fig. 5. Healthy breast in different external temperature. Results show the impact of external temperature on the generated temperature in the breast.
This study shows the importance of having complementary data about the location, the weather, and the time to update the temperature model correctly and accurately interpret the temperature difference results according to all the data collected from different smart devices. The skin temperature distribution is symmetrical. Thus, for breast diagnostics by thermography, the temperature distribution of both breasts is compared
DT Driven Breast Cancer
97
Fig. 6. Skin Temperature difference between health breast and breast with tumor for three different levels of fat in the breast. Results show that the threshold value to detect tumor is relative to the human body’s anatomy and environmental factors.
to confirm whether an abnormality has altered one breast’s skin temperature. Therefore, a tumor is detected if the temperature difference is above a threshold. In this study, a tumor 1 cm in diameter is introduced into the breast. For each anatomy, the skin temperature is measured before and after tumor introduction. Then, the temperature difference is calculated to study the detection threshold. Figure 6 shows a difference of 4.49 ◦ C, 1.99 ◦ C, and 0.31 ◦ C between healthy and tumor breast. These results confirm that the threshold is relative to the human body’s anatomy and environmental factors; therefore, the threshold value cannot be generalized for tumor detection.
6
Conclusion
The proposed DT is intended to provide a personalized tool for diagnosing breast cancer using temperature data. The proposed DT is developed in an offline phase based on the bio-heat model. In the online phase, the DT is connected to the user, and then the behavior of the DT is updated based on temperature data. However, results show that the temperature data cannot be sufficient to conclude an abnormality. Many other factors can influence the body temperature as outdoor temperature, blood pressure, anatomy, and body mass index. To overcome these differences, the proposed DT is updated using the temperature data and the other data collected from the smart connected devices, allowing the DT to be accurate to mirror the body temperature behaviour. DT is a significant advance in the medical field as it will allow the transition to personalized medicine. However, DT’s development still faces some challenges, such as data heterogeneity, security and confidentiality, certification, and regulation. Future work will focus on the integration of the proposed DT, taking into account these challenges. Moreover, as the proposed method has been validated
98
S. Meraghni et al.
on simulated data, the next step will be to use the proposed DT with real data from patients.
Towards the Localisation of Lesions in Diabetic Retinopathy

Samuel Ofosu Mensah1,2(B), Bubacarr Bah1,2, and Willie Brink2

1 African Institute for Mathematical Sciences, Cape Town, South Africa
[email protected]
2 Department of Mathematical Sciences, Stellenbosch University, Stellenbosch, South Africa
Abstract. Convolutional Neural Networks (CNNs) have successfully been used to classify diabetic retinopathy (DR) fundus images in recent times. However, deeper representations in CNNs may capture higher-level semantics at the expense of spatial resolution. To make predictions usable for ophthalmologists, we use a post-attention technique called Gradient-weighted Class Activation Mapping (Grad-CAM) on the penultimate layer of deep learning models to produce coarse localisation maps on DR fundus images. This helps identify discriminative regions in the images, consequently providing evidence for ophthalmologists to make a diagnosis and potentially save lives through early diagnosis. Specifically, this study uses pre-trained weights from four state-of-the-art deep learning models to produce and compare localisation maps of DR fundus images. The models used are VGG16, ResNet50, InceptionV3, and InceptionResNetV2. We find that InceptionV3 achieves the best performance with a test classification accuracy of 96.07%, and localises lesions better and faster than the other models.

Keywords: Deep learning · Grad-CAM · Diabetic retinopathy

1 Introduction
There has been tremendous success in the field of deep learning, especially in the computer vision domain. This success may be attributed to the invention of Convolutional Neural Networks (CNNs). It has since been extended to specialised fields such as medicine [8]. With the help of transfer learning [15], pre-trained weights of state-of-the-art models can be used for the analysis of medical images. In this study, for example, we use deep learning backed with transferred weights of pre-trained models to classify Diabetic Retinopathy (DR) fundus images [5,8]. Even though these models are able to achieve good performance, it can be difficult to understand the reasoning behind their discrimination processes. This is due in part to the fact that deep learning models have a nonlinear multilayer structure [4]. Recently, DR classification has received a lot of attention as it has been useful in the ophthalmology domain [5,8]. However, ophthalmologists may not be able
to evaluate the true performance of the model [2,4]. We attempt to mitigate this issue by providing additional evidence of model performance [4] with the help of a post-attention technique called Gradient-weighted Class Activation Mapping (Grad-CAM) [11]. This technique can be seen as a Computer-Aided Diagnosis (CAD) tool integrated to increase the speed of diabetic retinopathy diagnosis. The main contribution of this paper is to use Grad-CAM to generate coarse localisation maps from higher-level semantics of a CNN model, consequently aiding and speeding up the diagnosis process of diabetic retinopathy. The rest of this paper is organized as follows. In Sect. 2, we introduce details of the data and techniques used for the study. Section 3 presents the evaluation of the models used in the study. Finally, we present conclusions and future work in Sect. 4.
2 Data and Methodology
In this study, we used a publicly available labelled DR fundus image dataset from the Asia Pacific Tele-Ophthalmology Society (APTOS, https://asiateleophth.org/) to train, validate and test the model. The dataset has 3662 DR fundus images categorised into a five-class scale of increasing severity, namely normal, mild, moderate, severe and proliferative [1,5]. Figure 1 shows the distribution of classes in the dataset.
Fig. 1. The distribution of classes for APTOS dataset.
The plot shows that the data is imbalanced as the normal class has the highest number of instances. We resolve the class imbalance problem by incorporating weights of the classes into the loss function. In this case, the majority class was
given a small weight while the minority classes were given larger weights. Intuitively, during training the model places more emphasis on the minority classes, giving higher penalties to misclassifications of the minority classes, and less emphasis on the majority class [3,7]. For image pre-processing, we used both vertical and horizontal flips for training data augmentation as well as standard normalisation of the data. We also resized all of the images to either 224 × 224 or 299 × 299 depending on the pre-trained model we use. An important pre-processing technique used in this study is Contrast Limited Adaptive Histogram Equalization (CLAHE) [17], because DR fundus images often suffer from low contrast issues [10]. Figure 2 shows a comparison of the intensity distribution of a DR fundus image before and after applying CLAHE.
Fig. 2. A DR fundus image and its corresponding intensity distribution before and after applying CLAHE: (a) before CLAHE, (b) intensity distribution before CLAHE, (c) after CLAHE, (d) intensity distribution after CLAHE.
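A minimal sketch of the two preprocessing steps described above (class weighting for the imbalanced labels and CLAHE contrast enhancement), assuming OpenCV and scikit-learn are available; the function names and parameter values are illustrative, not taken from the authors' code.

import cv2
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

def clahe_enhance(bgr_image, clip_limit=2.0, tile_grid_size=(8, 8)):
    # Apply CLAHE to the lightness channel of a fundus image (illustrative parameters).
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    return cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)

def make_class_weights(labels):
    # Balanced class weights: the majority class gets a small weight, minority classes larger ones.
    classes = np.unique(labels)
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
    return dict(zip(classes, weights))

# Hypothetical usage: enhance each image with clahe_enhance() before resizing to 224x224 or
# 299x299, and pass class_weight=make_class_weights(y_train) to Keras model.fit().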
It can be seen that the DR fundus image before CLAHE has more Gaussian (white) noise [10] making it potentially difficult to analyse. The Gaussian noise is reduced after adjusting the DR fundus image with CLAHE and its corresponding distribution is more normalised. This step is crucial because it can affect the sensitivity and specificity of the model [10]. The pre-processed images are passed to the model which in this case has a CNN backbone. We consider and compare four CNN backbones in this study, namely VGG16 [12], ResNet50 [6], InceptionV3 [13] and InceptionResNetV2 [14]. The idea is to pass images through a pre-trained model and extract its feature maps at the last layer for classification. It should be noted that as images pass through a deep convolutional model, there exist trade-offs between losing spatial resolution and learning higher-level semantics, especially in the last convolutional layers. Neurons in those layers look for semantic class-specific information in the image, useful for discriminative purposes. Grad-CAM thus helps to interpret and explain individual components of a model [11,16]. Grad-CAM is a technique used to produce visual explanations from decisions made by a CNN. Specifically, it uses gradients of a class concept in the
penultimate layer to produce a coarse localisation map which helps to identify discriminative regions in an image. For every class c, there are k feature maps. To generate localisation maps, we first compute neuron importance weights α_k^c by global-average-pooling the gradients of the target y^c with respect to the feature maps A^k. It is important to note that A^k is a spatial feature map, hence it has width and height dimensions indexed by i and j [11]. The neuron importance weight α_k^c is given by

\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k},    (1)

i.e. a global average pooling of the gradients obtained via backpropagation. Finally, we pass a linear combination of the neuron importance weights α_k^c and feature maps A^k through a ReLU function to generate the coarse localisation map. The generated map can then be overlaid on top of the input image to identify the region(s) of interest. Grad-CAM is given by

L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big( \sum_{k} \alpha_k^c A^k \Big).    (2)
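A minimal sketch of Eqs. (1) and (2) in TensorFlow/Keras is given below, assuming a pre-trained classifier whose last convolutional layer is known by name; the layer name and input handling are illustrative placeholders rather than the exact configuration used in this study.

import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, last_conv_layer="mixed10"):
    # Build a model that exposes both the last convolutional feature maps A^k and the predictions.
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(last_conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis, ...])  # A^k and class scores
        y_c = preds[:, class_index]                            # target score y^c
    grads = tape.gradient(y_c, conv_maps)                      # dy^c / dA^k_ij via backprop
    alpha = tf.reduce_mean(grads, axis=(1, 2))                 # Eq. (1): global average pooling
    cam = tf.nn.relu(tf.reduce_sum(alpha[:, None, None, :] * conv_maps, axis=-1))  # Eq. (2)
    cam = cam[0].numpy()
    return cam / (cam.max() + 1e-8)  # normalise before resizing and overlaying on the input image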
Aside from using the feature maps to generate localisation maps, they are also used for discriminative purposes. In this study we replace the final layer with three successive layers, starting with a Global Average Pooling (GAP) layer, followed by a dropout layer (with 50% probability) and finally a dense layer. In summary, we feed a CNN model with a preprocessed DR fundus image. We then generate localisation maps using feature maps extracted from the CNN model. In addition, the feature maps are used to classify the DR fundus images in an image-level manner. Figure 3 shows a diagram of this approach.
Fig. 3. Depicting the setup of generating a localisation map and classifying a DR fundus image.
3 Results
In this section we evaluate the classification performance of the models and their ability to generate localisation maps for DR fundus images. Models are trained using Keras with a TensorFlow backend on an NVIDIA Tesla V100 GPU for 15 epochs. We use the sigmoid activation function for the last layer and cross entropy as the loss function for training. In this context, we use accuracy and Area Under Curve (AUC) as performance metrics over the test set. For every image we first predict a class and then generate the localisation map over the image. Table 1 shows the performance of the classification task.

Table 1. Performance table for the models (AUC reported per class)

Model              Accuracy (%)   Normal   Mild   Moderate   Severe   Proliferative
VGG16              95.31          0.97     0.64   0.77       0.62     0.78
ResNet50           94.56          0.97     0.72   0.82       0.71     0.75
InceptionV3        96.07          0.97     0.67   0.84       0.62     0.67
InceptionResNetV2  94.39          0.96     0.77   0.81       0.68     0.69
We observe that InceptionV3 performed better than the other models, as it had the highest accuracy of 96.07%. It is followed by VGG16, InceptionResNetV2 and ResNet50 in that order; however, their performances are close to one another. Finally, we randomly select some input images (see Fig. 4a) for demonstrative purposes. We pass the selected images through the InceptionV3 model (since it had the highest accuracy) to generate localisation maps and overlay the generated maps on the input images, as shown in Fig. 4. We observe that the model generates a blurry image in Fig. 4b. The blurry effect can be attributed to the typical loss of spatial resolution as images pass through deep convolutional models. The blurry image in this case is the result of Grad-CAM. It may not be as useful until we map it onto the input images. We see in Fig. 4c that Grad-CAM has highlighted regions of interest on the input images after mapping.
Fig. 4. Generated Grad-CAM for higher-level semantics and overlaid Grad-CAM on randomly selected input images: (a) input, (b) Grad-CAM, (c) overlaid Grad-CAM.
4 Conclusions
In this work we presented a technique which identifies regions of interest in DR fundus images and produces visual explanations for models. This could aid ophthalmologists in understanding the reasoning behind a model's discriminative process and speed up diagnosis. We observe high performance for classification as well as potentially useful localisation maps. We also observed that the performances of the different models are very close to each other. In future, we intend to use in-training attention mechanisms such as stand-alone self-attention [9] to classify and generate localisation maps. This is because in-training attention mechanisms overcome the limitation of losing spatial resolution. Acknowledgement. We would like to thank the German Academic Exchange Service (DAAD) for kindly offering financial support for this research. Also, we thank the
Centre for High Performance Computing (CHPC) for providing us with computing resources for this research.
References 1. American Academy of Ophthalmology: International Clinical Diabetic Retinopathy Disease Severity Scale Detailed Table (2002) 2. Beede, E., et al.: A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–12 (2020) 3. Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 106, 249–259 (2018) 4. Gondal, W.M., Köhler, J.M., Grzeszick, R., Fink, G.A., Hirsch, M.: Weakly-supervised localization of diabetic retinopathy lesions in retinal fundus images. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 2069–2073 (2017) 5. Gulshan, V., et al.: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. J. Am. Med. Assoc. (JAMA) 316(22), 2402–2410 (2016) 6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 7. Johnson, J.M., Khoshgoftaar, T.M.: Survey on deep learning with class imbalance. J. Big Data 6(1), 27 (2019) 8. Raghu, M., Zhang, C., Kleinberg, J., Bengio, S.: Transfusion: understanding transfer learning for medical imaging. In: Advances in Neural Information Processing Systems, pp. 3347–3357 (2019) 9. Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J.: Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909 (2019) 10. Sahu, S., Singh, A.K., Ghrera, S.P., Elhoseny, M.: An approach for de-noising and contrast enhancement of retinal fundus image using CLAHE. Opt. Laser Technol. 110, 87–98 (2019) 11. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017) 12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014) 13. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015) 14. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261 (2016) 15. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning. In: Kurková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) International Conference on Artificial Neural Networks. ICANN 2018. Lecture Notes in Computer Science, vol. 11141, pp. 270–279. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01424-7_27
16. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929 (2016) 17. Zuiderveld, K.: Contrast limited adaptive histogram equalization. In: Heckbert, P.S. (ed.) Graphics Gems, pp. 474–485. Academic Press, Boston (1994)
Framework for a DLT Based COVID-19 Passport

Sarang Chaudhari1, Michael Clear2, Philip Bradish2, and Hitesh Tewari2(B)

1 Indian Institute of Technology - Delhi, Delhi, India
2 Trinity College Dublin, Dublin, Ireland
[email protected]
Abstract. Uniquely identifying individuals across the various networks they interact with on a daily basis remains a challenge for the digital world that we live in, and therefore the development of secure and efficient privacy preserving identity mechanisms has become an important field of research. In addition, the popularity of decentralised decision making networks such as Bitcoin has seen a huge interest in making use of distributed ledger technology to store and securely disseminate end user identity credentials. In this paper we describe a mechanism that allows one to store the COVID-19 vaccination details of individuals on a publicly readable, decentralised, immutable blockchain, and makes use of a two-factor authentication system that employs biometric cryptographic hashing techniques to generate a unique identifier for each user. Our main contribution is the employment of a provably secure input-hiding, locality-sensitive hashing algorithm over an iris extraction technique, that can be used to authenticate users and anonymously locate vaccination records on the blockchain, without leaking any personally identifiable information to the blockchain.
Keywords: Locality-sensitive hash · Biometric hash · Blockchain · Vaccination passport

1 Introduction
Immunization is one of modern medicine's greatest success stories. It is one of the most cost-effective public health interventions to date, averting an estimated 2 to 3 million deaths every year. An additional 1.5 million deaths could be prevented if global vaccination coverage improves [13]. The current COVID-19 pandemic which has resulted in millions of infections worldwide [3] has brought into sharp focus the urgent need for a "passport" like instrument, which can be used to easily identify a user's vaccination record, travel history etc. as they traverse the globe. However, such instruments have the potential to discriminate or create
bias against citizens [10] if they are not designed with the aim of protecting the user's identity and/or any personal information stored about them on the system. Given the large number of potential users of such a system and the involvement of many organizations in different jurisdictions, we need to design a system that is easy to sign up to for end users, and that can be rolled out at a rapid rate. The use of hardware devices such as smart cards or mobile phones for storing such data is going to be financially prohibitive for many users, especially those in developing countries. Past experience has shown that such "hardware tokens" are sometimes prone to design flaws that only come to light once a large number of them are in circulation. Such flaws usually require remedial action in terms of software or hardware updates, which can prove to be very disruptive. An alternative to the above dilemma is an online passport mechanism. An obvious choice for the implementation of such a system is a blockchain, which provides a "decentralized immutable ledger" that can be configured such that it can only be written to by authorised entities (i.e. there is no requirement for a hard computation such as proof-of-work (PoW) to be carried out for monetary reward), but can be queried by anyone. However, one of the main concerns for such a system is: how does one preserve the privacy of users' data on a public blockchain while providing a robust mechanism to link users to their data records securely? In other words, one of the key requirements is to avoid having any personally identifiable information (PII) belonging to users stored on the blockchain. In the subsequent sections we describe the key components of our system and the motivation that led us to use them. The three major components are: extraction of iris templates, a hashing mechanism to store them securely, and a blockchain. Finally, we present a formal description of our framework which uses the aforementioned components as building blocks. First, however, we briefly discuss some related work and then some preliminaries with the definitions and notation used in the paper.

(This publication has emanated from research conducted with the financial support of Science Foundation Ireland under Grant Number 13/RC/2094 (Lero).)

1.1 Related Work
There has been considerable work on biometric cryptosystems and cancellable biometrics, which aims to protect biometric data when stored for the purpose of authentication, cf. [11]. Biometric hashing is one such approach that can achieve the desired property of irreversibility, although without salting it does not achieve unlinkability. Research in biometric hashing for generating the same hash for different biometric templates from the same user is at an early stage, and existing work does not provide strong security assurances. Locality-sensitive hashing is the approach we explore in this paper, and it has been applied to biometrics in existing work; for example, a recent paper by Dang et al. [4] applies a variant of SimHash, a hash function we use in this paper, to face templates. However, the technique of applying locality-sensitive hashing to a biometric template has not been employed, to the best of our knowledge, in a system such as ours.
2 Preliminaries

2.1 Notation
A quantity is said to be negligible with respect to some parameter λ, written negl(λ), if it is asymptotically bounded from above by the reciprocal of all polynomials in λ. For a probability distribution D, we denote by x ←$ D the fact that x is sampled according to D. We overload the notation for a set S, i.e. y ←$ S denotes that y is sampled uniformly from S. Let D0 and D1 be distributions. We denote by D_0 \approx_C D_1 and D_0 \approx_S D_1 the facts that D0 and D1 are computationally indistinguishable and statistically indistinguishable, respectively. We use the notation [k] for an integer k to denote the set {1, . . . , k}. Vectors are written in lowercase boldface letters. The abbreviation PPT stands for probabilistic polynomial time.

2.2 Entropy
The entropy of a random variable is the average "information" conveyed by the variable's possible outcomes. A formal definition is as follows.

Definition 1. The entropy H(X) of a discrete random variable X which takes on the values x_1, . . . , x_n with respective probabilities Pr[X = x_1], . . . , Pr[X = x_n] is defined as

H(X) := -\sum_{i=1}^{n} \Pr[X = x_i] \log \Pr[X = x_i].

In this paper, the logarithm is taken to be base 2, and therefore we measure the amount of entropy in bits.
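As a small illustrative helper (not part of the paper's framework), the empirical entropy of a sample in bits can be computed directly from Definition 1:

import math
from collections import Counter

def entropy_bits(samples):
    # Empirical Shannon entropy H(X) in bits, per Definition 1.
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# e.g. entropy_bits("aabb") == 1.0, i.e. one fair bit of uncertainty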
3 Iris Template Extraction
Iris biometrics is considered one of the most reliable techniques for implementing identification systems. For the ID of our system (discussed in the overview of our framework in Sect. 6), we needed an algorithm that can provide us with consistent iris templates, which not only have low intra-class variability but also show high inter-class variability. This requirement is essential because we would expect the iris templates for the same subject to be similar, as these would then be hashed using the technique described in Sect. 4. Iris-based biometric techniques have received considerable attention in the last decade. One of the most successful techniques was put forward by John Daugman [5], but most of the current best-in-class techniques are patented and hence unavailable for open-source use. For the purposes of this paper, we have used the work of
Fig. 1. Masek’s iris template extraction algorithm
Libor Masek [7] which is an open-source implementation of a reasonably reliable iris recognition technique. Users can always opt for other commercial biometric solutions when trying to deploy our work independently. Masek’s technique works on grey-scale eye images, which are processed in order to extract the binary template. First the segmentation algorithm, based on a Hough Transform is used to localise the iris and pupil regions and also isolate the eyelid, eyelash and reflections as shown in Fig. 1a and 1b. The segmented iris region is then normalised i.e., unwrapped into a rectangular block of constant polar dimensions as shown in Fig. 1c. The iris features are extracted from the normalised image by one-dimensional Log-Gabor filters to produce a bit-wise iris template and mask as shown in Fig. 1d. We denote the complete algorithm by Iris.ExtractFeatureVector which takes a scanned image as input and outputs a binary feature vector fv ∈ {0, 1}n (this algorithm is called upon by our framework in Sect. 6). This data is then used for matching, where the Hamming distance is used as the matching metric. We have used CASIA-Iris-Interval database [2] in Fig. 1 and for some preliminary testing. Table 1 shows the performance of Masek’s technique as reported by him in his original thesis [8]. This algorithm performs quite well for a threshold of 0.4 where the false acceptance rate (FAR) is 0.005 and the false rejection rate (FRR) is 0. These values are used when we present our results to compare the
Table 1. FAR and FRR for the CASIA-a data set

Threshold   FAR      FRR
0.20        0.000    74.046
0.25        0.000    45.802
0.30        0.000    25.191
0.35        0.000    4.580
0.40        0.005    0.000
0.45        7.599    0.000
0.50        99.499   0.000
performance of our technique when a biometric template is first hashed and then the Hamming distance is measured to calculate the FAR and FRR, as opposed to directly measuring the Hamming distances between the original biometric templates. One would expect the performance of our approach to improve as other methods become more efficient at extracting consistent biometric templates. As aforementioned, we rely on the open-source MATLAB code by Libor Masek. For each input image, the algorithm produces a binary template which contains the iris information, and a corresponding noise mask which corresponds to corrupt areas within the iris pattern and marks bits in the template as corrupt. The extracted iris templates and their corresponding masks are 20 × 480 binary matrices each. In the original work, only those bits in the iris pattern that correspond to '0' bits in the noise masks of both iris patterns were used in the calculation of the Hamming distance. A combined mask is calculated, both templates are masked with it, and finally the algorithm calculates the bitwise XOR, i.e. the distance between the masked templates. The steps given below provide an overview of the algorithm used by Masek:

Extraction:
  (template1, mask1) = createiristemplate(image1)
  (template2, mask2) = createiristemplate(image2)
Matching:
  c_mask = mask1 ∧ mask2
  masked_template1 = template1 ∧ (¬ c_mask)
  masked_template2 = template2 ∧ (¬ c_mask)
  distance = masked_template1 ⊕ masked_template2

There are two major issues that we need to deal with before being able to use these templates in our system, specifically template masking and conversion to a linear vector.
3.1 Template Masking
The above matching technique requires two pairs of iris patterns and their corresponding masks to calculate the Hamming distance. However, for the application we are targeting, at any time during verification the system would have to match the extracted template of an individual (i.e. template and mask) against a hashed template stored on the blockchain. This means that we cannot incorporate the above matching algorithm into our system. We have two choices to mitigate this problem and obtain the masked template for the remaining steps:

1. We can calculate the masked template independently for each sample, i.e.
   masked_template = template ∧ (¬ mask)
   An underlying assumption for this method is that the masks for an individual would be approximately the same in every sample. This assumption is not too far-fetched, as was clear from the preliminary analysis of our database. We will refer to these as type1 templates.

2. We can maintain a global mask, defined as
   global_mask = mask_1 ∧ . . . ∧ mask_l
   for all extracted mask_i of all image_i belonging to the training database. At the time of verification, we generate the masked template as
   masked_template = template ∧ (¬ global_mask)
   This method has some added difficulty in finding the global mask, and it also discards more data from the iris patterns compared to directly using the respective mask_i for each template_i, but it helps maintain consistency among the masked templates. We will refer to these as type2 templates.

In a follow-up to this paper, we will use and provide results for templates of both types, based on experiments we are conducting at the time of writing.

3.2 Conversion to Linear Vector
For our use case, we need a one-dimensional input stream which can be fed into the hashing algorithm discussed in the subsequent sections. For converting the masked template matrices into linear feature vectors, we have two naive choices: concatenating either the row vectors or the column vectors. Before deciding on the type of conversion, let us look at an important factor, namely rotational inconsistencies in the iris templates. Rotational inconsistencies are introduced due to rotations of the camera, head tilts and rotations of the eye within the eye socket. The normalisation process does not compensate for these. In order to account for rotational inconsistencies, when the Hamming distance of two templates is calculated, one template is
shifted left and right bit-wise, and a number of Hamming distance values are calculated from successive shifts. This bit-wise shifting in the horizontal direction corresponds to the rotation of the original iris region. This method was suggested by Daugman [5], and corrects misalignment in the normalised iris pattern caused by rotational differences during imaging. From the calculated Hamming distance values, only the lowest is taken, since this corresponds to the best match between two templates. Due to this, column-wise conversion seems like the most logical choice as this would allow us to easily rotate the binary linear feature vectors. Shifting the linear vector by 20 bits will correspond to shifting the iris template once (recall that the dimension of iris templates is 20 × 480). 3.3
Wrap-Up
Putting it all together, we define the steps of our algorithm Iris.ExtractFeatureVector that we call upon later. – Iris.ExtractFeatureVector(image): • (template, mask) = createiristemplate(image) where createiristemplate is Masek’s open source algorithm. • Obtain masked template (either type1 or type2 as defined in Sect. 3.1). • Convert masked template to linear vector as in Sect. 3.2. • Output binary linear vector fv ∈ {0, 1}n Note that the parameter n is a global system parameter measuring the length of the binary feature vectors outputted by Iris.ExtractFeatureVector.
4 Locality-Sensitive Hashing
To preserve the privacy of individuals on the blockchain, the biometric data has to be encrypted before being written to the ledger. Hashing is a good alternative to achieve this, but techniques such as SHA-256 and SHA-3 cannot be used, since the biometric templates that we extracted above can show differences across various scans for the same individual; using those hash functions would therefore produce completely different hashes. We seek a hash function that generates "similar" hashes for similar biometric templates. This prompts us to explore Locality-Sensitive Hashing (LSH), which has exactly this property. Various LSH techniques have been researched to identify whether files (i.e. byte streams) are similar based on their hashes. TLSH is a well-known LSH function that exhibits high performance and matching accuracy but does not provide a sufficient degree of security for our application. Below we assess another type of LSH function, which does not have the same runtime performance as TLSH but, as we shall see, exhibits provable security for our application and is therefore a good choice for adoption in our framework.
4.1 Input Hiding
In the cryptographic definition of one-way functions, it is required that it is hard to find any preimage of the function. However, we can relax our requirements for many applications, because it does not matter if, for example, a random preimage can be computed, as long as it is hard to learn information about the specific preimage that was used to compute the hash. In this section, we introduce a property that captures this idea, a notion we call input hiding. Input hiding means that if we choose some preimage x and give the hash h = H(x) to an adversary, it is either computationally hard or information-theoretically impossible for the adversary to learn x or any partial information about x. This is captured in the following formal definition.

Definition 2. A hash function family H with domain X := {0, 1}^n and range Y := {0, 1}^m is said to be input hiding if for all randomly chosen hash functions H ←$ H, all i ∈ [n], all randomly chosen inputs x ←$ X and all PPT adversaries A it holds that

\left| \Pr[x_i = 0 \wedge A(i, H(x)) \to 1] - \Pr[x_i = 1 \wedge A(i, H(x)) \to 1] \right| \le \mathrm{negl}(\lambda)

where λ is the security parameter.

4.2 Our Variant of SimHash
Random projection hashing, proposed by Charikar [1], preserves the cosine distance between two vectors in the output of the hash, such that two hashes are probabilistically similar depending on the cosine distance between the two preimage vectors. This hash function is called SimHash. We describe a slight variant of SimHash here, which we call S3Hash. In our variant, the random vectors that are used are sampled from the finite field F3 = {−1, 0, 1}. Suppose we choose a hash length of m bits. Now for our purposes, the input vectors to the hash are binary vectors in {0, 1}^n for some n. First we choose m random vectors r_i ←$ {−1, 0, 1}^n for i ∈ {1, . . . , m}. Let R = {r_i}_{i∈{1,...,m}} be the set of these random vectors. The hash function S3Hash_R : {0, 1}^n → {0, 1}^m is thus defined as

\mathrm{S3Hash}_R(x) = (\mathrm{sgn}(\langle x, r_1 \rangle), \ldots, \mathrm{sgn}(\langle x, r_m \rangle))    (1)

where sgn : Z → {0, 1} returns 0 if its integer argument is negative and returns 1 otherwise. Note that ⟨·, ·⟩ denotes the inner product between the two specified vectors. Let x_1, x_2 ∈ {0, 1}^n be two input vectors. It holds for all i ∈ {1, . . . , m} that

\Pr[h_i^{(1)} = h_i^{(2)}] = 1 - \frac{\theta(x_1, x_2)}{\pi}

where h^{(1)} = S3Hash_R(x_1), h^{(2)} = S3Hash_R(x_2) and θ(x_1, x_2) is the angle between x_1 and x_2. Therefore the similarity of the inputs is preserved in the similarity of the hashes. An important question is: is this hash function suitable for our application? The answer is in the affirmative, because it can be proved that the function
information-theoretically obeys a property we call input hiding, which we defined in Sect. 4.1. We recall that this property means that if we choose some binary vector x ∈ {0, 1}^n and give the hash h = S3Hash_R(x) to an adversary, it is either computationally hard or information-theoretically impossible for the adversary to learn x or any partial information about x. This property is sufficient in our application, since we only have to ensure that no information is leaked about the user's iris template. We now prove that our variant locality-sensitive hash function S3Hash is information-theoretically input hiding.

Theorem 1. Let X denote the random variable corresponding to the domain of the hash function. If H(X) ≥ m + λ then S3Hash is information-theoretically input hiding, where λ is the security parameter and H(X) is the entropy of X.

Proof. The random vectors in R can be thought of as vectors of coefficients corresponding to a set of m linear equations in n unknowns on the left hand side, and on the right hand side we have the m elements, one for each equation, which are components of the hash, i.e. (h_1, . . . , h_m). Now the inner product is evaluated over the integers and the sgn function maps an integer to an element of {0, 1} depending on its sign. The random vectors are chosen to be ternary. Suppose we choose a finite field F_p where p ≥ 2(m + 1) is a prime. Since there will be no overflow when evaluating the inner product in this field, a solution in this field is also a solution over the integers. We are interested only in the binary solutions. Because m < n, the system is underdetermined. Since there are n − m degrees of freedom in a solution, it follows that there are 2^{n−m} binary solutions and each one is equally likely. Now let r denote the redundancy of the input space, i.e. r = n − H(X). The number of these 2^{n−m} solutions that are valid inputs is 2^{n−m−r}. If 2^{n−m−r} > 2^λ, then the probability of an adversary choosing the "correct" preimage is negligible in the security parameter λ. For this condition to hold, it is required that n − m − r > λ (recall that r = n − H(X)), which follows if H(X) ≥ m + λ as hypothesized in the statement of the theorem. It follows that information-theoretically an unbounded adversary has a negligible advantage in the input hiding definition.

Our initial estimates suggest that the entropy of the distribution of binary feature vectors outputted by Iris.ExtractFeatureVector is greater than m + λ for parameter choices such as m = 256 and λ = 128. A more thorough analysis, however, is deferred to future work.
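A minimal sketch of S3Hash as defined in Eq. (1) is given below; the use of a seeded NumPy generator and the parameter m = 256 are illustrative assumptions, not prescribed by the scheme.

import numpy as np

def make_s3hash(n, m=256, seed=0):
    # Sample R = {r_1, ..., r_m} with entries in {-1, 0, 1} and return the hash function.
    rng = np.random.default_rng(seed)
    R = rng.integers(-1, 2, size=(m, n))        # m random ternary vectors of length n

    def s3hash(x):
        # x is a binary vector in {0,1}^n; the output is an m-bit vector of sign bits.
        inner = R @ x                           # <x, r_i> for every i, over the integers
        return (inner >= 0).astype(np.uint8)    # sgn: 1 if non-negative, 0 if negative
    return s3hash

# Similar iris feature vectors yield hashes that are close in Hamming distance, so matching can be
# done on the hashes alone without revealing the underlying template.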
Evaluation
We have ran experiments with S3Hash applied to feature vectors obtained using our Iris.ExtractFeatureVector algorithm. The distance measure we use is the hamming distance. The results of these experiments are shown in Table 2. Results for a threshold of 0.3 in particular indicates that our approach shows promise. We hope to make further improvements in future work.
Table 2. FAR & FRR for the CASIA-iris-interval data set

Threshold   FAR     FRR
0.25        3.99    64.26
0.26        6.35    57.35
0.27        9.49    49.63
0.28        13.92   43.79
0.29        19.60   36.47
0.30        26.23   30.79
0.31        34.01   25.10
0.32        42.30   19.33
0.33        50.91   16.00
0.34        59.04   11.94
0.35        67.05   8.53

5 Blockchain
A blockchain is used in the system for immutable storage of individuals' vaccination records. The blockchain we employ is a permissioned ledger to which blocks can only be added by authorized entities or persons such as hospitals, primary health care centers, clinicians, etc. Such entities have to obtain a public-key certificate from a trusted third party and store it on the blockchain as a transaction before they are allowed to add blocks to the ledger. The opportunity to add a new block is controlled in a round-robin fashion, thereby eliminating the need to perform a computationally intensive PoW process. Any transactions that are broadcast to the P2P network are signed by the entity that created the transaction, and can be verified by all other nodes by downloading the public key of the signer from the ledger itself. An example of distributed ledger technology that fulfills the above requirements is MultiChain [9].

5.1 Interface
We now describe an abstract interface for the permissioned blockchain that captures the functionality we need. Consider a set of parties P̂. A subset of parties P ⊂ P̂ are authorized to write to the blockchain. Each party P ∈ P has a secret key sk_P which it uses to authenticate itself and gain permission to write to the blockchain. How a party acquires authorization is beyond the scope of this paper. For our purposes, the permissioned blockchain consists of the following algorithms:

– Blockchain.Broadcast(P, sk_P, tx): On input a party identifier P that identifies the sending party, a secret key sk_P for party P and a transaction tx (whose form is described below), broadcast the transaction tx to the peer-to-peer network for inclusion in the next block. The transaction will be included iff P ∈ P.
– Blockchain.AnonBroadcast(sk_P, tx): On input a secret key sk_P for a party P and a transaction tx, anonymously broadcast the transaction tx to the peer-to-peer network for inclusion in the next block. The transaction will be included iff P ∈ P.
– Blockchain.GetNumBlocks(): Return the total number of blocks currently in the blockchain.
– Blockchain.RetrieveBlock(blockNo): Retrieve and return the block at index blockNo, which is a non-negative integer between 0 and Blockchain.GetNumBlocks() − 1.

A transaction has the form (type, payload, party, signature). A transaction in an anonymous broadcast is of the form (type, payload, ⊥, ⊥). The payload of a transaction is interpreted and parsed depending on its type. In our application, there are two permissible types: 'rec' (a record transaction, which consists of a pair (ID, record)) and 'hscan' (a biometric hash transaction, which consists of a hash of an iris feature vector). This will become clear from context in our formal description of our framework in the next section, which makes use of the above interface as a building block. The final point is that a block is a pair (hash, transactions) consisting of the hash of the block and a set of transactions {tx_i}_{i∈[ℓ]}.
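A minimal Python sketch of this abstract interface is shown below, purely to fix the shape of the API; the transaction and block types mirror the forms above, while signatures, validation and the underlying ledger (e.g. MultiChain) are abstracted away and all names are illustrative.

from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class Transaction:
    type: str                   # 'rec' or 'hscan'
    payload: Any                # (ID, record) for 'rec', a biometric hash for 'hscan'
    party: Optional[str]        # None for anonymous broadcasts
    signature: Optional[bytes]

@dataclass
class Block:
    hash: bytes
    transactions: List[Transaction] = field(default_factory=list)

class PermissionedBlockchain:
    # Illustrative stand-in; a real deployment would back this with a permissioned ledger.
    def __init__(self, authorised_parties):
        self.authorised = set(authorised_parties)
        self.blocks: List[Block] = []

    def broadcast(self, party, sk_party, tx: Transaction) -> bool:
        return party in self.authorised     # included iff the sending party is authorised

    def anon_broadcast(self, sk_party, tx: Transaction) -> bool:
        return True                         # the authorised sender's identity is not recorded

    def get_num_blocks(self) -> int:
        return len(self.blocks)

    def retrieve_block(self, block_no: int) -> Block:
        return self.blocks[block_no]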
6 Our Framework

6.1 Overview
In this section we provide a formal description of our proposed framework, which makes use of the building blocks presented in the previous sections. Our proposed system utilises a two-factor authentication mechanism to uniquely identify an individual on the blockchain. The parameters required to recreate an identifier are based on information that "one knows" and biometric information that "one possesses". Figure 2 describes the overall algorithm that we employ in our proposed system. When a user presents themselves to an entity or organisation participating in the system, they are asked for their DoB (dd/mm/yyyy) and Gender (male/female/other). In addition, the organization captures a number of scans of the user's iris, and creates a hash H1(fv) from the feature vector extracted from the "best" biometric scan data. Our system can combine the user's DoB and Gender with H1(fv) to generate a unique 256-bit identifier (ID) for the user:

ID = H2(DoB || Gender || H1(fv))    (2)
The algorithm tries to match the calculated hash H1(fv) with existing "anonymous" hashes that are stored on the blockchain. It may get back a set of hashes that are somewhat "close" to the calculated hash. In that case the algorithm concatenates each returned hash (Match_i) with the user's DoB and Gender to produce a candidate ID.
Fig. 2. Algorithm workflow
It then tries to match this candidate ID with an ID in a vaccination record transaction on the blockchain. If a match is found then the user is already registered on the system and has at least one vaccination record. At this point we may just wish to retrieve the user's records or add an additional record, e.g. when a booster dose has been administered to the user. However, if we go through the set of returned matches and cannot match the candidate ID to an existing ID in a vaccination record on the blockchain, i.e. this is the first time the user is presenting to the service, then we store the iris scan hash data H1(fv) as an anonymous record on the blockchain, and subsequently the ID and COVID-19 vaccination details for the user as a separate transaction. In each case the transaction is broadcast at a random interval on the blockchain peer-to-peer (P2P) network for it to be verified by other nodes in the system, and eventually added to a block on the blockchain. Uploading the two transactions belonging to a user at random intervals ensures that the transactions are stored on separate blocks of the blockchain, and an attacker is not easily able to identify the relationship between the two. Figure 3 shows a blockchain in which there are three anonymous transactions (i.e. hash of scan data) and three COVID-19 vaccination record transactions stored on the blockchain, pertaining to different users. The reader is referred to Sect. 4 for more details on how the hash is calculated in our system. Note that the storage of the anonymous hash data has to be carried out only once per registered user in the system.
Fig. 3. Blockchain structure
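A minimal sketch of the identifier derivation in Eq. (2), assuming SHA-256 plays the role of H2 and the S3Hash output of the iris feature vector plays the role of H1(fv); the string encoding of DoB and Gender is an illustrative choice rather than the paper's specified format.

import hashlib

def derive_id(dob: str, gender: str, h1_fv: bytes) -> bytes:
    # ID = H2(DoB || Gender || H1(fv)) as a 256-bit identifier, per Eq. (2).
    preimage = dob.encode() + gender.encode() + h1_fv
    return hashlib.sha256(preimage).digest()

# Hypothetical usage: derive_id("01/02/1990", "female", bytes(hash_bits)) gives the 32-byte ID
# stored in the 'rec' transaction, while the 'hscan' transaction stores only the biometric hash.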
6.2 Formal Description
We present a formal description of our framework in Fig. 4 and Fig. 5. Note that the algorithms described in these figures are intended to formally describe the fundamental desired functionality of our framework and are so described for ease of exposition and clarity; in particular, they are naive and non-optimized, specifically not leveraging more efficient data structures as would a real-world implementation. Let H be a family of collision-resistant hash functions. The algorithms in Fig. 4 are stateful (local variables that contain retrieved information from the blockchain are shared and accessible to all algorithms). Furthermore, the parameters tuple params generated in Setup is an implicit argument to all other algorithms. The algorithms invoked by the algorithms in Fig. 4 can be found in Fig. 5.

Fig. 4. Our framework for a COVID-19 passport

Fig. 5. Additional algorithms used by our framework
7 Conclusions and Future Work
In this paper we have detailed a framework to build a global vaccination passport using a distributed ledger. The main contribution of our work is to combine a locality-sensitive hashing mechanism with a blockchain to store the vaccination records of users. A variant of the SimHash LSH function is used to derive an identifier that leaks no personal information about an individual. The only way to extract a user's record from the blockchain is for the user to present themselves in person to an authorised entity and provide an iris scan along with other personal data in order to derive the correct user identifier. However, our research has raised many additional challenges and research questions whose resolution requires further investigation and experimentation, intended for future work. First and foremost, we need to improve the accuracy of the Iris.ExtractFeatureVector algorithm (e.g. deciding whether to use type1 or type2 template masking) and to accurately compute the entropy of the feature vectors. Furthermore, our variant of the SimHash algorithm, referred to in this paper as S3Hash, requires further analysis and evaluation, especially with respect to the domain of the random vectors r_i, which are restricted to be ternary in this paper. Additionally, we must choose a suitable blockchain. Finally, the overall protocol would benefit from a thorough security analysis where not just privacy but other security properties are tested. The blockchain-based mechanism that we have proposed can also be used as a generalised healthcare management system [6,12], with the actual data being stored off-chain for the purpose of efficiency. Once a user's identifier has been recreated, it can be used to pull all records associated with the user, thereby retrieving their full medical history. We are in the middle of developing a prototype implementation of the system and hope to present the results of our evaluation in a follow-on paper. At some time in the future, we hope to trial the system in the field, with the hope of rolling it out on a larger scale. Our implementation will be made open source.
References 1. Charikar, S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pp. 380– 388 (2002) 2. Chinese Academy Sciences’ Institute of Automation (CASIA) - Iris Image Database. http://biometrics.idealtest.org/ 3. COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). https://coronavirus.jhu.edu/map.html 4. Dang, T.M., Tran, L., Nguyen, T.D., Choi, D.: FEHash: full entropy hash for face template protection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020 5. Daugman, J.: Statistical richness of visual phase information: update on recognizing persons by iris patterns. Int. J. Comput. Vis. 45, 25–38 (2001) 6. Hanley, M., Tewari, H.: Managing lifetime healthcare data on the blockchain. In: 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation, Guangzhou, pp. 246– 251 (2018) 7. Masek, L.: Recognition of human iris patterns for biometric identification. Final Year Project, The School of Computer Science and Software Engineering. The University of Western Australia (2003) 8. Masek, L., Kovesi, P.: Matlab source code for a biometric identification system based on iris patterns. The School of Computer Science and Software Engineering. The University of Western Australia (2003) 9. Multichain. https://www.multichain.com/ 10. Phelan, A.L.: Covid-19 immunity passports and vaccination certificates: scientific, equitable, and legal challenges. Lancet 395(10237), 1595–1598 (2020) 11. Rathgeb, C., Uhl, A.: A survey on biometric cryptosystems and cancelable biometrics. EURASIP J. Inf. Secur. 2011, 3 (2011) 12. Tewari, H.: Blockchain research beyond cryptocurrencies. IEEE Commun. Stand. Mag. 3(4), 21–25 (2019) 13. World Health Organization. https://www.who.int/news-room/facts-in-pictures/ detail/immunization
Deep Learning Causal Attributions of Breast Cancer

Daqing Chen1(B), Laureta Hajderanj1, Sarah Mallet2, Pierre Camenen2, Bo Li3, Hao Ren3, and Erlong Zhao4

1 School of Engineering, London South Bank University, London SE1 0AA, UK
{chend,hajderal}@lsbu.ac.uk
2 Electronics and Digital Technologies Department, Polytech Nantes, 44300 Nantes, France
[email protected], [email protected]
3 School of Informatics and Electronics, Northwestern Polytechnical University,
Xi’an 710072, China [email protected], [email protected] 4 School of Information Science and Technology, Northwest University, Xian 710127, China [email protected]
Abstract. In this paper, a deep learning-based approach is applied to high dimensional, high-volume, and high-sparsity medical data to identify critical causal attributions that might affect the survival of a breast cancer patient. The Surveillance Epidemiology and End Results (SEER) breast cancer data is explored in this study. The SEER data set contains accumulated patient-level and treatment-level information, such as cancer site, cancer stage, treatment received, and cause of death. Restricted Boltzmann machines (RBMs) are proposed for dimensionality reduction in the analysis. RBM is a popular paradigm of deep learning networks and can be used to extract features from a given data set and transform data in a non-linear manner into a lower dimensional space for further modelling. In this study, a group of RBMs has been trained to sequentially transform the original data into a very low dimensional space, and then the k-means clustering is conducted in this space. Furthermore, the results obtained about the cluster membership of the data samples are mapped back to the original sample space for interpretation and insight creation. The analysis has demonstrated that essential features relating to breast cancer survival can be effectively extracted and brought forward into a much lower dimensional space formed by RBMs. Keywords: Restricted Boltzmann machines · Deep learning · Survival analysis · K-means clustering analysis · Principal component analysis
1 Introduction

Breast cancer is the most diagnosed cancer in women, affecting over 2.1 million women each year globally, and it causes the greatest number of cancer-related deaths among women. In 2018, it was estimated that 627,000 women died from breast cancer – that is, approximately 15% of all cancer deaths among women [1].
In order to improve the breast cancer survival rate, it is crucial to learn and understand the factors that might affect breast cancer survival and outcomes following certain treatments. In the past years, numerous studies have been undertaken in this area aiming to identify the causal attributions of breast cancer survival from multiple perspectives, including biological, diagnostic, and data mining techniques. From an analytical point of view, analysing breast cancer data has been challenged by: a) the volume of the data to be explored tends to be quite big, since it has been accumulated for several decades — for example, the SEER (Surveillance Epidemiology and End Results) data set [2] contains some 291,760 breast cancer incidences collected from 1974 to date; and b) the number of variables (features) of the data is usually considerably high, e.g., over a thousand or even more. The high dimensionality is mainly caused by the categorical variables contained in the data that may have many distinct symbolic values, since each of the distinct values of a categorical variable needs to be transformed into a unit vector using, for example, the one-hot encoding (orthogonal encoding) method to make a categorical variable applicable for an algorithm. Due to these factors, some typical data mining algorithms, such as the k-means clustering algorithm, may not perform well with high dimensionality, high volume, and high sparsity data.

The k-means clustering algorithm is a widely used unsupervised descriptive modelling approach for grouping (segmenting) a given data set based on similarities among the data samples with respect to the values taken by certain variables involved. The algorithm is simple, effective in general, and can converge within a few iterations. However, the algorithm has several problems when applied to a data set with high dimensionality and high sparsity:

1. The typical Euclidean distance for similarity measure is inefficient when the number of variables is large and the number of samples is relatively small [3];
2. The computational complexity of the algorithm increases with the number of dimensions [4]; and
3. It is difficult to determine the cluster centroids if the data values are sparse, i.e., only a small number of data entries have a non-null value. An example of such data is the data set resulting from transforming a data set containing many categorical variables with the one-hot encoding method. This can also be viewed as an asymmetric data matrix. In addition, sparsity makes the algorithm very sensitive to noise.

To address the high dimensionality problem involved in the k-means clustering, a proper dimensionality reduction approach is usually considered. Principal component analysis (PCA) is a popular approach for such a purpose. PCA forms a linear transformation to transform the original data into a new space spanned by a set of principal components. Depending on the significance of each principal component, only a few significant principal components could be selected to form a subspace with a low dimensionality, and then the k-means clustering can be performed in the subspace. The PCA-based subspace clustering approach has been applied to medical image segmentation [5]. It should be noted that data samples may become insufficient when dealing with high dimensional data, since it may lead to a modelling process that involves too many parameters to learn and/or optimize with a relatively small number of samples.
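As a point of reference for the PCA-based subspace clustering just described, a minimal scikit-learn sketch is shown below (one-hot encoding, projection onto a handful of principal components, then k-means); the numbers of components and clusters are illustrative choices, not values used in this study.

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

def pca_kmeans_segments(categorical_rows, n_components=10, n_clusters=5):
    # One-hot encode the categorical records, project onto a low-dimensional PCA subspace, then cluster.
    X = OneHotEncoder(handle_unknown="ignore").fit_transform(categorical_rows).toarray()
    Z = PCA(n_components=n_components).fit_transform(X)    # linear projection
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)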
In this paper, a deep learning-based approach is applied to identify critical causal attributions that might affect the survival of a breast cancer patient. The SEER breast cancer data is explored in this study. The data set contains rich patient-level and treatment-level information on breast cancer incidences, such as cancer site, cancer stage, treatment received, and cause of death. Restricted Boltzmann machines (RBMs) are proposed for dimensionality reduction in the present analysis. The RBM is a popular paradigm of deep learning networks and can be used to extract features from a given data set. RBMs transform data in a non-linear manner into a lower dimensional space for further modelling, for instance, clustering analysis and classification. In this study, a group of RBMs has been trained to sequentially transform the original data into a very low dimensional space, and the k-means clustering is then conducted in this space. Furthermore, the results obtained about the cluster membership of the data samples are mapped back to the original space for interpretation and insight creation. The analysis has demonstrated that essential features in the data set can be effectively extracted and brought forward into a much lower dimensionality space formed by RBMs.
The remainder of this paper is organized as follows. Section 2 provides a literature review of the relevant works in relation to diagnosis analytics on breast cancer data. Section 3 discusses in detail the methodology adopted in this study, including the entire analytical process and RBMs. The SEER data set is described in Sect. 4 along with the essential data pre-processing performed on the data. Section 5 gives a detailed account of the analysis experiments, and further, in Sect. 6, the findings from the analysis are interpreted and summarized. Finally, in Sect. 7, concluding remarks are discussed and further research is outlined.
2 Relevant Works Data mining techniques have been widely used in medical research and, especially, in the medical diagnosis of diverse types of cancer. Various models have been developed for this purpose, including qualitative models [6–9], quantitative models [10, 11], and hybrid models [12]. Very recently, many deep learning-based models have been considered in several case studies [13]. Segmenting breast cancer patients was studied in [6], and it was found that the number and the types of the resultant clusters were similar whether symptom occurrence rates or symptom severity ratings were used. Five clusters were identified using symptom occurrence rates while six clusters were established using symptom severity ratings. The types of clusters were also similar. A bisecting k-means algorithm was applied to analyze three diseases, breast cancer, Type 1 diabetes, and fibromyalgia, in [7]. Their results showed that, although the clusters established were different from each other, all the clusters had several common features. In [8], the time effect and symptoms for patients who received chemotherapy were examined using clustering analysis. An ensemble learning-based algorithm for lung cancer diagnosis was investigated in [14]. The algorithm can achieve a high classification accuracy with a low false no-cancer rate (false negative rate). The model contained two levels. The first level consisted of a group of individual neural networks which can be used to identify whether a cell is a cancer cell or not. A cell was considered a cancer cell if any network classified it as a cancer cell. The second level
employed a group of neural networks to determine which type of cancer a recognized cancer cell belongs to. This algorithm implements a certain classification scheme. A Bayesian network structure was proposed for cancer diagnosis purposes [10]. The network can be trained using a direct causal learner algorithm and has been applied to both simulated and real data sets. In addition, algorithms for dimensionality reduction and feature extraction have also been considered. A two-step algorithm was investigated to address the high dimensionality problem in gene analysis [15]. Interestingly, in [16] a statistical test and a genetic algorithm were utilized for feature selection, and further, leave-one-out cross validation was used along with the receiver operating characteristic curve to identify which features should be used in order to achieve the best classification performance. Several real-life data sets were chosen to test and evaluate the effectiveness of the algorithm. Comparing different data mining algorithms for better cancer risk or survival rate prediction and classification has received great research attention. In [12] various Bayesian-based classifiers were explored for the prediction of the survival rate 6 months after treatment, including the naive Bayesian classifier, selective naive Bayesian classifier, semi-naive Bayesian classifier, tree-augmented naive Bayesian classifier, and k-dependence Bayesian classifier. The performance of all these classifiers was evaluated and compared. Artificial neural networks (ANNs) were employed to examine mammograms [11]. The work demonstrated that ANNs with two hidden layers performed better than ANNs with only one hidden layer. In [17] logistic regression, ANNs, and Bayesian networks were compared for accuracy. Their results showed that the Bayesian model outperformed the other methods. Sensible new knowledge can be discovered when applying data mining algorithms in medicine. Yang [18] proposed a vicinal support vector classifier to handle data from different probability distributions. The proposed method had two steps: clustering and training. In the first step, a supervised kernel-based deterministic annealing clustering algorithm was applied to partition the training data into different soft vicinal areas in the feature space. By doing so, they constructed vicinal kernel functions. In the training step, the objective function, called the vicinal risk function, was minimized under the constraints of the vicinal areas defined in the clustering step. Kakushadze [19] applied k-means to cluster different types of cancer with genome data without using non-negative matrix factorization. They found that, out of 14 types of cancer, three had no cluster-like structures, two had high within-cluster correlations, and the others had a common structure. In conclusion, many data mining technologies such as neural networks and Bayesian networks have been applied to construct cancer diagnosis models. The constructed models can diagnose various cancers and have shown a high accuracy. On the other hand, there is still a clear lack of methods for dealing with high-dimensional and high-sparsity medical data.
3 Methodology In this paper a deep learning-based approach is proposed for feature extraction and dimensionality reduction in order to identify crucial causal attributions from high-dimensional, high-volume, and high-sparsity breast cancer data. The entire analysis process is illustrated conceptually in Fig. 1 and the key techniques applied are discussed below.
Fig. 1. The analysis process with the key steps.
3.1 Restricted Boltzmann Machines for Dimensionality Reduction and Feature Extraction RBMs are used in this research for feature extraction and dimensionality reduction. Using an RBM, data from an original high-dimensional space can be transformed into a feature space with a much smaller number of dimensions for analysis, such as the k-means clustering. It is our intention in this study to examine if this approach can effectively address the issues relating to cluster analysis in a high-dimensional and highly-sparse space in order to partition the breast cancer patients into various meaningful groups based on their similarities in relation to a set of features and measures. A typical topology of RBM is shown in Fig. 2. An RBM is an energy-based generative statistical model with hidden and visible variables [20]. The energy of the visible node state and hidden node state is defined as

E(v, h) = -\sum_{i \in \mathrm{visible}} b_i v_i - \sum_{j \in \mathrm{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}   (1)
where v_i and h_j are the ith visible variable and the jth hidden variable, i.e., the original input i and the feature j in the feature space; b_i and b_j are the biases to the nodes, and w_ij is the connection weight between them. The RBM assigns a probability to every possible pair of visible and hidden variables using the energy function

P(v, h) = \frac{1}{Z} e^{-E(v,h)}   (2)

where Z is the sum over all the possible pairs of the visible and hidden variables, expressed as

Z = \sum_{v,h} e^{-E(v,h)}   (3)

The probability of the visible v is given as

P(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}   (4)

To minimize the energy of that input data, the weights and the biases will be adjusted by

\Delta w_{ij} = \eta \left( \langle v_i h_j \rangle - \langle v_i' h_j' \rangle \right)   (5)
Fig. 2. Typical structure of RBM.
where η is the learning rate, and ⟨·⟩ denotes the expectation under the distribution of the variables v_i and h_j and of their reconstructed pair v_i' and h_j'. In practice, a single RBM may not be sufficient to extract features with a reduced dimensionality. Often several RBMs are used in a sequential way, forming stacked RBMs layer-by-layer. The outputs of a trained RBM in the stack are used as the inputs to the next adjacent RBM for training. The number of RBMs to be used varies and is usually determined using a trial-and-error approach. For detailed guidance on the training the readers can refer to, for example, [21].

3.2 k-Means Cluster Analysis k-means clustering is one of the most popular algorithms in data mining for grouping samples into a certain number of groups (clusters) based on the Euclidean distance measure. Assume V_1, V_2, · · · , V_n are a set of vectors, and these vectors are to be assigned to k clusters S_1, S_2, · · · , S_k. Then the objective function of the k-means clustering can be expressed as

f(m_1, m_2, \cdots, m_k) = \sum_{i=1}^{k} \sum_{V_j \in S_i} \| V_j - m_i \|^2   (6)

where m_i represents the centroid of cluster S_i.
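To make Eqs. (1)–(5) of Sect. 3.1 concrete, the following NumPy sketch (an illustration only, not the authors' implementation; the layer sizes, learning rate, and random input are assumptions) evaluates the RBM energy and performs one contrastive-divergence (CD-1) weight update:

```python
# Hedged sketch of the RBM energy (Eq. 1) and the CD-1 weight update (Eq. 5).
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, eta = 961, 625, 0.01          # illustrative sizes and learning rate
W = rng.uniform(-1, 1, size=(n_visible, n_hidden)) # connection weights w_ij
b_v = np.zeros(n_visible)                          # visible biases b_i
b_h = np.zeros(n_hidden)                           # hidden biases b_j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h):
    # Eq. (1): E(v, h) = -sum_i b_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij
    return -v @ b_v - h @ b_h - v @ W @ h

def cd1_update(v0):
    p_h0 = sigmoid(v0 @ W + b_h)                   # positive phase: P(h = 1 | v0)
    h0 = (rng.random(n_hidden) < p_h0).astype(float)
    p_v1 = sigmoid(h0 @ W.T + b_v)                 # reconstruction v'
    p_h1 = sigmoid(p_v1 @ W + b_h)                 # h' given the reconstruction
    # Eq. (5): delta w_ij = eta * (<v_i h_j> - <v'_i h'_j>)
    return eta * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

v0 = rng.integers(0, 2, n_visible).astype(float)   # one (random) binary input sample
W += cd1_update(v0)
```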
4 Data Pre-processing The SEER breast cancer data explored in this study contains a total of 291,760 incidences registered in the US from 1974 to 2017. The data set has two separate collections, with incidences from 1974 to 2014 and from 2014 to 2017, respectively. Note that the number of variables in the two collections is 134 and 130, respectively. Most of the variables in the data set are of categorical type, with a varying number of distinct values from 2 to more than 100. The original data sets need to be pre-processed and transformed into a target data set for analysis. The most crucial task in the data pre-processing process is to identify whether there are any data quality issues in the data and further to adopt appropriate strategies to address them accordingly.
The SEER data under consideration has typical data quality problems, including inconsistent variables in the two collections of the data, some duplicate variables that are directly or indirectly correlated with each other, missing values, and incomparable value ranges for several numeric variables. As such, the main tasks involved in the data pre-processing are as follows:
1. Select meaningful and common variables that are applicable to both sets of the data. Further, remove any duplicate variables that are identical from a statistical perspective and/or from a medical diagnostic perspective. As a result, a total of 130 variables have been chosen for use.
2. Remove any incidences that contain missing values. The removal is reasonable and acceptable since the entire data set is big and there were only some 10,000 incidences containing missing values. As such no replacement of missing values is needed.
3. Transform the value range of each numeric variable into the unit interval [0, 1] using min-max normalization.
4. Represent each distinct value of a categorical variable as a unit vector using one-hot encoding. This leads to a significant number of dummy variables being created.
The data pre-processing was very time-consuming, and it eventually led to a resultant target data set with 260,000 incidences and 961 variables. The variable Survival, which represents whether a patient survived, has been considered the target variable, since this analysis aims at identifying crucial factors that potentially affect the survival of a breast cancer patient. It should be noted that no incidences have been removed even though some may be considered outliers, because each instance should be analysed.
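A minimal pandas sketch of pre-processing steps 2–4 above is given below (illustrative only; the file name is hypothetical and the real SEER field names differ):

```python
# Hedged sketch of the pre-processing: drop missing values, min-max normalize
# numeric columns, and one-hot encode the categorical columns.
import pandas as pd

df = pd.read_csv("seer_breast_cancer.csv")           # hypothetical file name
df = df.dropna()                                      # step 2: drop incidences with missing values

num = df.select_dtypes("number").columns
df[num] = (df[num] - df[num].min()) / (df[num].max() - df[num].min())  # step 3: scale to [0, 1]

target = pd.get_dummies(df, dtype=float)              # step 4: one-hot encode categorical variables
print(target.shape)                                   # in the paper: roughly 260,000 x 961
```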
5 Experimental Settings Using the target data set created, RBMs have been implemented for dimensionality reduction and feature extraction, and the k-means clustering analysis has then been applied to the samples in the feature space formed by the RBMs. Three RBM models, RBM_1, RBM_2 and RBM_3, have been constructed in a sequential manner as follows:
• RBM_1 has 961 input nodes and 625 (25 × 25) hidden nodes;
• RBM_2 has 625 input nodes and 169 (13 × 13) hidden nodes; and
• RBM_3 has 169 input nodes and 81 (9 × 9) hidden nodes.
RBM_1 needs to be trained using the target data set; RBM_2 needs to be trained using the outputs of the trained RBM_1; and RBM_3 needs to be trained using the outputs of the trained RBM_2. In other words, the original data in a 961-dimensional space has been transformed into an 81-dimensional space for analysis. RBM_1, RBM_2, and RBM_3 have been trained with 10,000, 20,000, and 20,000 iterations, respectively. The initial values of all the connection weights and the biases were randomly selected from a uniform distribution in the interval [−1, 1].
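A sketch of this stacked-RBM pipeline (961 → 625 → 169 → 81) is shown below, using scikit-learn's BernoulliRBM as a stand-in for the RBMs described above; the iteration count, learning rate, and the `target` data frame (from the pre-processing sketch) are assumptions rather than the paper's exact settings:

```python
# Hedged sketch: train three RBMs sequentially, each on the previous RBM's outputs.
from sklearn.neural_network import BernoulliRBM

def train_stack(X, sizes=(625, 169, 81), n_iter=20, lr=0.05, seed=0):
    rbms, H = [], X
    for n_hidden in sizes:
        rbm = BernoulliRBM(n_components=n_hidden, n_iter=n_iter,
                           learning_rate=lr, random_state=seed)
        H = rbm.fit_transform(H)      # outputs of this RBM feed the next RBM
        rbms.append(rbm)
    return rbms, H                    # H: samples in the final 81-dimensional feature space

rbms, codes = train_stack(target.to_numpy())
```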
Following the entire analysis process as shown in Fig. 1, the k-means clustering analysis has been applied to the outputs of RBM_3. The number of centroids was set to 6 with randomly selected initial centroids to start the clustering process.
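Continuing the sketches above, the clustering step and the mapping of cluster membership back to the original 961-dimensional space could look as follows (summarizing each cluster by per-column means is an assumption about how the interpretation step might be realised, not the paper's exact procedure):

```python
# Hedged sketch: k-means in the 81-dimensional RBM space, then per-cluster
# profiles computed back in the original (pre-processed) space.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=6, init="random", n_init=10, random_state=0)
cluster_id = kmeans.fit_predict(codes)          # cluster membership in the RBM_3 space

profiles = target.groupby(cluster_id).mean()    # mean value of every original variable per cluster
print(profiles.round(2))                        # basis for a comparison such as Table 1
```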
6 Pattern Interpretations and Findings In order to interpret each of the clusters created for diagnosis purposes, the results obtained about the cluster membership of all the samples have been mapped back to the original space (of 961 dimensions), since the variables in each of the lower dimensional spaces are not interpretable. Note that each instance of a patient’s records in each of the spaces, e.g., the original and all the RBM spaces, can be visualized by a “facial”-like imagery, and this enables patients’ profiles to be compared with each other in an intuitive and easy way. Examples of such “imagery” descriptions are provided in Fig. 3, where each two rows from the top to the bottom contain 10 patients’ profiles of the same cluster in the RBM_3 (81 dimensions) space and their counterparts in the original (961 dimensions) space,
Fig. 3. Samples of patients’ profiles in a “facial”-like imagery description. Each two rows from the top to the bottom contain 10 patients’ profiles of the same cluster in the RBM_3 (81 dimensions) space and their counterparts in the original (961 dimensions) space, respectively.
Table 1. A comparison of different clusters based on several features.
                                      Cluster 0  Cluster 1  Cluster 2  Cluster 3  Cluster 4  Cluster 5
Survival (%)                              25.00      92.00      72.00      89.00      74.00      78.00
Grade (%)
  Grade 1                                  7.00      23.00      10.00      14.00       3.00      32.00
  Grade 2                                 27.00      45.00      41.00      31.00       8.00      59.00
  Grade 3                                 31.00      24.00      43.00      27.00      80.00       3.00
  Grade 4                                  3.00       0.00       1.00       2.00       3.00       0.00
  Cell not Determined                     32.00       8.00       5.00      26.00       6.00       6.00
Number of Malignant Tumors
  1                                       58.00       1.00      68.00      47.00      64.00      57.00
  2                                       33.00      71.00      26.00      39.00      28.00      33.00
  3                                        7.00      23.00       5.00      11.00       7.00       8.00
  4                                        2.00       4.00       1.00       2.00       1.00       1.00
  5                                        0.00       1.00       0.00       1.00       0.00       1.00
  6                                        0.00       0.00       0.00       0.00       0.00       0.00
  7                                        0.00       0.00       0.00       0.00       0.00       0.00
Treatment
  Surgery Performed                       38.00      95.00      99.00      95.00      99.00      99.00
  Surgery not Recommended                 34.00       2.00       1.00       3.00       1.00       1.00
  Contraindicated                          2.00       0.00       0.00       0.00       0.00       0.00
  Died before                              0.00       0.00       0.00       0.00       0.00       0.00
  Unknown Reason for No Surgery           15.00       1.00       0.00       2.00       0.00       0.00
  Refused                                  4.00       1.00       0.00       0.00       0.00       0.00
  Recommended                              1.00       1.00       0.00       0.00       0.00       0.00
  Unknown if Surgery Performed             6.00       0.00       0.00       0.00       0.00       0.00
Alive or Died due to Cancer               24.00       1.00      65.00      57.00      68.00      67.00
Dead                                      41.00       0.00      15.00       1.00       7.00       3.00
Not first Tumor                           35.00      99.00      20.00      42.00      25.00      30.00
In Situ                                    0.00       3.00       0.00      99.00       0.00       0.00
Malignant                                100.00      97.00     100.00       1.00     100.00     100.00
Primaries
  One primary only                        58.00       1.00      67.00      47.00      63.00      57.00
  First of Two or More Primaries           6.00       0.00      12.00      12.00      12.00      13.00
  Second of Two or More Primaries         30.00      79.00      18.00      34.00      22.00      26.00
  Third of Three or More Primaries         4.00      16.00       2.00       6.00       3.00       4.00
  Fourth of Four or More Primaries         1.00       3.00       0.00       1.00       0.00       0.00
  Fifth of Five or More Primaries          0.00       0.00       0.00       0.00       0.00       0.00
  Sixth of Six or More Primaries           0.00       0.00       0.00       0.00       0.00       0.00
  Seventh of Seven or More Primaries       0.00       0.00       0.00       0.00       0.00       0.00
  Eighth of Eight or More Primaries        0.00       0.00       0.00       0.00       0.00       0.00
  Ninth of Nine or More Primaries          0.00       0.00       0.00       0.00       0.00       0.00
  Unknown                                  0.00       0.00       0.00       0.00       0.00       0.00
Stage
  Stage 0                                  0.00       3.00       0.00      99.00       1.00       0.00
  Stage I                                  2.00      62.00       1.00       0.00      62.00      84.00
  Stage IIA                                2.00      21.00      36.00       0.00      30.00      12.00
  Stage IIB                                2.00       6.00      28.00       0.00       3.00       1.00
  Stage III NOS                            1.00       0.00       0.00       0.00       0.00       0.00
  Stage IIIA                               2.00       2.00      22.00       0.00       0.00       0.00
  Stage IIIB                               8.00       1.00       3.00       0.00       1.00       0.00
  Stage IIIC                               4.00       1.00       9.00       0.00       0.00       0.00
  Stage IV                                41.00       1.00       0.00       0.00       0.00       0.00
  Not Applicable                           2.00       0.00       0.00       0.00       0.00       0.00
  Stage Unknown                           36.00       3.00       2.00       0.00       2.00       2.00
respectively. Only samples from 4 out of the 6 clusters created were selected. The value for each pixel of the images is either 1 or 0 if the associated component is of binary data type, or between 0 and 1 if the associated component is of numeric data type, indicating a scale grading between black and white, especially in the original space. The six clusters are labelled as Cluster 0, Cluster 1, …, and Cluster 5, respectively. To examine the clusters and to compare them with each other, several variables have been used, as shown in Table 1. In terms of factors affecting the survival rate, several factors are crucial. It appears that whether a patient had surgery performed and the stage of the tumor are two essential causal attributors. Those who either did not have surgery performed or were not recommended for surgery had a very low survival rate (only 25%). In addition, patients in this group usually had a stage IV or unknown-stage tumor. As such, accurately and timely detecting the stage of a tumor and recommending accordingly whether to have surgery is crucial. A consistent pattern can be identified if, for example, the two clusters Cluster 0 and Cluster 1 are examined using a radar graph, as shown in Fig. 4. It is evident from Fig. 4 that a high survival rate was closely correlated with whether a tumor was localized and whether surgery was performed. On the other hand, low survival was in general related to a tumor that was regional or distant and to surgery not being performed.
[Radar chart: axes include Grade 1–4 and Grade Unknown; Stage Localized, Regional, Distant and Unknown; Negative Nodes, No Positive Nodes, No/One/Two/Three/Four Nodes Examined; and Treatment/Surgery indicators, on a 0–100% scale.]
Fig. 4. Low survival rate (in blue) vs. high survival rate (in red) with several factors.
7 Concluding Remarks and Future Work In this paper, a deep learning-based approach is applied to high-dimensional, high-volume, and high-sparsity medical data to identify critical causal attributions that might affect the survival of a breast cancer patient. The analysis has demonstrated that essential features in a high-dimensional sample space can be effectively extracted and brought forward into a much lower dimensionality space formed by RBMs. This provides a novel approach to understanding high-dimensional data. The analysis has also identified, among all the variables, several crucial causal factors that significantly affect a patient’s survival rate. Further research will focus on exploring what features an RBM extracts, how to interpret the feature space established by an RBM, and how to design an ideal imagery description for visualizing a healthy person for comparison purposes.
References 1. WHO. https://www.who.int/cancer/prevention/diagnosis-screening/breast-cancer/en/ 2. Hankey, B.F., Ries, L.A., Edwards, B.K.: The surveillance, epidemiology, and end results program. Cancer Epidemiol. Prev. Biomark. 8(12), 1117–1121 (1999) 3. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Amsterdam (2011) 4. Zhang, K., Liu, J., Chai, Y., Qian, K.: An optimized dimensionality reduction model for high-dimensional data based on restricted boltzmann machines. In: 27th Chinese Control and Decision Conference (2015 CCDC), Qingdao, China, pp. 2939–2944. IEEE (2015) 5. Katkar, J.A., Baraskar, T.: A novel approach for medical image segmentation using PCA and K-means clustering. In: 2015 International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), Davangere, India, pp. 430–435. IEEE (2015)
6. Sullivan, C.W., et al.: Differences in symptom clusters identified using symptom occurrence rates versus severity ratings in patients with breast cancer undergoing chemotherapy. Eur. J. Oncol. Nurs. 28, 122–132 (2017) 7. Chen, A.T.: Exploring online support spaces: using cluster analysis to examine breast cancer, diabetes and fibromyalgia support groups. Patient Educ. Couns. 87(2), 250–257 (2012) 8. Sanford, S.D., Beaumont, J.L., Butt, Z., Sweet, J.J., Cella, D., Wagner, L.I.: Prospective longitudinal evaluation of a symptom cluster in breast cancer. J. Pain Symptom Manage. 47(4), 721–730 (2014) 9. Sarenmalm, E.K., Browall, M., Gaston-Johansson, F.: Symptom burden clusters: a challenge for targeted symptom management. A longitudinal study examining symptom burden clusters in breast cancer. J. Pain Symptom Manage. 47(4), 731–741 (2014) 10. Rathnam, C., Lee, S., Jiang, X.: An algorithm for direct causal learning of influences on patient outcomes. Artif. Intell. Med. 75, 1–15 (2017) 11. Fogel, D.B., Wasson III, E.C., Boughton, E.M.V., Porto, W.: Evolving artificial neural networks for screening features from mammograms. Artif. Intell. Med. 14(3), 317–326 (1998) 12. Blanco, R., Inza, I., Merino, M., Quiroga, J., Larrañaga, P.: Feature selection in bayesian classifiers for the prognosis of survival of cirrhotic patients treated with tips. J. Biomed. Inform. 38(5), 376–388 (2005) 13. Kose, U., Alzubi, J. (eds): Deep Learning for Cancer Diagnosis. Springer, Singapore (2020). https://doi.org/10.1007/978-981-15-6321-8 14. Zhou, Z.H., Jiang, Y., Yang, Y.B., Chen, S.F.: Lung cancer cell identification based on artificial neural network ensembles. Artif. Intell. Med. 24(1), 25–36 (2002) 15. Xu, R., Damelin, S., Nadler, B., Wunsch, D.C.: Clustering of high-dimensional gene expression data with feature filtering methods and diffusion maps. Artif. Intell. Med. 48(2), 91–98 (2010) 16. Li, L., et al.: Data mining techniques for cancer detection using serum proteomic profiling. Artif. Intell. Med. 32(2), 71–83 (2004) 17. Regnier-Coudert, O., McCall, J., Lothian, R., Lam, T., McClinton, S., NDow, J.: Machine learning for improved pathological staging of prostate cancer: a performance comparison on a range of classifiers. Artif. Intell. Med. 55(1), 25–35 (2012) 18. Yang, X., Cao, A., Song, Q., Schaefer, G., Su, Y.: Vicinal support vector classifier using supervised kernel-based clustering. Artif. Intell. Med. 60(3), 189–196 (2014) 19. Kakushadze, Z., Yu, W.: k-means and cluster models for cancer signatures. Biomol. Detect. Quantif. 13, 7–31 (2017) 20. Hinton, G.E.: A practical guide to training restricted Boltzmann machines. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, pp. 599–619. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_32 21. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
Effect of Adaptive Histogram Equalization of Orthopedic Radiograph Images on the Accuracy of Implant Delineation Using Active Contours Alicja Smolikowska1, Pawel Kamiński2, Rafal Obuchowicz3, and Adam Piórkowski1(B) 1
Department of Biocybernetics and Biomedical Engineering, AGH University of Science and Technology, Mickiewicza 30 Avenue, 30–059 Cracow, Poland [email protected] 2 Malopolska Orthopedic and Rehabilitation Hospital, Modrzewiowa 22, 30–224 Cracow, Poland 3 Department of Diagnostic Imaging, Jagiellonian University Medical College, Kopernika 19, 31–501 Cracow, Poland
Abstract. The segmentation of implants on orthopedic radiographs is a difficult task. Due to different acquisition conditions, some photos have reduced quality and contrast. The use of cement in alloplastics particularly hinders this process. This article presents considerations on how to increase the effectiveness of the active contour method when segmenting these implants using adaptive histogram equalization. Radiogram analyses are presented and a binarization method is proposed. Segmentation accuracy is experimentally estimated in relation to the parameter space of the algorithms used. Keywords: Radiograph · Knee · Hip · Implant · Prosthesis · Segmentation · Active contour · Histogram equalisement · CLAHE
1 Introduction
The hip and knee are the most commonly replaced joints. Globally, approximately half a million procedures are performed each year [19]. The best results in terms of prostheses survival for more than 20 years are achieved with use of total hip arthroplasty (THA) and total knee arthroplasty (TKA) [2,16]. With THA, both the femoral head and the acetabulum are replaced; with TKA, both condyles are replaced. With these types of prosthesis, the fixed prosthetic devices articulate with each other, which significantly limits bone wear. There are important differences in the fixation techniques of prostheses. With cemented prostheses, a cement mantle is present around the metallic shaft. With non-cemented prostheses, the shaft might be coated with hydroxyapatite in order to induce new
bone formation [7,8]. Titanium implants became more common. Titanium has been used in medicine since 1970 [1], and it has been found to display high biocompatibility [5,7,11]. In vitro studies have shown that this material stimulates osteointegration [1]. Bone sclerosis and femoral shaft thickening are radiographic indicators of bone ingrowth and prosthesis fixation [3,15]. Aseptic loosening (AL) is the most common complication and the reason for revision surgery related to both hip and knee prostheses [12]. The presence of lucency exceeding 2 mm width around the metallic shaft is a widely accepted radiographic symptom of AL [10]. Along with the coexistence of clinical indicators such as pain, the aforementioned radiographic indicators can be assumed to represent prosthesis loosening. This phenomenon is evaluated on the basis of radiograms as the placement of a prosthesis changes over time. In the case of THA and TKA, the presence of lucency that extends to the entire circumference of the prosthesis is the most important indicator of loosening. Furthermore continuous widening of the lucency strongly suggests a progressing loosening [18]. The aforementioned symptoms have to be meticulously examined over time using consecutive radiograms over a time span that is appropriate to the clinical situation [13]. Assessment of the geometry of the hip prosthesis and shaft-bone interface is therefore a demanding task that is prone to errors due to the subjectivity of observers [6]. Radiographic image segmentation and other image post-processing techniques are promising tools for increasing the accuracy of radiological assessment of TKA and THA. This article proposes a new, precise approach to implant delineation on orthopedic radiograph images. The document is divided into several sections as follows. Section 1 introduces the topic of scientific achievements in the field of orthopedic radiography. Section 2 provides background information on input data, processing methods and the proposed approach. Next the experimental evaluation is presented and the results are discussed. The last part contains the conclusions.
2 Implant Segmentation on Radiographs
Implant segmentation is a task that has been undertaken by various teams, and different solutions have been developed. For example, the Fuzzy C-Means method was used in [14]. Unfortunately, this approach is only effective for implants that are well-defined in images. A popular approach is the use of edge detection using the Canny algorithm [4,9]. The work [17] proposed a segmentation consisting of four stages: denoising, binary masking, growing the region using the Malladi-Sethian level-set algorithm, and contour refining. The method is not fully automatic and requires marking of the seed point. The above works did not report any difficulties due to the variable parameters of radiographs. An implant is usually represented as a homogeneous area, but various problems may arise. For example, the saturation brightness for an implant may be comparable to the saturation brightness of the adjacent bone. Another difficulty is the presence of cement in the neighborhood of the implant. Previous attempts at segmentation did not address the indicated problems, but attention is drawn to them in this work.
2.1 Input Data
For the purposes of this study, X-ray images were prepared that documented the course of treatment in 241 patients. All images were acquired with a Konica Minolta Regius CS3. The dimensions of the images ranged from 1008 × 1350 to 2446 × 2010 pixels, and the resolution (pixel spacing) was 0.175 mm. The source image depth was 12 bit. From these pictures, the 30 most difficult cases, in which the aforementioned issues appear, were selected.
2.2 Proposed Approach
An active contour-based approach is proposed. Initial contours were manually prepared in the implant area; then, for the whole parameter space of the algorithms used, the binary segmentations indicated by the active contour were prepared. After adding up the binary masks (Fig. 1(c), 1(d)), a statistical description of the results was obtained, which was then subjected to automatic binarization (Fig. 1(e), 1(f)). In this way, the proposed implant shape was finally achieved. The above procedure was performed for the original image (Fig. 1, left column) and for the image after adaptive contrast adjustment (Fig. 1, right column), taking into account the parameter space of these algorithms.
2.3 Space of Parameters
The work uses the ABsnake implementation of the active contour algorithm that is available for the ImageJ environment. It was found that 50 iterations were sufficient, and that gradientthreshold is an important parameter that affects the result. Histogram operations were performed using the “Enhance Local Contrast (CLAHE)” function of the ImageJ/Fiji package. Blocksize and maxslope were identified as parameters that affect the result. The histogrambins parameter was taken as 256. For the purposes of the experiment, the following ranges of parameters were assumed:
– gradientthreshold (ABsnake active contour): values of 2, 3, 4.
– blocksize (Enhance Local Contrast (CLAHE)): values of 11, 111, ..., 1511.
– maxslope (Enhance Local Contrast (CLAHE)): values of 2, 3, 4.
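The sweep over the CLAHE and active-contour parameters, the summation of the binary masks, and the final automatic binarization described in Sects. 2.2–2.3 could be sketched as follows. This is an illustration only: scikit-image's active_contour and OpenCV's CLAHE are stand-ins for the ImageJ ABsnake plugin and the Fiji "Enhance Local Contrast (CLAHE)" function, their parameters (w_edge, clipLimit, tileGridSize) do not map one-to-one onto gradientthreshold, maxslope and blocksize, and Otsu thresholding stands in for the automatic binarization step.

```python
# Hedged sketch of the proposed pipeline: CLAHE grid -> active contours ->
# sum of binary masks -> automatic (Otsu) binarization.
import cv2
import numpy as np
from skimage.segmentation import active_contour
from skimage.draw import polygon2mask

def segment(image, init_snake, clahe_grid=((2, 8), (3, 8), (4, 8)), edge_weights=(1.0, 2.0, 4.0)):
    """image: 8-bit radiograph; init_snake: (N, 2) initial contour in (row, col) coordinates."""
    votes = np.zeros(image.shape, dtype=np.float32)
    for clip_limit, tiles in clahe_grid:                       # stand-ins for maxslope/blocksize
        clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(tiles, tiles))
        enhanced = clahe.apply(image)
        for w_edge in edge_weights:                            # stand-in for gradientthreshold
            snake = active_contour(enhanced.astype(float), init_snake, w_edge=w_edge)
            votes += polygon2mask(image.shape, snake)          # accumulate binary segmentations
    votes = cv2.normalize(votes, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(votes, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return mask
```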
2.4 Experimental Evaluation
As part of the experiment, segmentation using the active contour method was tested on the original images (for 3 values of the threshold gradient parameter); the same was done for the same images after applying local contrast enhancement throughout the previously described parameter space. Images of segmented binary sums were determined and compared to segmentation agreed by radiologists. It should be emphasized that the outline that initiated the active contour algorithm was the same in both cases.
Fig. 1. Stages of the proposed algorithm: (a) source; (b) after CLAHE; (c), (d) sum of binary AC segmentations; (e), (f) after automatic binarization.
The most important aim of the tests was to assess the effectiveness of the approach. For the group of 30 original images, the use of the active contour algorithm allowed at least one correct contour to be obtained in 15 out of 30 cases (50%). When using the CLAHE algorithm, at least one valid segmentation was found for each input image in the tested space (30 of 30). In these cases, the images of summed binary segmentations gave the correct result after the final automatic binarization (different binarization algorithms are valid). The 3D plot (Fig. 2) depicts the parameter space, in which points mark the parameter sets of the correct segmentations obtained for all images. This space is unfortunately not homogeneous; however, a set of parameters was found for which there were the most (60%) valid segmentations:
– gradientthreshold: 4;
– blocksize: 511;
– maxslope: 2.
Fig. 2. Parameter space of correct segmentations of a set of 30 input images for the proposed method (axes: binsize, maxslope, gradientthreshold).
Fig. 3. Source images: (a) source image; (b) histogram equalized; (c) after CLAHE; (d) profile.
Fig. 4. Profile plots. Lines: black - source image; red - after histogram equalization; blue - after CLAHE.
Fig. 5. Profile plots for ROI. Lines: black - source image; red - after histogram equalization; blue - after CLAHE. Yellow zone - the signal slope on the edges of the implant.
Fig. 6. Final segmentations: (a) final segmentation; (b) final segmentation for Fig. 1. Red line - for original images without preprocessing; green line - for images after CLAHE.
3 Conclusions
The proposed implant segmentation approach allows the correct delineation of objects on a radiograph with satisfactory efficiency. Using adaptive histogram equalization, the number of valid segmentations increased significantly. For each image in a selected group, at least one correct delineation was achieved, and the final solutions were also satisfactory. Analysis of the profiles of the sample image (Fig. 3) indicates that the raised edge of the bone-implant connection is definitely enhanced by the CLAHE algorithm (but not by, for example, the global histogram equalization, Fig. 4, 5); therefore the active contour algorithm more accurately matched the implant edge (Fig. 6). The proposed method of statistical object determination seems to have potential and will be used in other future image-processing tasks. Future work involves the precise analysis of prosthesis loosening on orthopedic radiographs.
Acknowledgment. This work was financed by the AGH – University of Science and Technology, Faculty of EAIIB, KBIB no 16.16.120.773.
References 1. Benazzo, F., Botta, L., Scaffino, M.F., Caliogna, L., Marullo, M., Fusi, S., Gastaldi, G.: Trabecular titanium can induce in vitro osteogenic differentiation of human adipose derived stem cells without osteogenic factors. J. Biomed. Mater. Res. Part A 102(7), 2061–2071 (2014) 2. Berry, D.J., Scott Harmsen, W., Cabanela, M.E., Morrey, B.F.: Twenty-five-year survivorship of two thousand consecutive primary charnley total hip replacements: factors affecting survivorship of acetabular and femoral components. JBJS 84(2), 171–177 (2002) 3. Callaghan, J.J.: The clinical results and basic science of total hip arthroplasty with porous-coated prostheses. JBJS 75(2), 299–310 (1993) 4. Hermans, J., Bellemans, J., Maes, F., Vandermeulen, D., Suetens, P.: A statistical framework for the registration of 3D knee implant components to single-plane x-ray images. In: 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8. IEEE (2008) 5. Hofmann, A.A.: Response of human cancellous bone to identically structured commercially pure titanium and cobalt chromium alloy porous-coated cylinders. Clin. Mater. 14(2), 101–115 (1993) 6. Kamiński, P., Szmyd, J., Ambroży, J., Jurek, W.: Postoperative migration of short stem prosthesis of the hip joint. Ortop. Traumatol. Rehabil. 17(1), 29–38 (2015) 7. Kamiński, P., Szmyd, J., Ambroży, J., Jurek, W., Jaworski, J.: Use of trabecular titanium implants for primary hip arthroplasty. Ortop. Traumatol. Rehabil. 18(5), 461–470 (2016) 8. Kaplan, P.A., Montesi, S.A., Jardon, O.M., Gregory, P.R.: Bone-ingrowth hip prostheses in asymptomatic patients: radiographic features. Radiology 169(1), 221–227 (1988) 9. Mahfouz, M.R., Hoff, W.A., Komistek, R.D., Dennis, D.A.: Effect of segmentation errors on 3D-to-2D registration of implant models in X-ray images. J. Biomech. 38(2), 229–239 (2005) 10. Manaster, B.J.: From the RSNA refresher courses. Total hip arthroplasty: radiographic evaluation. Radiographics 16(3), 645–660 (1996) 11. Marin, E., Fusi, S., Pressacco, M., Paussa, L., Fedrizzi, L.: Characterization of cellular solids in Ti6Al4V for orthopaedic implant applications: trabecular titanium. J. Mech. Behav. Biomed. Mater. 3(5), 373–381 (2010) 12. Nishii, T., Sugano, N., Masuhara, K., Shibuya, T., Ochi, T., Tamura, S.: Longitudinal evaluation of time related bone remodeling after cementless total hip arthroplasty. Clin. Orthop. Relat. Res. 339, 121–131 (1997)
13. O’Neill, D.A., Harris, W.H.: Failed total hip replacement: assessment by plain radiographs, arthrograms, and aspiration of the hip joint. J. Bone Jt. Surg. Am. 66(4), 540–546 (1984) 14. Oprea, A., Vertan, C.: A quantitative evaluation of the hip prosthesis segmentation quality in x-ray images. In: 2007 International Symposium on Signals, Circuits and Systems, vol. 1, pp. 1–4. IEEE (2007) 15. Sychterz, C.J., Engh, C.A.: The influence of clinical factors on periprosthetic bone remodeling. Clin. Orthop. Relat. Res. 322, 285–292 (1996) 16. Szmyd, J., Jaworski, J.M., Kami´ nski, P.: Outcomes of total knee arthroplasty in patients with bleeding disorders. Ortop. Traumatol. Rehabil. 19(4), 361–371 (2017) 17. Tarroni, G., Tersi, L., Corsi, C., Stagni, R.: Prosthetic component segmentation with blur compensation: a fast method for 3D fluoroscopy. Med. Biol. Eng. Comput. 50(6), 631–640 (2012) 18. Tigges, S., Stiles, R.G., Roberson, J.R.: Complications of hip arthroplasty causing periprosthetic radiolucency on plain radiographs. AJR. Am. J. Roentgenol. 162(6), 1387–1391 (1994) 19. Yu, S., Saleh, H., Bolz, N., Buza, J., Iorio, R., Rathod, P.A., Schwarzkopf, R., Deshmukh, A.J.: Re-revision total hip arthroplasty: epidemiology and factors associated with outcomes. J. Clin. Orthop. Trauma 11, 43–46 (2020)
Identification Diseases Using Apriori Algorithm on DevOps Aya Mohamed Morsy(B) and Mostafa Abdel Azim Mostafa College of Computing and Information Technology, Arab Academy for Science, Technology and Maritime Transport, AAST, Cairo, Egypt
Abstract. Nowadays, a lot of changes arise rapidly due to customer needs, like delivering fast to the market, delivering better products at optimum cost, minimizing risk, and the ability to manage change and prioritize it. We also need to reduce the time taken to gain customer feedback, so as to meet customer needs. We must have a fast process and good communication in place, as provided by DevOps. The health care industry generates a lot of data about patients, diseases and hospitals. We can use this data to help doctors and the healthcare sector to make better decisions. The Apriori algorithm and Apriori-TID have been used to find the frequent items in a given dataset. We used the algorithms to find frequent diseases using other parameters like year or gender. This paper introduces a new web application based on DevOps (AWS) using the Apriori algorithm and Apriori-TID to identify frequent diseases. The experiment shows that the proposed application improved and minimized the execution time by 15% compared to the local machine, owing to the use of DevOps tools and memory usage. Keywords: DevOps · AWS · Data Mining · Apriori Algorithm · Apriori-TID Algorithm
1 Introduction Challenges require big organizations to integrate their teams and software. This is the reason for the need to use DevOps. DevOps’ main demand on IT is to deliver faster in a continuous, automated way by bridging the gap between teams (here, the development and operations teams). There is more pressure to adapt to the market needs and deliver quickly due to the continuous change of the environment and technology. In order to meet these changes, the software team needs to be more agile in all phases like coding, building, testing, integration and deployment in a continuous way by using continuous integration and delivery (CI & CD) [1]. Systems also need to reduce the time taken and the price, and to manage the complex process as well [2]. Most organizations face communication problems between the development team and the operations team, as there is always a barrier between them. When the product does not satisfy the customer’s needs, a lot of complaints are raised between the development team and the operations team, like: “unrealistic project deadline”, “lazy technical staff”,
and “sloppy coding”. Each team will think that it is not their problem, and that it is the concern of the other team. Organizations also need to reduce the time involved in getting customer feedback by automating it, so there is a big need to build a model that fits the end user requirements [3]. This model is a combination of Water-Scrum-Fall and DevOps on cloud computing using AWS, with the Apriori algorithm and Apriori-TID algorithm to get frequent diseases [4]. What is AWS? Amazon Web Services (AWS) is a cloud service platform offering machines, databases, content delivery and other functionality to help businesses scale up. Figure 1 explains the AWS infrastructure.
Fig. 1. AWS cloud formation [5]
The Apriori algorithm is a classic algorithm mainly used for mining frequent item sets and association rules. Usually, this algorithm is used on datasets containing a large number of items, like medical datasets. It is one of the most important association rule algorithms. It was introduced by Agrawal in 1993 and is used to get frequent item sets from databases. It is very useful for large data sets and big data. The frequent items are used to determine the association rules that help to predict the next step. As an example, in the context of market basket analysis, if we have a set of items {A, B, C, D}, customers who used to buy A are likely to buy B, and customers who used to buy C are likely to buy D; the aim is to find the association rules between different items and products that can be bought together. This helps the organization and the marketing team to put such items together in the same section and to predict what else can go with A and B, so that they can create suitable discounts on them. To extract the association rules after getting the most frequent item sets, the following are some measures [6]:
• Support(A → B) = Frequent(A, B)/N
• Confidence(A → B) = Frequent(A, B)/Frequent(A)
• Lift(A → B) = Support(A → B)/(Support(A) × Support(B))
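As a small worked example of these measures (with made-up transactions), consider:

```python
# Hedged example: computing support, confidence and lift for the rule A -> B.
transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}, {"A", "B", "D"}]
N = len(transactions)

def freq(items):
    return sum(items <= t for t in transactions)     # transactions containing all the items

support = freq({"A", "B"}) / N                                # 3/5 = 0.60
confidence = freq({"A", "B"}) / freq({"A"})                   # 3/4 = 0.75
lift = support / ((freq({"A"}) / N) * (freq({"B"}) / N))      # 0.60 / (0.8 * 0.8) ≈ 0.94
print(support, confidence, lift)
```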
A lot of deployments pile up ahead of IT Operations. Clyde Logue said, “Agile was instrumental in Development regaining the trust in the business, but it unintentionally left IT Operations behind” [7]. DevOps is a way of bridging the gap between development and operations by creating strong collaboration between teams; the key to overcoming this gap is automation and CI & CD [8]. Also, the general direction is towards deploying applications across multiple platforms on the cloud.
2 Problem Definition Roughly 65% of big data projects fail because of their needs for rapid delivery, real-time performance, high availability, and high scalability, and the necessity to scale smoothly with no downtime, as shown in Fig. 2. AWS services provide operations and automation with high availability (24 × 7) and high scalability [9]. Cloud services offer this option as pay-as-you-go. This will minimize the cost.
Fig. 2. Big data project failure [10]
One of the problems of data analytics projects is organizational change: how do business analysts, data scientists, and other stakeholders work with the system design, development, and operations teams in a project and get fast delivery? In a traditional project, system engineers create new features and deliver a product cycle in an average time of 6 to 8 months [9], but with DevOps there can be more than one delivery per day.
3 Background: Evolution of the Software Delivery First, there are many software models, such as Waterfall, Agile, Water-Scrum-Fall, Hybrid V, and so on. All of these models aim to improve the process life cycle to meet customer needs.
3.1 Traditional Model (Waterfall) The traditional models are the oldest; popular traditional models include the Waterfall Model, Iterative Model, Spiral Model, V-Model and Big Bang Model. There are problems with
the waterfall model, as it does not meet the needs of new businesses because the model does not adjust to new changes. A working version of the system is often not seen until late in the project’s life, which means the client is not able to see the project until it is done. It is difficult to integrate risk management. There is no feedback and there is bad communication between the teams, meaning that they cannot deliver fast.
3.2 Agile Model Popular agile methodologies are extreme programming models like the SCRUM model, Crystal Clear model, Feature Driven Development (FDD) and Test Driven Development (TDD). There are some problems with the agile model, like defects in understanding and managing the requirements and the extent of changes. Some of the agile methodologies are more difficult to understand than linear ones. They cannot deal with the operations team to get faster delivery, as the relationship between the development and operations teams is very poor. The release process takes too long, so the time taken to deliver to the market will be long too.
3.3 DevOps It has several definitions, but the most popular is: the practice of the operations and development teams working together in the entire software life cycle, from design through development to production support, as shown in Fig. 3. It covers the entire lifecycle, including business planning and creation in addition to delivery and feedback. The DevOps community improves the communication between the operations team and the development team to ensure that the developers understand the issues associated with operations. Another definition is that it empowers the developers and operations team through the automation of the release process, pushing features to the customer faster and more safely. DevOps is a culture and a toolset, not only tools [1].
Fig. 3. DevOps life cycle
The main goal of DevOps is the mature integration between development and operations through continuous delivery: straight through to release, from code check-in to the final stage (test or production as required).
3.4 Water-Scrum-Fall (Mixture of Traditional and Scrum) Water-Scrum-Fall is a combined approach to application lifecycle management that combines the waterfall and scrum methodologies. The dev team uses a waterfall approach that considers the software process as one large project. After finishing the project, the
Fig. 4. Water-scrum-fall life cycle [11]
release team passes the software to an operations team for installation and maintenance, as shown in Fig. 4. This model consists of three stages:
a) Water: The team starts to understand the idea of the project by collecting all of the information needed in the user stories or backlog to state the flow of the project.
b) Agile – Scrum: This is the iterative stage. It involves starting to rearrange the backlog according to priority, in addition to starting to define detailed user stories for each sprint. At the end of each sprint they will conduct tests for this sprint.
c) Fall: The QC team will conduct regression tests on the product and release the deployed version as is [12].
Fig. 5. Hybrid V- model life cycle [13]
150
A. M. Morsy and M. A. A. Mostafa
Finally, the team and the customer can apply “Waterfall-At-End” to build testing and acceptance to high level. This will complete the process in order to deploy the system in the customer’s environment as shown in Fig. 5 [13].
4 Related Work Data mining nowadays plays a big role in identifying a lot of diseases. They use the patient’s data to get the information used in decision-making in order to cure the disease and predict it, such as. K.R.Thakre and R.Shende [14] used Apriori-hybrid algorithm is mix of weighted Apriori Algorithm and Apriori Hash T algorithm. The algorithm is applied to difficult text types such as semi-structured data set. It minimizes the processing time and the number of rule generated but it will generate item sets recurrently. This take more time and is not 100% accurate. A. S. Sarvestani and A. A. Safavi [15] presented an analysis of the prediction related to breast cancer patients using data mining techniques. G. S. M. Tech [16] developed decision support as apart of Heart Disease Prediction System using data mining technique based on the Naïve Bayes algorithm. Sheila A. Abaya [17] present modified Apriori algorithm that provided less database access compared to the Apriori one. This makes it faster by introducing new parameters like set size and set size frequency. This were used to remove non-significant candidate keys. M. Sornalakshmi, S. Balamurali [18]. They enhanced the Apriori algorithm using context ontology (EAA-SMO). They make use of the relationship between patient data through the generated rules. They improved execution time by 25% in comparison to the semantic ontology after collecting the data from the medical sensors. They analyzed, normalized the data and then applied context ontology [1].
5 Proposed Model Big organizations require a high degree of communication between the teams and automation between the data scientists and software engineers. This by establishing processes for software that contains build, integration, and deployment in use of DevOps and automation. Automation is the key for good connection and participation between the development and operations teams. One of the big organizations is hospitals and clinics. This paper needs to find a method to help clinicians to make better decisions and diagnoses. We have built an application used an Apriori algorithm based on AWS. Any application must pass through the following phases: plan, code, build, testing, monitor and feedback [19]. The planning and code phase will not be automated phases, but build, test, monitor feedback will be automated by using tools that facilitate the managing of the automation process easily. Some of the popular tools have been explained in Fig. 7 and the tools used in this model in each phase explained in Table 1.
Fig. 6. Proposed model life cycle
5.1 Dataset The Dataset used in this research contains 13,513 patient records. It contains the indicators of Indicator Category, Indicator Year, Gender, Race/Ethnicity, Value, Place, BCHC, Requested Methodology, and Source. The data was from 26 of the nation’s largest and most central cities and it is named “Big City Health Data” [21]. 5.2 Approach of Hybrid DevOps Model
Fig. 7. DevOps tools [20]
Table 1. Tools used in each phase
Phase          Tool used
Plan           Visio/Word
Code           Visual Studio
Build & test   Code Build (AWS)
Deploy         Code Deploy (AWS)
Monitor        AWS monitoring (CloudWatch)
Integration    GitHub
5.2.1 Waterfall Phase
Initiation phase: Agile uses an approach where there is no detailed planning and there is clarity only on future tasks. On the other hand, waterfall does not focus on changing the requirements, and the customer has very clear, documented requirements. This model, however, focuses on having strong documentation that is updated throughout the project life cycle. In this phase the project charter is created, along with the project plan, cost plan, resources plan, project schedule and project feature list. The most important thing in this phase is cleaning the dataset and pre-processing it so the data scientist will be able to work with the dataset, because this is an endless and complex task that must be clear at the beginning.
Planning phase: This is the preparation phase. In this phase the team starts to gather requirements to be able to create project sprints and user stories and their priorities, based on the feature list or project backlog created in the previous phase, in addition to creating the Work Breakdown Structure (WBS); it also starts to design the application and get a sample prototype. The output of this phase will be the Business Requirement Diagram (BRD) and user stories. Agile has a lack of documentation; it might write user stories on a board and spread tasks in stand-up meetings, but in this model the Azure DevOps tool is used for writing sprints, user stories, tasks and test cases to overcome problems in communication between teams, especially when a new team member comes onboard, and to be better managed [22].
5.2.2 Agile Phase
Agile methodologies are used by businesses to adapt to the changes of customers and get better feedback. Adapting to quick changes is not easy in a business culture. DevOps allows us to take the business perspective into consideration and prioritize the product backlogs. This is a continuous process of analysis, design, execution and getting feedback from the customer, and the cycle continues.
Analysis: In this phase the team starts to analyze the user stories and project requirements for the current iteration and starts to assign tasks.
Design: The designers start to create a mock-up of the UI and user interface simulations, together with starting the coding phase (development phase).
Code: This is the phase in which the complete development of the project is done. After completion of the development, the project is built and tested. Before starting the coding phase, some steps are taken to set up the AWS environment:
1. Create an AWS account.
2. Download the GitHub application and configure it.
3. Create an AWS policy for CodeDeploy and CodePipeline.
4. Create a service role for CodeDeploy, then attach it to the policy.
5. Create an EC2 instance to host the application.
6. Upload the application on GitHub; this application is based on ASP.NET Core.
7. Create a new application on CodeDeploy and a deployment group.
8. Create a new application on CodeBuild.
9. Create a CodePipeline, then deploy.
Testing: The testing team will apply all test cases for each sprint to make sure that the project is bug-free and that its parts are compatible with each other. The project is tested very frequently through the release iterations, minimizing the risk of any major failures in the future. The Quality Assurance team creates a series of tests in order to ensure the code is clean and the business goals of the solution are met.
5.2.3 DevOps Phase
1. Continuous Build and Test: CodeBuild is one of the best options for building. It provides a fully managed build service, scales automatically, and can run one build or dozens of builds simultaneously. The build is defined as code using a buildspec.yml file; this file is used to define artifacts, pre-build commands, build commands and post-build commands. Testing is automated through tools and scripts that run automatically after committing code.
2. Continuous Deploy: This is the heart of DevOps and the center point of software delivery. Small change sets are deployed using automated tools so that deployment will be repeatable and safe. Teams that fully embrace CD can deploy to production multiple times a day. This allows fast delivery and feedback loops.
3. Continuous Monitor: CloudWatch allows monitoring of the application through alerts and dashboards.
4. Continuous Integration: This means collecting all the changes made to the project for the whole team, not restricting them to a specific machine, and validating the code with comments. This is a cyclic process across all the development phases, where all team members make small batches of changes and commit them. CI helps to prevent merge conflicts and rework. It also helps the team quickly identify breaking changes in the code phase, i.e., “catch and fix problems quickly”.
5. Closing: This is the last phase of the project. We have already delivered the project to the customer and are waiting for final feedback to close and sign contracts; some documents must be closed, like the issue log, and all deliverable documents must be handed over to the user.
5.3 The Procedure of the Identify Frequent Diseases Application
Fig. 8. Identify frequent diseases application diagram
Figure 8 shows all the steps performed by the application, from start to end, to obtain the final results. They are explained in detail as follows:
1) Preprocessing: After opening the application, the user can upload the dataset file; the application first performs the following steps:
1. Check whether the file exists or not.
2. Check the file extension; only txt files may be allowed.
3. Remove all spaces in the dataset.
Some improvements have been made to get better results, such as shrinking the original dataset by eliminating useless data. The original dataset contains many non-required parameters like method, race, and notes. The application needs only the diseases and the year, so any other data has been deleted as useless transactions.
2) Implementing the Algorithm: the application applies two algorithms:
• APRIORI ALGORITHM: The first pass of the algorithm counts item occurrences to determine the large 1-itemsets (L1). In each subsequent pass, candidate itemsets are generated from the large itemsets of the previous pass, the support of each candidate itemset is calculated by scanning the database, and candidate itemsets whose support is below the minimum support are removed. This process is repeated until no new large itemsets are identified. Fig. 9 shows the algorithm steps.
Fig. 9. Apriori algorithm [23]
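To make the passes described above concrete, the following Python sketch illustrates the classic Apriori loop. It is not the authors' implementation; the function name, the transactions/min_support parameters and the frozenset representation are illustrative assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with support >= min_support.
    transactions: list of sets of items; min_support: absolute count threshold."""
    # Pass 1: count single items to build the large 1-itemsets L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {iset for iset, c in counts.items() if c >= min_support}
    frequent = {iset: counts[iset] for iset in large}

    k = 2
    while large:
        # Candidate generation: join large (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k - 1))}
        # Count support by scanning the database once per pass.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep only candidates whose support reaches the minimum support.
        large = {c for c, n in counts.items() if n >= min_support}
        frequent.update((c, counts[c]) for c in large)
        k += 1
    return frequent
```

For example, with transactions = [{'flu', 'diabetes'}, {'flu'}, {'flu', 'diabetes'}] and min_support = 2, the sketch returns {'flu'}, {'diabetes'} and {'flu', 'diabetes'} with their counts (illustrative data, not taken from the paper's dataset).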
• APRIORI-TID ALGORITHM: Apriori-TID has the same candidate generation function as Apriori. Its most important feature is that it does not use the database to count support after the first pass; instead, each transaction is encoded by the candidate itemsets it contains, and this encoding is used for counting in the later passes. In the later passes, the size of this encoding becomes much smaller than the database, thus saving memory overhead and CPU utilization. Fig. 10 shows the algorithm steps.
Fig. 10. Apriori-TID algorithm
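A minimal sketch of the TID-list idea follows, under the same illustrative assumptions as the previous block: after the first pass each transaction is replaced by the set of candidate itemsets it contains, so later passes count support against this shrinking encoding rather than the raw database.

```python
from itertools import combinations

def apriori_tid(transactions, min_support):
    """Apriori-TID sketch: support counts after pass 1 use an encoded candidate
    set per transaction, not the original database."""
    # Pass 1 over the database: large 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {i for i, c in counts.items() if c >= min_support}
    frequent = {i: counts[i] for i in large}
    # Encode each transaction as the set of large 1-itemsets it contains.
    encoded = [{i for i in large if i <= t} for t in transactions]

    k = 2
    while large:
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        counts = {c: 0 for c in candidates}
        new_encoded = []
        for entry in encoded:
            # A candidate is contained in the transaction iff all of its
            # (k-1)-subsets appear in the transaction's current encoding.
            present = {c for c in candidates
                       if all(frozenset(s) in entry for s in combinations(c, k - 1))}
            for c in present:
                counts[c] += 1
            new_encoded.append(present)
        encoded = new_encoded   # the encoding shrinks as passes go on
        large = {c for c, n in counts.items() if n >= min_support}
        frequent.update((c, counts[c]) for c in large)
        k += 1
    return frequent
```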
3) Select Distinct Values: To apply the algorithm, the distinct values are first selected by checking whether the list contains the needed items or not, splitting the items by # for the purpose of separation (a sketch of the preprocessing and distinct-value steps follows the lists below).
A. Advantages: By applying the previous improvements, the following results are gained:
1- The lack of documentation in agile is overcome by using Azure DevOps tools.
2- Time is reduced by more than 15%.
3- Costs are saved by paying only for the resources used.
4- Communication between the teams is improved through the use of an automated system with CI & CD.
5- A highly available and scalable system is obtained through use of the cloud.
B. Disadvantages:
1- The accuracy is about 80%.
2- The results differ based on the dataset.
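As referenced above, a minimal Python sketch of the preprocessing and distinct-value steps (file check, extension check, space removal, splitting on #). The file handling, the .txt restriction and the helper name are illustrative assumptions, not the authors' code.

```python
import os

def load_transactions(path):
    """Validate the dataset file, clean it, and return a list of itemsets
    plus the distinct items found (split on '#')."""
    # 1. Check whether the file exists.
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    # 2. Check the file extension; only .txt files are allowed here.
    if not path.lower().endswith(".txt"):
        raise ValueError("only .txt files are allowed")
    transactions, distinct_items = [], set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # 3. Remove all spaces, then split the items by '#'.
            cleaned = line.replace(" ", "").strip()
            if not cleaned:
                continue
            items = set(cleaned.split("#"))   # a set keeps only distinct values
            transactions.append(items)
            distinct_items |= items
    return transactions, distinct_items
```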
6 Results These results show that building the hybrid model using cloud computing is the best choice. As shown in Tables 2 and 3, cloud computing saves the total cost
by 60% compared with on-premise infrastructure. It also decreases the time by 15% compared with on-premise. Saving cost and time will encourage data scientists to build their applications using cloud computing and DevOps. Figure 11 shows the execution time of both algorithms (Apriori and Apriori-TID) on a local machine.
Fig. 11. Execution time on local machines (msec.)
Figure 12 shows the execution time of both algorithms (Apriori and Apriori-TID) on AWS. It also shows that when the application runs on AWS (the cloud) using DevOps tools, the execution time is reduced, so the release time is reduced as well. Finally, this shortens the time to market.
Fig. 12. Execution time on AWS
Figure 13 shows the execution time of both algorithms (Apriori and Apriori-TID) on both the local machine and AWS; it is a comparison between run times on local infrastructure and the cloud.
Fig. 13. Execution time on AWS and local
Figure 14 shows the CPU utilization for both algorithms, in megabytes.
Fig. 14. CPU utilization
Table 2 explains the advantages of hybrid DevOps over agile with respect to specific parameters, showing that the DevOps model overcomes most of agile's weaknesses and how it covers those points.

Table 2. DevOps vs Agile

Parameters | Agile | Hybrid DevOps | Discussion
Software delivery | Medium (2-4 weeks) | High (every day) | With DevOps you can now deliver dozens of small updates every day
Continuous feedback | Medium | High | Customer feedback is involved in each sprint in both models, but in DevOps feedback also comes from the team
Communication with users and business stakeholders | Medium | High | Communication between teams in agile stops at development, but DevOps brings all development and operation teams into collaboration
Easy to manage | High | High | Both models are easy to manage because of sprints
Individual dependency | High | Low | One of the agile manifesto principles values individuals over tools, which makes agile highly dependent on the team; in DevOps the dependency is lower because of automation
Support automation | Low | High | One of the powerful components of DevOps is automation, while agile does not focus on automation
Documentation | Low | Medium | The agile manifesto gives more importance to working software than to documentation, so documentation in agile is very poor; in DevOps it is better managed because of DevOps tools that allow you to manage and keep track of user stories
Table 3 summarizes all the phases required to build the project with agile and with DevOps on cloud computing. It includes the cost and time for all phases and shows the difference in the total time and cost required to build the environment for a data analytics project using the cloud computing and on-premise infrastructure methods.

Table 3. Phases to build the model with agile and DevOps

All phases | Agile | Hybrid DevOps (AWS)
Cost | 437$ per month | 16$ per month
Hours | 110 h | 90 h
7 Conclusion This research proposes using a technique based on association rules to obtain frequent diseases by year and to extract the confidence and support for each itemset. This is intended to help doctors and related sectors in the healthcare environment to make decisions. We used DevOps tools and the cloud to make the compilation faster and to reduce the cost and the CPU utilization. The dataset used in this research contains 13,513 patient records. We used AWS as it can handle big datasets (making the system more scalable). The system can be available 24/7 without downtime.
8 Future Work For future work, we recommend implementing this model using the Hadoop VMware platform and NoSQL on AWS to be able to handle larger big data sets. Hadoop is a large-scale distributed framework for multiprocessing of massive data. Based on the Google File System and Google's MapReduce, Hadoop is an open-source project of Apache. Given that Hadoop is deployed on a large cluster of computers, the size of the Hadoop Distributed File System (HDFS) depends on the size of the cluster and the hardware used. HDFS ensures fast and scalable access to the information.
References 1. Virmani, M.: Understanding DevOps & bridging the gap from continuous integration to continuous delivery. In: 5th Int. Conf. Innov. Comput. Technol. INTECH 2015, no. Intech, pp. 78–82 (2015), doi: https://doi.org/10.1109/INTECH.2015.7173368. 2. Borgenholt, G., Begnum, K., Engelstad, P.: Audition: a DevOps-oriented service optimization and testing framework for cloud environments. Nor. Inform., vol. 2013, no. July (2013) 3. Bruneo, D., et al.: CloudWave: where adaptive cloud management meets DevOps. In: Proc. - Int. Symp. Comput. Commun., vol. Workshops, no. July 2016 (2014), doi: https://doi.org/ 10.1109/ISCC.2014.6912638. 4. Li, Z., Zhang, Y., Liu, Y.: Towards a full-stack develops environment (platform-as-a-service) for cloud-hosted applications. Tsinghua Sci. Technol. 22(1), 1–9 (2017). https://doi.org/10. 1109/TST.2017.7830891 5. https://docs.aws.amazon.com/. Accessed 26 Dec 2020 6. Ilayaraja, M., Meyyappan, T.: Mining medical data to identify frequent diseases using Apriori algorithm. In: Proc. 2013 Int. Conf. Pattern Recognition, Informatics Mob. Eng. PRIME 2013, pp. 194–199 (2013). https://doi.org/10.1109/ICPRIME.2013.6496471. 7. Kim, G.: Top 11 Things You Need to Know about DevOps, pp. 1–20. IT Revolut. Press (2015) 8. Mohamed, S.I.: DevOps shifting software engineering strategy value based perspective. IOSR J. Comput. Eng. Ver. IV 17(2), 2278–2661 (2015). https://doi.org/10.9790/0661-17245157 9. Chen, H.-M., Kazman, R., Haziyev, S.: Agile big data analytics for web-based systems: an architecture-centric approach. IEEE Trans. Big Data 2(3), 234–248 (2016). https://doi.org/ 10.1109/tbdata.2016.2564982 10. https://www.slideshare.net/cloudifysource/dice-cloudify-quality-big-data-made-easy. Accessed 21 Dec 2020 11. Dabney, J.B., Arthur, J.D.: Applying standard independent verification and validation techniques within an agile framework: identifying and reconciling incompatibilities. Syst. Eng. 22(4), 348–360 (2019). https://doi.org/10.1002/sys.21487
12. West, D.: Water-scrum-fall is the reality of agile for most organizations today. Appl. Dev. Deliv. Prof., 1–15 (2011) 13. Hayata, T., Han, J.: A hybrid model for IT project with scrum. In: Proc. 2011 IEEE Int. Conf. Serv. Oper. Logist. Informatics, SOLI 2011, pp. 285–290 (2011), https://doi.org/10.1109/ SOLI.2011.5986572 14. Thakre, K.R., Shende, R.: Implementation on an approach for mining of datasets using APRIORI hybrid algorithm. In: Proc. - Int. Conf. Trends Electron. Inform., ICEI 2017, vol. 2018, pp. 939–943 (2018), https://doi.org/10.1109/ICOEI.2017.8300845 15. Sarvestani, A.S., Safavi, A.A., Parandeh, N.M., Salehi, M.: Predicting breast cancer survivability using data mining techniques. ICSTE 2010 – 2010 2nd Int Conf. Softw. Technol. Eng. Proc. 2, 227–231 (2010). https://doi.org/10.1109/ICSTE.2010.5608818 16. Subbalakshmi, G., Ramesh, K., Chinna Rao, M.: Decision support in heart disease prediction system using Naive Bayes. Indian J. Comput. Sci. Eng. 2(2), 170–176 (2011) 17. Abaya, S.A.: Association rule mining based on Apriori algorithm in minimizing candidate generation, vol. 3, no. 7, pp. 1–4 (2012) 18. Sornalakshmi, M., Balamurali, S., Navaneetha, M.V.M., Lakshmana, K., Ramasamy, K.: Hybrid method for mining rules based on enhanced Apriori algorithm with sequential minimal optimization in healthcare industry. Neural Comput. Appl. 2, 1–14 (2020). https://doi.org/10. 1007/s00521-020-04862-2 19. Soni, M.: End to end automation on cloud with build pipeline: the case for DevOps in insurance industry, continuous integration, continuous testing, and continuous delivery. In: Proc. – 2015 IEEE Int. Conf. Cloud Comput. Emerg. Mark. CCEM 2015, pp. 85–89 (2016), doi: https:// doi.org/10.1109/CCEM.2015.29. 20. https://www.edureka.co/blog/devops-tools. Accessed 21 Dec 2020 21. https://www.kaggle.com/noordeen/big-city-health-data. Accessed 21 Dec 2020 22. Mamatha, C., Ravi Kiran, S.C.V.S.L.S.: Implementation of DevOps Architecture in the project development and deployment with help of tools. International Journal of Scientific Research in Computer Science and Engineering 6(2), 87–95 (2018). https://doi.org/10.26438/ijsrcse/ v6i2.8795 23. Palancar, J.H., León, R.H., Pagola, J.M., Hechavarría, A.: A compressed vertical binary algorithm for mining frequent patterns. In: Lin, T.Y., Xie, Y., Wasilewska, A., Liau, C.J. (eds.) Data Mining: Foundations and Practice. Studies in Computational Intelligence, vol. 118. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78488-3_12
IoT for Diabetics: A User Perspective Signe Marie Cleveland and Moutaz Haddara(B) Kristiania University College, Prinsens gate 7-9, 0107 Oslo, Norway [email protected]
Abstract. Internet of Things (IoT) technologies aim at creating a better world for human beings and are widely adopted in several industries, including healthcare. One area within the healthcare industry where humans are directly impacted by the use of IoT is the monitoring of patients with chronic diseases, such as diabetes. The number of patients living with diabetes has increased drastically in recent years and is predicted to keep rising. The use of IoT technologies in monitoring diabetics' health, such as continuous glucose monitoring (CGM) and Insulin infusion pumps, is believed to improve the life-quality of diabetic patients. Although IoT offers numerous benefits in monitoring diabetes, there are several challenging issues concerning privacy and security. Existing reports show that, out of those using IoT in healthcare, 89% have suffered an IoT-related security breach, yet more than 80% of consumers are willing to wear technology to monitor their health. In this study, we have conducted a review of the literature and complemented it with semi-structured interviews. Our findings suggest that diabetics consider their data not to be valuable to privacy violators and consider the life-improving qualities of using those devices more rewarding than the potential harm that abuse of their data may cause. Keywords: IoT · Internet of Things · Healthcare · Diabetes · CGM
1 Introduction Internet of Things (IoT) technologies play an important role across various areas of our modern-day life. IoT enables the possibility to measure, infer, process and analyze data through sensors and actuators that blend seamlessly into different environments, and to share the information across other platforms for analysis [1–3]. IoT has revolutionized several industries with new possibilities and opportunities, especially within the healthcare sector. One of the applications in healthcare where humans are critically and directly impacted by these IoT technologies is the set of applications related to patients with chronic diseases [4]. Integrating IoT in healthcare devices can improve the quality and effectiveness of patients' treatment, and their life-quality [5]. Over the past decades, the number of patients living with diabetes has increased significantly worldwide, and it is expected to surge even more in the decades to come [6]. In Norway, there are currently ca. 28,000 individuals diagnosed with type 1 diabetes [7]. Type 1 diabetes occurs when the body is unable to produce Insulin itself, and Insulin injections are needed in order to balance the glucose levels [8]. As of today, type 1 diabetes
is treatable but incurable. If the disease isn't monitored and treated, it will in most cases end the patient's life [5, 9]. One way of monitoring diabetes is through the use of different body sensors. Extant literature has discussed the potential of IoT implementations in healthcare and how IoT can, in theory, improve the life of diabetic patients, but there is little to no research covering the patients' perceptions of how the use of IoT diabetes-related devices has impacted their lives. Also, there are risks regarding security and privacy when using IoT health devices, such as potential leaks of information, theft of equipment and harmful intentions, yet patients keep using these IoT devices. The aim of this study is to investigate how diabetic patients experience the transition from manual diabetes monitoring equipment to IoT-based equipment, how this has impacted their lives, and what their perceptions and concerns are regarding data storage and security and privacy breaches involving their own equipment. The rest of this paper is structured as follows: first, the theoretical background and previous literature are presented in Sect. 2. In Sect. 3 the research methodology is illustrated, followed by the main findings and discussion in Sect. 4. Finally, a conclusion, study implications and recommendations for future research are presented in Sect. 5.
2 Literature and Theoretical Background 2.1 Internet of Things in Healthcare The Internet of Things refers to the interconnection of physical objects and actors, by outfitting them with sensors, actuators and means to connect to the Internet [2, 10]. The term “Internet of Things” was first coined in 1999, in the context of supply chain management. Throughout the last decade, however, the definition has been adjusted and now covers a wider range of applications like healthcare, utilities, transport, among others [1]. Ultimately, one of the goals of IoT is to enhance the world for human beings, where objects around us have knowledge about what we like, need and want, and act accordingly without explicit instructions or commands [11]. With the advancements in wireless technologies, such as Bluetooth, radio frequency identification (RFID) and near-field communication (NFC), embedded sensors and actuator nodes, IoT is playing a huge part in transforming the Internet and the world we live in [1, 12, 13]. The number of interconnected devices is growing at a rapid pace and exceeded the number of people connected to the Internet worldwide already in 2011 [1, 11]. Over the recent years, smartphone-controlled sensors have emerged [4], and there are projections that by the end of 2020 there will be around 50 billion interconnected devices, which is seven times the world’s population [14]. Since healthcare represents one of the most attractive areas for development and implementation of IoT, hence the acclimatization of IoT in the healthcare sector is expected to flourish in the next years [4]. All around the world, healthcare organizations are transformed by adopting systems that are more effective, coordinated and usercentered into use [14]. More than ever before, the healthcare applications become more personalized, cost-effective, scalable, and capable of achieving successful outcomes [4]. Specifically, the implementation of IoT in healthcare is said to improve the effectiveness and quality of service for patients with chronic conditions and diseases [9].
In order to effectively monitor patients, smart healthcare devices and systems usually integrate sensing technologies with other applications, platforms, or devices [5]. IoT gives a seamless ecosystem to realize the vision of ubiquitous healthcare using body-area sensors, and IoT backend systems to upload data to servers, e.g. a Smartphone can be used for communication along with interfaces/protocols like Bluetooth for interfacing sensors measuring physiological parameters [1]. For some patients, like diabetics, continuous tracking and collection of data is vital. Wearable devices are one of the technologies that help collecting real-time data about different bodily measurements and organs’ performance, like heart rate and glucose levels [9]. According to literature, more than 80% of consumers are willing to wear technology to monitor their health, and this trend is expected to rise [15]. As patients become more comfortable and used to generating their own data, they are becoming more willing to share it with their medical specialists [15]. A study by [16] shows that, patients perceive systems developed to monitor their health as tools that aid them in making better and informed decisions based on the recommendations that are generated by these systems, or the healthcare personnel, which also have access to the patients’ data. 2.2 Patient Monitoring Through the years, sensors and actuators have become more powerful, cheaper and smaller in size [17], and have played a huge part in overcoming healthcare-related challenges when used for patient monitoring purposes [9]. Networked sensors, which can be worn on the body, make it possible to gather information about the patient’s health and vital systems [9]. Sensor networks consist of one or more sensing nodes that communicate with each other in a multi-hop approach [11, 13]. Wireless Body Area Network (WBAN) is a three-layer architecture that comprises various wearable sensors, which is one the most recommended frameworks for remote health monitoring. Usually, the collected data from these sensors is transmitted to a gateway server through a Bluetooth connection. The gateway turns the data into an observation and measurement file that is stored remotely (usually on a cloud-based server) for later retrieval and analysis, which healthcare personnel can access through an online content service application [9]. As a result of IoT adoptions in healthcare, it is predicted that more healthcare personnel will perform house calls instead of having patients come to their offices [15], as IoT is able to deliver more valuable real-time data directly to healthcare facilities [5]. 2.3 Continuous Glucose Monitoring Continuous glucose monitoring (CGM) is an advanced method to monitor real-time glucose levels, by taking glucose measurements at regular intervals and translating the measures into data that generates glucose direction and rate of change [18]. The use of CGM is usually associated with improving the hemoglobin A1c (HbA1c) levels, and the real-time monitoring of glucose levels can provide significant clinical information that leads to timely intervention of hypo- and hyperglycemic episodes of diabetes [19, 20]. Statistics show that CGM systems can reduce patient’s long-term complications between 40% and 75%, due to proactive monitoring [9, 18].
Traditionally glucose levels are controlled by puncturing the skin, typically from the fingertip several times a day, in order to get a blood sample [20]. In the last 20 years, CGMs have continuously been developed and improved to become more accurate, longer lasting and smaller in size [19]. Several CGM devices are based on IoT solutions; a sensor is attached to the user’s body with an adhesive patch and consists of a transmitter with a tiny sensor wire that is inserted just under the skin by an automatic applicator [18]. This offers great relief to many patients, as the puncturing of skin only occurs when the sensors is being applied and at occasional calibrations (how often the sensor needs to be changed or calibrated varies by model) [20]. The transmitter sends real-time data wirelessly to a receiver or an app installed at the patient’s smartphone, usually via Bluetooth, for the user to monitor her/his glucose levels [21]. If the patient is using the app, the data can automatically be uploaded to medical personnel systems. The CGM receiver and app alert the user when certain glucose thresholds are reached and can, when used in combination with supporting Insulin pumps, temporarily suspend the Insulin infusion, or pump more Insulin to the body [18, 19]. 2.4 Insulin Pumps For controlling type 1 diabetes, Insulin pumps are considered to be the most effective therapy. Insulin pumps have a minicomputer that infuses Insulin consistently into the body through a tube, which either can be attached to the patient’s belt or stored in a pocket (needs to be replaced or refilled every few days) or be attached directly to the patient’s body [22]. Insulin pump manufacturers like Omnipod and Tandem offer pumps that employ IoT-technologies, where the pump is either controlled and monitored through a personal medical device (PDM), an app on smartphones or directly on the pump itself. While the current Insulin pumps by Omnipod and Tandem have no option for controlling Insulin infusions through apps, but Tandem’s integration with Dexcom sensors make it possible to read Insulin infusion data in the app [23, 24]. In addition, the diabetesrelated software company Tidepool is currently developing an app (Tidepool Loop), that will enable Bluetooth communicating Insulin pumps from Omnipod and CGMs from Dexcom to automate Insulin dosage [25]. The Tidepool Loop-app will also make it possible to control the pump from the phone [26]. Both sensor and Insulin data are possible to upload to Diasend; a cloud-based management platform for diabetics and diabetes healthcare personnel, with the main purpose of making diabetics’ lives, and the work of healthcare personnel easier by optimizing diabetes data management [27]. 2.5 IoT Privacy Concerns and Challenges In their privacy policies, the majority of diabetes technology companies state that they gather information including IP addresses, Internet Service Provider’s information, web browser types, and other information related to the user’s computer applications, location, and Internet connection. In addition, they also gather other data related to the user’s activity while using their products and services. They further argue that they anonymize any personal information, without specifically elaborating on how this process is done.
In most privacy policies, the companies disclose that as long as the personal data isn’t identifiable, then it can be used for any purpose. Unfortunately people with diabetes are faced with accepting such privacy policies or they won’t be able to use the products and devices [28]. Although IoT offers numerous benefits within the healthcare sector, several challenging issues need to be resolved before the various applications of IoT get widely accepted and adopted [29]. Previous reports have revealed that from those using IoT in healthcare, the majority have suffered an IoT-related security breach [30]. Thus, privacy and security are naturally the main barriers for those who are considering to start using IoT-technologies, specifically for healthcare-related purposes [9]. Atzori et al. [29] argue that people will resist IoT adoptions if there is no public confidence that IoT won’t cause serious violations and threats to their privacy, because the concept of privacy is deeply rooted into our civilizations. Yet, as argued by [28], in many cases, the patients are faced with an ultimatum, either to accept the privacy terms or don’t use the device/service, which leads many patients to adopt the technology despite their privacy concerns. In general, several scholars and practitioners argue that IoT technologies are generally prone to attacks; as several components are mostly unattended and therefore easy to be physically attacked [29]. On the other hand, some research argue that most security threats occur due to unauthorized access to data, data breaches and impersonations [31]. Also, there are still critical gaps in the collection and transferring of patient information and data to healthcare personnel in current healthcare systems. In addition, human errors can occur in the handover and data handling among healthcare personnel [9]. These types of breaches and errors could result in life-threatening scenarios [9]. One of the potential security threats for IoT within healthcare is eavesdropping of wireless communications, such as CGMs and Insulin pumps’ communications. Eavesdropping can potentially harvest medical information for non-health purposes, such as non-life threatening scenarios like influencing the selection of which candidate to employ based on vital signs during an interview or leaking information to insurance companies that will decide whether or not you will get insured, and other life-threatening scenarios like remotely perform harmful treatments e.g. deliver high doses of Insulin to diabetic patients [32]. According to [22], Insulin pumps are one of the riskiest types of PDMs due to their functionality, connectivity to other components and critical role in patient treatment. Many of today’s healthcare IoT devices use secure methods to communicate with the cloud, however, they could still be vulnerable to hackers both stealing and misusing personal information and can be used for physical harm [9, 33]. Both Pulkkis et al. [21] and Barati and Rana [34], advocate for the use of blockchain technologies to encrypt the data collected from wearable sensors as a mean to protect the patient’s information and health against security breaches, which also supports the General Data Protection Regulations (GDPR). Despite the challenges related to IoT-technologies, it’s still believed to continue expanding and to play an important role in the healthcare industry worldwide [9].
3 Research Methodology This research employed an explorative qualitative case study methodology [35], where the primary sources for data collection are document analysis and semi-structured interviews [35, 36]. Document analysis is often employed as a mean of triangulation with other qualitative research methods in order to enrich the understanding of a topic. Qualitative researchers are expected to draw upon more than one source of evidence in order to seek validation through the use of different methods and sources of data [35]. Due to the narrow scope of this paper, we have identified only three (3) informants for interviewing in Norway (see Table 1). All the interviews were semi-structured and were held via video conferencing. The informants were selected by fulfilling the following criteria: (a) have been diagnosed with type 1 diabetes, (b) have been using manual equipment like Insulin pens and glucose meters, and (c) are currently using automatic equipment like wireless glucose measuring sensors and/or Insulin pumps. In order to comply with GDPR and Norwegian data regulations, all the informants are anonymized, and the interviews were not recorded. However, notes were taken and later have been transcribed. Prior to the interviews, an interview guide was developed based on previous research, consisting of 13 questions to be sure that all relevant questions for the research were covered [35]. The interview questions were directed towards: (I) how the informant experience his/her diabetes, (II) how the informant experience using diabetes equipment, and finally (III) the informant’s thoughts regarding privacy when using IoT technologies to control their diabetes. All interviews were conducted by the first author and were carried out in Norwegian, and later the notes were translated to English. The previous research review has been chosen through literature searches using Google Scholar, and by tracing relevant cited articles within the identified articles. The literature period was set between 2010 and May 2020. At first, searches were mainly targeting research from “The basket of eight”, the leading journals within Information Systems, however, no relevant publications were identified. Thus, the search was widened to include all journals and conferences within medicine, information systems and technology. In the selection process, the journal’s impact factor, the papers’ citations relative to the year of publication have been taken into consideration to find the most reliable research. For the literature search the keywords “Internet of Things”, “IoT”, “Healthcare” and “diabetes” were used in different forms, in combination with the keywords “patient monitoring”, “CGM”, “e-health”, “privacy” and “GDPR”. Table 1. Overview of informants. Diabetes details
Informant 1 | Informant 2 | Informant 3
Age when diagnosed with diabetes | 9 | 5 | 3
Years lived with diabetes | 13 | 21 | 24
Years using IoT diabetic device | 6 | 4 | 6
Current glucose sensor | Dexcom G5 | Dexcom G5 | Dexcom G6
Current Insulin pump | Tandem t:slim X2 | Tandem t:slim X2 | Omnipod
After the literature was identified, the abstracts were skimmed through by both authors to check for relevance for this research. When the potential articles were selected, they were read independently by both authors in order to identify existing knowledge and the main themes discussed in the existing body of knowledge. In total, 19 articles are included as the theoretical background for this case study.
4 Findings and Discussion The goal of this research is to investigate diabetic patients’ experience of the transition from manual diabetes monitoring equipment to IoT-based equipment and understand how this has impacted their lives. In addition, we explore their perceptions and thoughts about data storage, handling, security and privacy breaches regarding their own equipment. 4.1 Living with Diabetes Being diagnosed with diabetes is a life-changing event and can be difficult to deal with both physical and mentally. All of a sudden there are so many things that have to be changed and taken care of, like repeatedly painful needle punctures that may cause skin inflammations, in order to keep the glucose levels stable, and have proactive precautions and measures to prevent possible diseases related to diabetes. Living with diabetes type 1 is described as hard on both the patients, and their parents when diagnosed as a child. “For my mom, my diabetes became a full-time job. Because of it being difficult to control, she quit her job to take care of me.” – (Informant 1). The parents’ constant fear of their child’s life and they having to wake up several times a night to measure glucose levels on a child screaming, tossing and turning, terrified of the needles, are dreadful memories described by the informants. One informant described the fear of needles as still being a big issue, as both the Insulin pump and CGM needs to be changed regularly and the CGM to be calibrated twice a day. A general statement from all the informants is the embarrassment of having to do manual glucose measurements and Insulin injections at schools and in public places. Also, they felt uncomfortable when using previous Insulin pumps, which were visible due to their bulky size even under their clothes. Whilst one informant remembers not being bothered by others asking about the equipment, another detested all the extra attention. All informants explain that the first couple of years using Insulin pumps, when being teenagers, they often removed the pumps so people wouldn’t notice and often wished they still were using Insulin pens. All the informants explain that keeping up with the measurements was difficult, especially in the beginning, with the manual equipment. The transition to wireless glucose sensors and Insulin pumps has made this much easier. 4.2 Impact of IoT-Based Diabetes Devices In Longva and Haddara’s paper [9], they describe how IoT can potentially improve the life-quality of diabetic patients. In this study, all three informants confirm that after starting to use the IoT-based diabetes equipment, their lives have become more “normal”.
All of our informants have made the transition from manual glucose meters and Insulin pens to IoT-based sensors and Insulin pumps. Two of our informants described the old manual equipment as horrible and painful, and the transition to Insulin pumps reduced the number of needle punctures. One informant stated that the first Insulin pump made her more self-conscious about her diabetes in the beginning, because it was visible, but eventually got used to it. Whilst the informants don’t agree that the transition from pens to pumps were all great, they are unanimous in that the transition from manual glucose measuring to CGM has been a real life-changer over time for them. As they got adjusted to having the sensor attached to their body, and this has enhanced their control of their diabetes and enabled them to live more normal-like lives. In addition, supporting previous research [19, 20], they all experienced their HbA1c to stabilize after starting using CGM devices. The informants now feel more comfortable monitoring their diabetes in public as the devices has become smaller and less visible. Two informants are currently carrying just one control device; one of them reads the CGM data on the Insulin pump, and the other one reads it in the Dexcom mobile app. The third informant wasn’t aware of these possibilities, but when made aware, expressed what a positive impact that would make. Through the years the devices have become more seamlessly in their integration and that makes it easier to control diabetes and glucose levels, which in a long-term perspective can prevent other diseases as a consequent of diabetes, like eye diseases and lower limb amputation, which will also lower the cost of potential treatments for the Norwegian government; “The equipment itself is more expensive than Insulin pens and manual glucose meters, but long-term, it’s more beneficial for the society.” – (Informant 3). One informant stated that before starting using the IoT based equipment, she was a bit negligent towards her diabetes, but after starting using it she found it more interesting and has learned more about her own diabetes, and now feel more competent taking care of it. All three informants argue that they would prefer a one combined solution for CGM and Insulin infusions, such as the device provided by Tidepool [25]. As in this case, they will only need one item attached to their body, and for this one device to be controlled through an app, would be a huge enhancement for the equipment, and an additional positive impact on their life-quality. Two informants stated that, in order to reduce needle punctures even more, they would be willing to implant a chip with a sensor into their body for CGM, if it only had to be replaced every six months or so. On the other hand, one of the informants stated that implanting a device for Insulin is off the table, as if the Insulin container would burst inside the body, the chances of surviving would be marginal. 4.3 Healthcare Facility Visits All informants are currently doing their regular healthcare facility visits three to four times a year, as they did prior to using IoT devices. However, as their HbA1c has stabilized, the visits have become shorter and revolve around prescription renewals and HbA1c measurements, rather than previous adjustments to Insulin dosages and lifestyle changes.
Two informants predicted that in the future their need for visiting the healthcare facilities will be minimal, as the healthcare personnel already can check up on their diabetes data through the Diasend system connected to their devices - (Informant 1 and 2). One informant stated that “If they can see it without me visiting the office, why should I take up their time that could be used dealing with more important things when I’m in control?” - (Informant 3), which is an argument also made by [9], as it would benefit healthcare facilities in terms of e.g. expenses and workload. 4.4 Privacy, Safety, and Security Concerns Although IoT is extremely vulnerable to attacks [2, 13], security breaches in healthcare IoT could result in life threatening situations [5], however the informants describe their utter trust in the devices and their security. One informant even stated: “I know that might be naive, but I trust all technological devices.” - (Informant 2). All three informants consider the gains from having IoT equipment as more rewarding than the possible harm of getting their data leaked. However, all informants stated that they didn’t really know what kind of data is stored, other than their glucose levels, and possibly their Insulin infusions if uploaded to Diasend manually. When checking their apps and control devices, one informant found that limited personal information (name, e-mail address and phone number), in addition to some data about her glucose levels and Insulin infusions she had manually added, were stored in Diasend – (Informant 1). Another informant, on the other hand, found more revealing personal information (name, personal ID, a picture of herself, body measures and when she was diagnosed), in addition to her glucose levels that are automatically transferred from her Dexcom G6 sensor, were stored in the Diasend. After checking this, they still don’t consider the information registered to be sensitive enough to worry about possible breaches. As eavesdropping on healthcare IoT-devices potentially can lead to information being used for non-health purposes [37], the informants were asked about their thoughts regarding this. All informants find it unethical if someone with diabetes were to miss the opportunity to get a job, insurance or medical treatment, because, for instance, their HbA1c isn’t perfect. They do, however, consider their own diabetes as well taken care of and that it shouldn’t be a problem in case of medical information being a factor for them to get the treatment they will need if their data were leaked. The informants consider the risk of someone stealing their devices as a low-risk, because they don’t see much use of the devices for non-diabetics. If, however, the thief or hacker was having bad intentions and knew how to operate their devices, the informants’ first response was to remove the Insulin pump in order for the Insulin infusion to stop. The second response was that if they were being infused with Insulin before noticing the theft, they would be in a more critical situation. Some Insulin pumps limit the number of units to be infused in one session, yet several sessions can be performed repeatedly. “If my glucose level were 4 and someone infused 20 units, it would be a big issue. If they infused the whole storage, 200 units, I would probably die.” – (Informant 1). The possibility for someone stealing their phones, is however considered as bigger risk by the informants, but they didn’t think the diabetes apps would be of interest. 
If someone, however, were to steal the devices with bad and harmful intentions and deliberately wanted to hurt them, this was considered highly critical by our informants, as the devices and apps lack security: no security code is needed to access the devices/apps or to make changes.
5 Conclusions and Future Research In this study, diabetics' own perception of the improved life-quality from the use of IoT devices, together with their concerns regarding privacy and security, has been discussed. Our main findings confirm previous research arguing that healthcare IoT improves the life-quality of diabetes patients. The findings also show that, despite the major concerns around data privacy and the data leaks of recent years, and the extensive research regarding privacy and security breaches within IoT technologies, diabetics tend to be trusting with regard to how their personal health information is stored and used. Also, our findings indicate that diabetics consider their data not to be of interest to privacy violators and find the life-improving qualities of using the devices more rewarding than the potential harm related to their data being compromised. Predictions have been made that IoT adoption in healthcare can reduce the frequency of healthcare facility visits for patients with chronic diseases, and that the use of IoT in treating diabetic patients can reduce the long-term costs of diabetes complications for governments through proactive monitoring. As the number of interviews and the sample of informants in this study are not extensive enough to generalize the findings, and as this study only takes Norwegian patients' perceptions into consideration, further research should be done to widen the scope of perceptions on life-impact and security concerns, from both patient and healthcare personnel perspectives. Moreover, other studies in different countries or contexts may shed light on new findings.
References 1. Gubbi, J., et al.: Internet of Things (IoT): a vision, architectural elements, and future directions. Futur. Gener. Comput. Syst. 29(7), 1645–1660 (2013) 2. Sajid, O., Haddara, M.: NFC mobile payments: are we ready for them? In: SAI Computing Conference (SAI 2016). IEEE (2016) 3. Haddara, M., Larsson, A.: Big Data: Hvordan finne relevant beslutningsgrunnlag i store informasjonsmengder. In: Næss, H.E., Lene, P. (eds.) Metodebok for kreative fag, pp. 136– 149. Universitetsforlaget, Oslo (2017) 4. Islam, S.R., et al.: The internet of things for health care: a comprehensive survey. IEEE Access 3, 678–708 (2015) 5. Haddara, M., Staaby, A.: Enhancing patient safety: a focus on RFID applications in healthcare. Int. J. Reliable Quality E-Healthcare (IJRQEH) 9(2), 1–17 (2020) 6. WHO: Diabetes. https://www.who.int/health-topics/diabetes#tab=tab_1 (2020) 7. DiabetesForbundet: Diabetes Type 1. https://www.diabetes.no/diabetes-type-1/ (2020). Accessed 05 Jan 2020 8. School, H.M.: Type 1 Diabetes Mellitus. https://www.health.harvard.edu/a_to_z/type-1-dia betes-mellitus-a-to-z (2018). Accessed 23 Nov 2020 9. Longva, A.M., Haddara, M.: How can IoT improve the life-quality of diabetes patients? In: MATEC Web of Conferences. EDP Sciences (2019)
10. Dijkman, R.M., et al.: Business models for the internet of things. Int. J. Inf. Manage. 35(6), 672–678 (2015) 11. Perera, C., et al.: Context aware computing for the internet of things: a survey. IEEE Commun. Surv. Tutorials 16(1), 414–454 (2014) 12. Zanella, A., et al.: Internet of things for smart cities. IEEE Internet Things J. 1(1), 22–32 (2014) 13. Helgesen, T., Haddara, M.: Wireless power transfer solutions for ‘things’ in the internet of things. In: Arai, K., Bhatia, R., Kapoor, S. (eds.) FTC 2018. AISC, vol. 880, pp. 92–103. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-02686-8_8 14. Fernandez, F., Pallis, G.C.: Opportunities and challenges of the Internet of Things for healthcare: systems engineering perspective. In: 2014 4th International Conference on Wireless Mobile Communication and Healthcare-Transforming Healthcare Through Innovations in Mobile and Wireless Technologies (MOBIHEALTH). IEEE (2014) 15. Lerman, L.: Where healthcare IoT is headed in 2020. In: IoT Agenda. Techtarget.com (2020) 16. Gómez, J., Oviedo, B., Zhuma, E.: Patient monitoring system based on internet of things. Procedia Comput. Sci. 83, 90–97 (2016) 17. Da Xu, L., He, W., Li, S.: Internet of things in industries: a survey. IEEE Trans. Industr. Inf. 10(4), 2233–2243 (2014) 18. Galindo, R.J., et al.: Comparison of the FreeStyle Libre Pro flash continuous glucose monitoring (CGM) system and point-of-care capillary glucose testing in hospitalized patients with type 2 diabetes treated with basal-bolus insulin regimen. Diabetes Care 43(11), 2730–2735 (2020) 19. Rodbard, D.: Continuous glucose monitoring: a review of successes, challenges, and opportunities. Diabetes Technol. Therap. 18(S2), S2-3–S2-13 (2016) 20. Istepanian, R.S., et al.: The potential of Internet of m-health things “m-IoT” for non-invasive glucose level sensing. In: 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE (2011) 21. Pulkkis, G., et al.: Secure and reliable internet of things systems for healthcare. In: 2017 IEEE 5th international conference on future internet of things and cloud (FiCloud). IEEE (2017) 22. Kintzlinger, M., Nissim, N.: Keep an eye on your personal belongings! The security of personal medical devices and their ecosystems. J. Biomed. Informatics 95, p. 103233 (2019) 23. Omnipod: The Omnipod system provides non-stop insulin delivery. https://www.myomnipod. com/Omnipod-system (2020). Accessed 27 Dec 2020 24. Tandem: t:slim X2 Insulin Pump. https://www.tandemdiabetes.com/nb-no/home (2020). Accessed 27 Dec 2020 25. Tidepool: Tidepool Loop – Automated Insulin Dosing. https://www.tidepool.org/automatedinsulin-dosing (2020). Accessed 27 Dec 2020 26. Hannus, M., et al.: Construction ICT roadmap. ROADCON project deliverable report D52 (2003) 27. Diasend: What is Diasend?. https://support.diasend.com/hc/en-us/articles/211990605-Whatis-diasend (2020). Accessed 28 Dec 2020 28. Britton, K.E., BrittonColonnese, J.D.: Privacy and security issues surrounding the protection of data generated by continuous glucose monitors. J. Diabetes Sci. Technol. 11(2), 216–219 (2017) 29. Atzori, L., Iera, A., Morabito, G.: The internet of things: a survey. Comput. Netw. 54(15), 2787–2805 (2010) 30. Ponemon, I.: Sixth annual benchmark study on privacy & security of healthcare data. Technical report (2016)
31. Amaraweera, S., Halgamuge, M.: Internet of things in the healthcare sector: Overview of security and privacy issues. In: Mahmood, Z. (ed.) Security, Privacy and Trust in the IoT Environment, pp. 153–179. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-180 75-1_8 32. Venkatasubramanian, K., et al.: Interoperable medical devices. IEEE Pulse 1(2), 16–27 (2010) 33. Vollen, A., Haddara, M.: IoT aboard coastal vessels: a case study in the fishing industry. In: Awan, I., Younas, M., Ünal, P., Aleksy, M. (eds.) Mobile Web and Intelligent Information Systems: 16th International Conference, MobiWIS 2019, Istanbul, Turkey, August 26–28, 2019, Proceedings. LNCS, vol. 11673, pp. 163–177. Springer, Cham (2019). https://doi.org/ 10.1007/978-3-030-27192-3_13 34. Barati, M., Rana, O.: Enhancing User Privacy in IoT: Integration of GDPR and Blockchain. In: Zheng, Z., Dai, H.-N., Tang, M., Chen, X. (eds.) Blockchain and Trustworthy Systems: First International Conference, BlockSys 2019, Guangzhou, China, December 7–8, 2019, Proceedings. CCIS, vol. 1156, pp. 322–335. Springer, Singapore (2020). https://doi.org/10. 1007/978-981-15-2777-7_26 35. Bryman, A.: Social research methods. OUP, Oxford (2012) 36. Yin, R.K.: Case Study Research: Design and Methods. SAGE Publications, Thousand Oaks, CA (2009) 37. Perumal, A.M., Nadar, E.R.S.: Architectural framework and simulation of quantum key optimization techniques in healthcare networks for data security. J. Ambient Intell. Humaniz. Comput. 1–8 (2020)
Case Study: UML Framework of Obesity Control Health Information System Majed Alkhusaili(B) and Kalim Qureshi College of Life Sciences, Kuwait University, Sabah Al-Salem City, Kuwait
Abstract. This paper presents a model design, expressed through the Unified Modeling Language (UML), for an Obesity Control Health Information System. Due to the increasing rates of obesity around the world, and to support a better lifestyle for people, this design model is proposed in the form of UML diagrams. To fulfill the requirements of this paper, a complete case study scenario of the OCHIS is considered. The outcome of this model addresses the business market's need for such a system. Besides, it will encourage designers and developers to explore different ways of modeling an entire system using UML diagrams. Keywords: UML diagrams · Obesity Control Health Information System · System design
1 Introduction Model transformations play a significant role in software development systems [1–3]. Automation of repetitive and well-defined steps is considered to be the key advantage of every developed transformation. It helps to shorten design time and minimize the number of errors [2]. In an object-oriented approach, analysis, design, and implementation are the consistent signs of progress for any software development system [1]. Nowadays, many software tools support the idea of model transformations, for instance the Unified Modeling Language (UML) and Computer-Aided Software Engineering (CASE). Both have proven able to carry out several engineering activities, including modeling and diagramming [3]. However, CASE tools are not compatible with UML when it comes to hard tasks; UML's own integration tools have shown better results [1]. Due to that, UML has become the most widely used tool for modeling object-oriented software systems [1, 3]. UML is a set of loosely connected, diagram-centric design notations. It allows designers to work with independent views of a software system [4]. Additionally, UML provides various diagrams classified into two major levels, i.e. structural diagrams and behavioral diagrams [5]. Structural diagrams reveal the static aspects of the system under development, while behavioral diagrams address the dynamic aspects. By combining these two kinds of diagrams in UML, the chance of making the right decisions increases and the detection of inconsistencies becomes vital for the development of accurate software [5, 6].
UML has led to the completion of various electronic software models and has brought a lot of positive feedback regarding the improvement of information management [3]. An Electronic Health Information System (EHIS) is a great example of this type of electronic software model. An EHIS is a collection of structures and procedures arranged to collect information that supports health care management decisions at all levels of an entire health system [7]. It has the potential to simplify work processes, make procedures more accurate, identify each case separately, reduce the risk of human errors, and improve the performance of health service delivery [7, 8]. In this context, the Obesity Control Health Information System (OCHIS) is one of many eHealth platforms available in today's world. The idea of modeling the entire software of an obesity system has been a grey area that needs to be highlighted [4], and a method to establish such a system is considered a challenging issue that needs to be addressed. The focus of this paper is to analyze and model an entire OCHIS. Consequently, this study is performed using UML application tools. The rest of the paper is organized as follows:
2 System Document Vision This section evaluates the potential approach behind the entire system. Problem Description Obesity is becoming an alarming issue with implications affecting society and the healthcare sector. In response, multi-professional programs with physical activity, nutritional and psychological components have been proposed [9]. Still, due to limited resources, only a small number of patients can be included in these programs. A Health Information System (HIS) has the potential to tackle these challenges. Yet little is known about the design and effects of HIS in the domain of multi-professional obesity programs, in particular those tailored to children and adolescents. Therefore, to address this problem, a HIS software prototype to support obesity interventions needs to be built. System Capabilities The proposed system should have the capacity to:
• Collect and store health data and information about the patients.
• Collect data and information about food eating levels.
• Collect and store data and information about exercises.
• Keep track of health-related conditions.
Business Benefits Deploying the proposed system will bring about the following business benefits:
• Potential to improve outcomes.
• Reduce costs of health interventions.
• Providing the possibility for self-monitoring.
• Simultaneously supporting the patients' decision making through real-time access to relevant information.
3 System Work Breakdown Structure The framework parameters can be adjusted based on the needs and the available models as follows:
3.1 Analysis of Requirements and Specifications: Where the essential information of the system is collected. This includes:
1. Designing proposal and presenting to board – 4 h
2. Meeting with Obesity Department manager – 2 h
3. Meeting with several patients – 5 h
4. Identifying and defining use cases – 3 h
5. Identifying and defining requirements – 3 h
6. Developing workflows – 5 h
3.2 Designing Components of the Proposed System: Where some diagram types of the system are assigned. This covers:
1. Designing class diagrams – 2 h
2. Designing database schema – 4 h
3. Designing class diagram – 2 h
4. Designing architecture design diagram – 3 h
5. Designing entity relation diagram – 2 h
6. Designing state machine diagram – 2 h
7. Designing deployment diagram – 2 h
8. Designing package diagram – 2 h
9. Designing input and output screens – 4 h
10. Designing overall system architecture – 6 h
3.3 Implementing the Proposed System: This is to measure whether the system is clear, easy, consistent, and simple to follow. This includes:
1. Coding Graphical User Interface (GUI) layer components – 16 h
2. Coding logical components of the system – 12 h
3.4 Testing and Debugging the Proposed System: Where the performance of the entire system is examined. This covers:
1. Performing system functionality testing – 8 h
2. Performing user acceptance testing – 10 h
3.5 Deployment of the Proposed System: Where the interaction of the fundamental services is assessed. This includes:
1. Integrating the system within the organization – 10 h
2. Performing maintenance – 8 h
4 System Use Case A system use case is considered to be a major processing step in modeling realization [10]. It is a way to avoid system defects by determining, at an early stage, what could go wrong. According to the system requirements, the study demands the following descriptions:

Table 1. Definitions of the project use cases

Use case | Description
Look up patient | Using the patient name, find health data and information about the patient
Enter/update patient information | Enter new or update existing patient data and information
Look up contact information | Using the patient name, find contact information
Enter/update contact information | Enter new or update existing contact information
Enter/update meal plan information | Enter new or update existing meal plan information
Enter/update exercise plan information | Enter new or update existing exercise plan information
Table 1 describes the task of each system use case. It captures all the methods an end-user follows to operate the system. These methods act as requests to the system, working as the joint that links the system to the user. Each use case is performed as a sequence of simple steps, starting with the user goal and ending when that goal is fulfilled. Consequently, by looking at the use case diagram provided in Fig. 1, it can be concluded that there are two actors, the therapist and the patient. These actors represent the users of the system. Each actor has a specific role to play, assigned according to the requirements of the system.
Fig. 1. System use case diagram.
A therapist, for example, can view and modify all the tasks associated with the system, whereas the patient is limited to only some of these tasks. The idea behind this diagram is to define the relationship of each task to its actor.
5 System Preliminary Class Diagram
A class diagram can be defined as a static type of diagram, in that it represents the static view of an application. It describes the attributes, the associated methods of each class, and the constraints imposed on the system. It also works as a collection of classes, interfaces, associations, and collaborations which together form a structural diagram of the entire system. Based on the system requirements, a set of five classes has been modeled using the UML software program. All the associated classes are mapped with each other so that the entire system design is represented as one. The assigned numbers refer to the multiplicities, which describe a particular aspect of the related associations [4]. For instance, based on Fig. 2, at least one therapist has to examine one or more patients. Similarly, at least one therapist should provide one or more reports to the patient.
Fig. 2. A UML representation of object class diagram with attributes.
On the other hand, the patient may receive more than one exercise and meal plan, or none at all, depending on the assigned treatment plans.
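To make these multiplicities concrete, the sketch below expresses the associations of Fig. 2 in plain Python. The class and attribute names are illustrative simplifications (they loosely follow the schema introduced later in Table 2) and are not code taken from the system itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MealPlan:
    meal_id: int
    meal_name: str


@dataclass
class ExercisePlan:
    exercise_id: int
    exercise_name: str


@dataclass
class Patient:
    patient_id: int
    patient_name: str
    # A patient may hold zero or more assigned plans (0..* multiplicity).
    meal_plans: List[MealPlan] = field(default_factory=list)
    exercise_plans: List[ExercisePlan] = field(default_factory=list)


@dataclass
class Therapist:
    therapist_id: int
    therapist_name: str
    # A therapist examines one or more patients (1..* multiplicity).
    patients: List[Patient] = field(default_factory=list)

    def examine(self, patient: Patient) -> None:
        # Record the association between therapist and patient.
        if patient not in self.patients:
            self.patients.append(patient)
```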
6 System Activity Diagram
An activity diagram is another representation of UML models, which describes the dynamic aspects of the system. It consists of flowcharts that display the flow from one activity to another. Due to the various types of activity diagram flow features, the control flow from one operation to another can be either sequential, branched, or concurrent. These types of flows assist the reader in easily following and understanding the ordering of each task in the diagram [11]. Furthermore, the activity diagram provides tools that facilitate the use of all kinds of flow control such as fork, join, etc. The UML activity diagram in Fig. 3 explains the following process: the order starts when the system receives login verification from both the patient and therapist. Then, the system provides the details as organized. After that, the therapist selects the patient and initiates the order. Next, at the diamond-shaped flow node, the therapist evaluates the condition within the square brackets of the first flow. If the condition is yes, the therapist chooses that flow and moves to the complete-order node. Otherwise, the order is rejected and moves to the add-new-tasks node until it is merged as a complete order. Once the control flow reaches the merge node, the order fulfillment is considered complete. In the end, a close-order action is executed.
Fig. 3. System UML activity diagram.
7 System Design Database Schema
A database schema is a representation model that is usually created at the start of any system. It is planned out to build the infrastructure of the entire system before implementation; for that reason, once the schema is implemented, it is difficult to make any changes [12]. Table 2 shows the database schema of the required system. The obesity control database has been divided into four main classes. Each class has been defined with a set of fields, represented as attributes. Together, these attributes describe the structure of the database, which is often referred to as its schema. With the help of system developers and administrators, the schema can be converted into Structured Query Language (SQL).
Table 2. System database schema.

Class name | Attributes
Therapist | therapist_ID: integer [PK]; therapist_name: string; email_address: varchar; phone_number: integer; position: varchar
Patient | patient_ID: integer [PK]; patient_name: string; address: varchar; DOB: date; health_desc: string; comments: string
Meal plan | meal_ID: integer [PK]; meal_category: varchar; meal_name: string; meal_desc: string; meal_outcome: string
Exercise plan | exercise_ID: integer [PK]; exercise_category: varchar; exercise_name: string; exercise_desc: string; exercise_outcome: string
SQL is a modern technology tool used to create, manage, and secure the databases of a developed information system [12].
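As a rough illustration of that conversion, the following sketch renders the Table 2 classes as SQL tables using SQLite; the column types are adapted to SQLite's type system, and the statements are only an assumption about how the schema could be expressed, not the system's actual DDL.

```python
import sqlite3

# Create an in-memory database and declare one table per class in Table 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE therapist (
    therapist_id   INTEGER PRIMARY KEY,
    therapist_name TEXT,
    email_address  TEXT,
    phone_number   TEXT,
    position       TEXT
);
CREATE TABLE patient (
    patient_id   INTEGER PRIMARY KEY,
    patient_name TEXT,
    address      TEXT,
    dob          TEXT,
    health_desc  TEXT,
    comments     TEXT
);
CREATE TABLE meal_plan (
    meal_id       INTEGER PRIMARY KEY,
    meal_category TEXT,
    meal_name     TEXT,
    meal_desc     TEXT,
    meal_outcome  TEXT
);
CREATE TABLE exercise_plan (
    exercise_id       INTEGER PRIMARY KEY,
    exercise_category TEXT,
    exercise_name     TEXT,
    exercise_desc     TEXT,
    exercise_outcome  TEXT
);
""")
conn.close()
```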
8 System Preliminary Class Diagram
Object-Oriented Programming (OOP) provides various forms of system class diagrams, and the preliminary class diagram is one of them. It is an illustration of the relationships and operations defined between classes, produced with the aid of modeling software, e.g., UML [13]. The preliminary class diagram of the system has been modeled as follows. According to the class diagram shown in Fig. 4, the classes are organized in groups which all share common characteristics. In general, a preliminary class diagram consists of a chart in which classes are drawn as boxes, each box containing either two or three rectangles. If a box has three rectangles, the top rectangle specifies the name of the class, the middle rectangle holds the attributes of the class, and the lower rectangle addresses the methods, also known as the operations of the class. However, sometimes a class is drawn with only two rectangles; in this case, the first rectangle contains the name of the class and the last rectangle describes the methods. Furthermore, a class diagram allows the use of arrow lines between the classes.
Fig. 4. Representation of system preliminary class diagram.
These arrow lines connect the boxes and hence define the relationships and associations between the classes.
9 Subsystem Architecture Design Diagram
A subsystem diagram is part of the UML family of diagrams and represents independent behavioral units in the system. The role of the subsystem architecture diagram is to view the large-scale components of the modeled system. It is a way to build an entire system as a hierarchy of subsystems. The following model displays the subsystem architectural diagram of the system. A subsystem diagram normally comprises three modeling items: subsystem packages, subsystem classes, and subsystem interfaces. Based on Fig. 5, the first main package is the system itself. That package has then been divided into two internal subsystem packages, represented as the view layer and the domain layer. Each of these layers contains two independent subsystem classes. Subsequently, each subsystem class realizes one or more subsystem interfaces that define the behavioral operations of the subsystem.
Fig. 5. Subsystem design diagram.
In general, the compartments of the displayed subsystem diagram aim to minimize defects in the system by examining all the necessary content and roles for better system development.
10 System Functional Requirements
The paramount functional requirements of this system are:
• The system shall allow patient registration by the therapist.
• The system shall allow the therapist to review and modify patient information.
• The system shall allow report generation.
• The system shall allow communication between patient and therapist.
• The system shall allow the therapist to search for and review patient information.
• The system shall allow the patient to review their meal program and therapist instructions.
11 System Non-functional Requirements
The short-term requirements of the system are:
• Security requirements.
  o The system shall authenticate each user with a unique identification method.
• Usability requirements.
• Reliability requirements.
• Performance requirements.
• The system shall allow patients to sign in.
• The system shall allow therapists to sign in.
12 System Stakeholders
Based on the system requirements, the stakeholders fall into two categories as follows:
12.1 Internal Stakeholders: Represent Direct Interaction with the System.
• Managers.
• Therapists.
• Staff.
12.2 External Stakeholders: Represent Non-direct Interaction with the System.
1. Patients.
2. Parents.
3. Public.
13 System Information Collecting Techniques
Some of the methods for collecting system information are:
1. Interviewing internal and external stakeholders.
2. Observing activities and processes within the organization.
3. Conducting research online.
4. Distribution and collection of questionnaires.
14 System User Goals
This section presents the tasks of each user according to the system requirements. Each user has been assigned a set of criteria as follows. Table 3 presents the system roles allocated according to each user's perspective. Choosing the right people is considered a primary success factor for every system. This step organizes the data flows between sub-users, which leads to greater system efficiency and fewer errors.
Table 3. System user goals.

User | User goal and resulting use case
Patient | Access health information; Search for a meal program; View and update the current meal program
Therapist | Add/update patient information; Add/update meal program information; Allocate meal program to patients
Parent | View patient information; Track patient information
Department manager | View therapist performance; View patient reports
15 System External, Temporal, and State Events
• External event
  o Patient attempting to login to access system at home.
• Temporal event
  o Reports are generated at the end of every week.
• State event
  o Therapist allocating a meal program to a specific patient.
16 System State Machine Diagram
State machine diagrams are designed to describe the state-dependent behavior of an object being processed in a system. They are mainly designed for objects; however, they can be applied to any entity that has behavior, e.g., use cases, actors, elements, and subsystems [14]. The following model represents the state machine diagram of the system. Figure 6 represents the system state machine diagram. It covers the mechanical performance and state processes of the system's behavior. According to the diagram, a doctor is considered to be an object in the system. The state machine diagram allows the doctor to act differently from one event to another depending on the functions being processed.
Fig. 6. System state machine diagram.
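As a rough illustration of state-dependent behaviour, the sketch below models a doctor object whose response to an event depends on its current state. The state and event names are hypothetical stand-ins for illustration; the actual states are those drawn in Fig. 6.

```python
from enum import Enum, auto


class DoctorState(Enum):
    # Hypothetical states; the real ones are defined in the diagram.
    IDLE = auto()
    REVIEWING_PATIENT = auto()
    UPDATING_PLAN = auto()


class Doctor:
    """Object whose behaviour depends on its current state."""

    def __init__(self) -> None:
        self.state = DoctorState.IDLE

    def select_patient(self) -> None:
        # This event only has an effect when the doctor is idle.
        if self.state is DoctorState.IDLE:
            self.state = DoctorState.REVIEWING_PATIENT

    def update_plan(self) -> None:
        # The same event is only valid in the reviewing state.
        if self.state is DoctorState.REVIEWING_PATIENT:
            self.state = DoctorState.UPDATING_PLAN

    def finish(self) -> None:
        self.state = DoctorState.IDLE
```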
17 System Use Case Description
This section shows the system use case descriptions as represented in Table 4.

Table 4. System use case descriptions.

Use case name: Create patient account
Scenario: Creating a patient account for a new obese patient
Triggering event: A new obese patient wants to start an obesity control program and visits a therapist physically
Brief description: The therapist takes basic details and account information, including contact, address, and billing information
Actors: Therapist, patient
Related use cases: Might be invoked when an existing patient record needs to be modified
Stakeholders: Obesity department, hospital management
Pre-conditions: Patient registration sub-system must be available
Post-conditions:
• Patient account must be created and saved
• The patient address must be created and saved
• Patient DOB must be created and saved
• Patient health description must be created and saved
The flow of activities (Actor | System):
1. The patient shows a desire to enroll and provides basic information | 1.1 System creates a new patient record
2. The patient provides full names | 2.1 System saves the patient name
3. Patient provides address | 3.1 System saves the patient address
4. Patient provides DOB | 4.1 System saves patient DOB
5. The patient provides a health description | 5.1 System saves patient health description
Exception conditions: 2.1 Incomplete patient names; 3.1 Invalid patient address
18 System Sequence Diagram
A sequence diagram is one of the most fundamental UML diagrams, representing the behavioral aspect of the system under development. It is designed according to a specific use case scenario that encompasses both normal and alternative flows [15]. The role of the sequence diagram is to describe the time order of the messages generated between objects [5]. A complete sequence diagram for the system is represented as follows. According to Fig. 7, in a sequence diagram all the objects are positioned at the top of the diagram along the X-axis. New objects are added to the right of the existing objects as further interactions are initiated. In contrast, messages are placed along the Y-axis, moving downward to show increasing time. Furthermore, objects in the sequence diagram are categorized into three stereotypes: boundary, entity, and control [15]. A boundary, also known as an interface, is the object that deals with the communication between actors and the system; it is defined from the use case scenario. An entity object, on the other hand, describes information handled by the system, usually related to a relevant real-world concept. A control object drives the interaction flow of one use case scenario; it works like glue between boundary and entity objects. In a sequence diagram there can be more than one boundary and entity object, but only one control object is recommended per use case [15].
Fig. 7. System sequence diagram.
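The three stereotypes can be illustrated with a minimal sketch in which a boundary object forwards an actor's request to a control object, which in turn updates an entity object. The class and method names below are illustrative only and are not taken from the system design.

```python
class PatientRecord:
    """Entity object: holds the information managed by the system."""

    def __init__(self) -> None:
        self.data = {}

    def update(self, key: str, value: str) -> None:
        self.data[key] = value


class RegistrationController:
    """Control object: drives the interaction for one use case scenario."""

    def __init__(self, record: PatientRecord) -> None:
        self.record = record

    def register(self, name: str, address: str) -> None:
        self.record.update("name", name)
        self.record.update("address", address)


class RegistrationForm:
    """Boundary object: the interface between the actor and the system."""

    def __init__(self, controller: RegistrationController) -> None:
        self.controller = controller

    def submit(self, name: str, address: str) -> None:
        # Messages flow boundary -> control -> entity, as in Fig. 7.
        self.controller.register(name, address)
```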
19 System Design Component Diagram
In UML, a component diagram is a static kind of diagram. Typically, component diagrams are used to visualize the physical aspects of the system, e.g., executables, files, libraries, and documents, which all reside in a node [16]. A single component diagram cannot represent an entire system; rather, a collection of component diagrams is used to represent it. The following model represents the component diagram of the system. A component diagram is usually assembled after the system has been designed; in particular, when the artifacts of the system are ready, the component diagrams are used for system implementation [16]. According to Fig. 8, the user will be able to reach the system internet server through the internet network.
Fig. 8. System component diagram.
Then, the user will be able to access the system through a secure Virtual Private Cloud (VPC). The role of the VPC is to help organizations protect the confidential details of their customers, who are known to be the users, and their system assets [16]. Furthermore, in a component diagram it is known that the artifacts are the files. Consequently, there are two identified artifacts, represented as the therapist information subsystem and the patient information subsystem. Both of these artifacts are linked to two different dependent databases, which work as an engine to sort, change, or serve the information within the system.
20 System Deployment Diagram
A deployment diagram is another representation of a static diagram, which describes the deployment view of the system. It is usually modeled to give an overview of the hardware components of the system. The component diagram and the deployment diagram are closely related: the component diagram defines the components, and the deployment diagram shows how these components are deployed on the system hardware [17]. The next diagram represents the deployment diagram of the system.
Fig. 9. System deployment diagram.
The deployment diagram is mainly used to control various engineering parameters, e.g., system performance, scalability, maintainability, and portability [17]. A simple deployment diagram has been utilized in Fig. 9, based on the system requirements. This system uses different types of nodes, represented as the portal layout, the available devices such as the application and computer mainframe, and the servers. The system is assumed to be a web-based program, and it will be deployed in a lumped environment to get the benefits of a web server, application server, and database server. All the connections will be processed using the internet network. The control flows from the servers back to the environment in which the software components reside.
21 System Interaction Diagram
An interaction diagram is used to describe the dynamic nature of a system by defining the various types of interaction among the different elements in the system. Sometimes visualizing the interaction can be difficult; however, developers may use different types of models to capture different aspects of the interaction [18]. According to Fig. 10,
the dynamic aspect in this situation is the therapist. In this way, the system flow messages will be described along with the structural organization of the interacting objects. The following representation explains the interaction diagram of the system.
Fig. 10. System interaction diagram.
22 Conclusion
From the above work, it is clear that a complete model of an entire system can be generated using UML applications. With the aid of these promising software tools, OCHIS has been successfully designed. By applying various UML techniques, nearly ten diagrams have been generated and analyzed, and more than three tables along with research definitions have been collected and classified as the current body of evidence supporting the approach of this paper. The results of this study can be used to encourage designers and developers to explore the different ways of modeling an entire system using UML diagrams. The objective of this paper was to analyze and design an entire OCHIS using modern software programs such as UML. The outcomes of this research have been achieved, and they can be represented as suitable tool-chains to fulfil business needs.
References
1. Massago, M., Colanzi, E.: An assessment of tools for UML class diagram modeling: support to adaptation and integration with other tools. In: SBQS 2019: Proceedings of the XVIII Brazilian Symposium on Software Quality, pp. 10–19 (2019). https://doi-org.kulibrary.vdiscovery.org/10.1145/3364641.3364643. Accessed 19 Oct 2020
2. Technicznej, B.: Construction of UML class diagram with model-driven development. Wojskowa Akademia Techniczna, Redakcja Wydawnictw WAT, ul. gen. S. Kaliskiego 2, 00-908 Warszawa, 65(1), 111–129 (2016)
3. Sergievskiy, M., Kirpichnikova, K.: Optimizing UML class diagrams. ITM Web Conf. 18, 03003 (2018)
4. George, R., Samuel, P.: Fixing class design inconsistencies using self-regulating particle swarm optimization. Inform. Softw. Technol. 99, 81–92 (2018). https://doi.org/10.1016/j.infsof.2018.03.005. Accessed 20 Oct 2020
5. OMG: OMG Unified Modeling Language (OMG UML) Version 2.5.1. Object Management Group (2017)
6. Torre, D., Labiche, Y., Genero, M., Elaasar, M.: A systematic identification of consistency rules for UML diagrams. J. Syst. Softw. 144, 121–142 (2018). https://doi.org/10.1016/j.jss.2018.06.029. Accessed 20 Oct 2020
7. Khumalo, N., Mnjama, N.: The effect of ehealth information systems on health information management in hospitals in Bulawayo, Zimbabwe. Int. J. Healthc. Inform. Syst. Inform. 14(2), 17–27 (2019). https://doi.org/10.4018/ijhisi.2019040102. Accessed 20 Oct 2020
8. Phichitchaisopa, N., Naenna, T.: Factors affecting the adoption of healthcare information technology. Exp. Clin. Sci. Int. J. 12, 413–436 (2013)
9. Pletikosa, I., Kowatsch, T., Maass, W.: Health information system for obesity prevention and treatment of children and adolescents. In: Conference: European Conference on Information Systems, Tel Aviv, Israel (2014)
10. Niepostyn, S.: The sufficient criteria for consistent modelling of the use case realization diagrams with a new functional-structure-behavior UML diagram. Przegląd Elektrotechniczny 1(2), 33–37 (2015). https://doi.org/10.15199/48.2015.02.08. Accessed 22 Oct 2020
11. Ahmad, T., Iqbal, J., Ashraf, A., Truscan, D., Porres, I.: Model-based testing using UML activity diagrams: a systematic mapping study. Comput. Sci. Rev. 33, 98–112 (2019). https://doi.org/10.1016/j.cosrev.2019.07.001. Accessed 22 Oct 2020
12. Link, S., Prade, H.: Relational database schema design for uncertain data. Inform. Syst. 84, 88–110 (2019). https://doi.org/10.1016/j.is.2019.04.003. Accessed 24 Oct 2020
13. Faitelson, D., Tyszberowicz, S.: UML diagram refinement (focusing on class - and use case diagrams). In: ICSE 2017: Proceedings of the 39th International Conference on Software Engineering, pp. 735–745 (2017). https://dl-acm-org.kulibrary.vdiscovery.org/doi/10.1109/ICSE.2017.73. Accessed 25 Oct 2020
14. Khan, M.: Representing security specifications in UML state machine diagrams. Procedia Comput. Sci. 56, 453–458 (2015). https://doi.org/10.1016/j.procs.2015.07.235. Accessed 26 Oct 2020
15. Kurniawan, T., Lê, L., Priyambadha, B.: Challenges in developing sequence diagrams (UML). J. Inform. Technol. Comput. Sci. 5(2), 221 (2020). https://doi.org/10.25126/jitecs.202052216. Accessed 27 Oct 2020
16. Ermel, G., Farias, K., Goncales, L., Bischoff, V.: Supporting the composition of UML component diagrams. In: SBSI 2018: Proceedings of the XIV Brazilian Symposium on Information Systems, Article No. 5, 1–9 (2018). https://doi-org.kulibrary.vdiscovery.org/10.1145/3229345.3229404. Accessed 27 Oct 2020
17. Barashev, I.: Translating semantic networks to UML class diagrams. Procedia Comput. Sci. 96, 946–950 (2016). https://doi.org/10.1016/j.procs.2016.08.085. Accessed 28 Oct 2020
18. Mani, P., Prasanna, M.: Test case generation for embedded system software using UML interaction diagram. J. Eng. Sci. Technol. 12(4) (2017). https://doaj.org/article/9bcd7013a5874abdb52c7fc402c86e30. Accessed 28 Oct 2020
An IoT Based Epilepsy Monitoring Model S. A. McHale(B) and E. Pereira Department of Computer Science, Edge Hill University, Ormskirk L39 4QP, UK [email protected]
Abstract. The emerging approach of personalised healthcare is known to be facilitated by the Internet of Things (IoT), and sensor-based IoT devices are in popular demand among healthcare providers due to the constant need for patient monitoring. In epilepsy, the most common and complex patients to deal with are those with multiple strands of epilepsy, and it is these patients that require long-term monitoring assistance. These extremely varied kinds of patients should be monitored precisely according to their key symptoms; hence specific characteristics of each patient should be identified, and medical treatment tailored accordingly. Consequently, paradigms are needed to personalise the information being defined by the condition of these patients, each with their very individual signs and symptoms of epilepsy. Therefore, by focusing upon personalised parameters that make epilepsy patients distinct from each other, this paper proposes an IoT based Epilepsy monitoring model that endorses a more accurate and refined way of remotely monitoring and managing the ‘individual’ patient. Keywords: IoT · Healthcare · Personalisation
1 Introduction
By integrating IoT sensor-based devices deployed remotely and personalised patient data into a combined monitoring framework, a vision of personalisation is realised. This study revealed some irrefutable evidence, derived from patient profile analysis and experimental data, that seizure detection using sensors positioned on different parts of a patient's body ultimately makes an impact on the monitoring of epilepsy, endorsing that modern computer science is providing a timely chance for a more personalised approach to the monitoring and management of epilepsy. The chances of capturing seizure data can be greatly increased if a correctly assigned sensor is placed on the correct part of the patient's body and ultimately, such a concept could ‘enhance the overall monitoring scheme of a patient usually performed by caring persons, who might occasionally miss an epileptic event’ [1]. This paper is organised as follows. In this section (Sect. 1) and Sect. 2, the state of the art is analysed; the complexity of epilepsy together with smart healthcare monitoring approaches is highlighted. There is also a focus upon the sensors available for epilepsy, followed by an emphasis on the limitations for a personalised approach. Section 3 presents the driving questions in this study and describes the experiment and findings
from capturing seizure data. Section 4 introduces the proposed IoT based Epilepsy monitoring model and reveals the PMP (Personalised Monitoring Plan) framework whereby the patient can be matched with the correct device, while Sect. 5 presents how this was evaluated and Sect. 6 outlines long-term use. Concluding remarks are drawn in Sect. 7.
1.1 Motivation
Epileptic seizure monitoring and management is challenging. Most current studies of epileptic seizure detection disclose that drug-resistant epilepsy still lacks an ultimate solution, despite the increase in anti-epileptic drugs [2]. Epilepsy is not a single disease, but a family of syndromes that share the feature of recurring seizures. In some instances, it may be related to a genetic aetiology, or it can occur in association with metabolic disorders, structural abnormalities, infection or brain injury [3]. In the United Kingdom epilepsy affects 3 million people and in the United States it is the 4th most common neurologic disorder; only migraine, stroke, and Alzheimer's disease occur more frequently [4]. There are around 60 different types of seizure and a person may have more than one type. Seizures vary depending on where in the brain they are happening. Some people remain aware throughout, while others can lose consciousness [5]. Aside from their unpredictability, the worst part of having seizures is their utter complexity. The complex nature of epilepsy is noticeable in the variation of seizure types and symptoms between one patient and another. Distinguishing or classifying an individual epilepsy patient makes it difficult to manage and monitor. The negative impact of uncontrolled seizures spreads beyond the individual to affect their family, friends, and society. Chronic anxiety is experienced by the families and friends of people with epilepsy, and many lives are adjusted to ensure the safety of their loved one. Novel approaches to epilepsy treatment are still greatly needed [6]; novel therapies that better manage and monitor seizures, as well as technology, can help to handle the consequences of seizures. Insufficient knowledge about epilepsy, which is a very common disorder, has a great and negative impact on people with epilepsy, their families and communities, and the healthcare systems. There is a need for a better understanding of the disease to make way for new approaches to monitor it. In the modern day of personalised medicine and rapid advancements in IoT, a question that needs addressing is whether epilepsy monitoring can benefit from a personalised approach. Can the IoT have the potential to significantly improve the daily lives of patients whose seizures cannot be controlled by either drugs or surgery [7]?
1.2 Smart Healthcare Monitoring Approaches
In the history of time it is only relatively recently that computers began to assist healthcare monitoring: in the 1950s patients began to be continuously monitored by computerised machines [8] and clinical monitoring was first envisaged in the home [9]. For computer assistance to epilepsy it was not until 1972, in the field of imaging, when computerized
tomography (CT) was invented by the British engineer Godfrey Hounsfield [10], and only in recent decades have specific epilepsy healthcare ‘monitoring systems’ been proposed. Much of this recent growth is due to the advent of current IoT technology, whereby the rise of ‘smart environment’ approaches to healthcare monitoring is witnessed. There are many IoT approaches for the monitoring and management of epilepsy, many of which encompass a network of connected smart devices equipped with sensors, either embedded in clothing or in smart phones, to detect, predict or manage epilepsy. Discoveries disclose how IoT is utilised to support the ever-growing trend of personalised healthcare. These recent ‘smart’ approaches in healthcare demonstrate the trend toward ‘sensor use’ and ‘remote monitoring’.
2 Related Work
Researchers are bounding toward the new generation of smart technology and the IoT (Internet of Things). Novel devices such as smart watches, smart bands and smart clothing are all competing for the ultimate solution. Yet there is limited research which focuses upon the concept of a more holistic, personalised approach to help manage epilepsy. One study highlighted the significance of smart technologies and their potential to identify early indicators of cognitive and physical illness [11], and observed that researchers have argued and predicted that assessing individuals in their ‘everyday environment’ will provide the most ‘valid’ information about everyday functional status [12]. Indeed, there is recent evidence of this, as several IoT platforms to manage and monitor healthcare remotely can be observed. For example, one IoT paradigm comprising Wireless Health Sensors (WHS) permits the continuous monitoring of biometric parameters such as pulse rate, pulmonary functional quality, blood pressure and body temperature [13]. This IoT paradigm is being used to assist predictive analysis via smart healthcare systems by a medical practitioner. Using sensors connected to an Arduino, patient status is tracked; over a Wi-Fi connection, data is collected and transmitted and user requests can be received. This data is shared with doctors through a website, where the doctor can analyse the condition of the patient, provide further details online and inform the patient about future severity well in time [13].
2.1 Sensors for Epilepsy
EEG, an electroencephalogram, is a recording of brain activity. This is the chief gold-standard method used within hospitals to detect and monitor seizures. Several approaches have been reported with the aim of embedding this method in other settings and platforms. Developments in some topics have been published, such as modelling the recorded signals [14, 15] or the design of portable EEG devices to deploy such models. As an alternative and sometimes supplement to EEG there exist many sensors embedded in clothing or worn on the body to obtain bio-signals, such as gyroscopes, accelerometers, pulse rate, temperature sensors, magnetometers, galvanic skin response sensors (GSR),
implanted advisory systems, electromyography, video detection systems, mattress sensors, and audio systems [16]. A large number of apps have been published more recently, especially in the commercial sector, for the detection and management of seizures using either the smartphone sensors or external sensors, for example Epdetec [17] and Myepipal [18], and web logging which facilitates the way a patient records daily information concerning her/his epileptic events, medication, and news, such as My Epilepsy Diary [19] and Epidiary [20]. Another app attracting attention and recently reviewed in the press is the Alert App by Empatica. This app sends caregivers an automated SMS and phone call when it detects unusual patterns that may be associated with a convulsive seizure [21], yet it is only designed to work with the Embrace Smartband by Empatica and can prove expensive for the user [22]. Regrettably, there are few specific sensor detection options for each specific seizure type, and this is an imminent requirement for patients and their carers. Ideally, when choosing a seizure detection device, the patient-specific seizure semiologies should be considered [16]. This highlights the need for a type of monitoring that distinguishes one patient from another and depicts the need for devices to pinpoint the patient-specific signs and symptoms.
2.2 Addressing the Gaps
Despite the focus in the literature on smart healthcare monitoring approaches, there is limited emphasis on embracing a truly personalised approach for epilepsy as previously described. Even though the ‘diversity’ of epilepsy is acknowledged and has been identified in other studies, i.e. by highlighting the importance of distinguishing each ‘seizure type’, there is still a gap to address such parameters. The ‘seizure type’ is just one of many parameters that can distinguish one seizure patient from another. Therefore, these very individual characteristics can be further identified to address the challenge of achieving a truly personalised approach to managing epilepsy. More recently, it has been recognised that devices should specifically take into account the user's seizure types and personal preferences [23]; focus should be shifting not only to the desires of the users, but seizure detection devices should also be able to ‘adapt’ to the patient's characteristics and seizures [23]. It is already becoming known that wearing sensors on the body is starting to be popular, as observed recently in a 2018 study where a great interest was highlighted in the use of wearable technology for epilepsy carers, this being independent of demographic and clinical factors and remarkably outpacing data security and technology usability concerns, thus demonstrating the vital factor of comfortability [24]. Yet, as discovered during a review to select the best sensor for each individual patient, there was limited data on which was the best sensor for each seizure type; this was unfortunate despite an internationally active research effort, signifying the gap in knowledge, again, for understanding the individual epilepsy patient [16].
3 Experiment and Findings This section discusses the experiment that was performed to capture seizure data, obtained from sensors, which are positioned on different parts of the patient’s body.
This was done to test the assumption that it is ‘the individual profile’ that makes the difference in which device to choose. The results from this experiment are used to inform a typical model, or PMP (Personalised Monitoring Plan), discussed in the next section. The actual ‘sensor’ and its ‘position’ (worn by the patient) are significant for epilepsy, and the focus in the experiment was on how patients exhibit behaviour, rather than any actual testing of devices. It was therefore important to choose the most accurate sensors for monitoring epilepsy; those were found to be the accelerometer and heart-rate sensors, although the latest studies suggest making use of other sensors too, such as peripheral temperature, photoplethysmography (blood circulation), respiratory sensors [25], and galvanic sensors (changes in sweat gland activity), among others [26].
3.1 Preliminary Investigations
Numerous studies have previously been conducted with sensors and their use for epilepsy [27, 28]. Since the ‘gold standard’ for epilepsy monitoring is video-EEG monitoring (which takes place within hospitals) [29], the driving questions addressed here were:
• Can the patient be just as accurately monitored at home with an inexpensive, easily obtainable accelerometer and heart-rate sensor-based device?
• Can the individual requirements of the patient be pinpointed?
If so, is it possible that these sensors can be worn at home (a personalised approach) and be just as effective as using EEG monitoring in the hospital setting? From the analysis of the patient data it is clear that a patient profile based on particular characteristics can indicate the position in which the sensor is best placed on the patient's body. Sample patient profiles were selected based upon criteria informed by discussions with clinicians. For example, Patient Profile 1's seizures begin with the right arm suddenly raising; can the sensor therefore be placed upon the right shoulder? Patient Profile 4 has a lot of shaking during their Focal Onset Seizures, with shaking starting on the left arm; can the sensor therefore be useful attached to the left wrist? Patient Profile 5, meanwhile, begins their seizures with severe tremors on the right leg; can the sensors detect movement and heart-rate changes in this position? During the investigation, practicable devices to use in the experiment to monitor epilepsy were analysed. The ‘Fitbit Ionic’ was chosen as the best option since both the heart-rate and accelerometer data can be extracted. This commercial activity device has been used in other studies, most notably recently where data from more than 47,000 Fitbit users in five U.S. states revealed that, with Fitbit use, the state-wide predictions of flu outbreaks were enhanced and accelerated [30]. This use demonstrates the viability and potential suitability of Fitbit as a healthcare device.
3.2 Experiment Description
The objectives of the experiment were to assess the movement from the accelerometer sensor and the pulse from the heart-rate sensor in the detection of epileptic seizures. Participants with confirmed epilepsy are recruited. The non-invasive wrist, leg, knee or arm-worn sensors are used to acquire heart-rate activity and movements. The study
evaluated the movement from the accelerometer sensor and the pulse from the heart-rate sensor in the detection of an epileptic seizure. Over a period of 5 days the patients were asked to wear the device and continue recording seizures in their seizure diary. The study also evaluated any differences in results due to the ‘position’ of the sensor on the body, together with the patients’ acceptability and comfort. The instructions contained daily forms for the patient to complete, hence keeping a diary of the times of seizures; if they did not use this method, an EEG recording was obtained. This way the actual time stamp of the patient's recorded seizure can be checked against the server time stamp observations of the seizure. For example, if the patient records their seizure at 10.20 am and the server readings reveal heart-rate peaks and rapid movement from the accelerometer also at 10.20 am, then this confirms the server readings match the patient's (or the EEG's) known seizure occurrence, see Fig. 1.
Fig. 1. Seizure time stamps
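A minimal sketch of this cross-check is shown below: a diary (or EEG) seizure time is confirmed when the server log contains a reading with a heart-rate peak and accelerometer movement close to that time. The field names, thresholds and time window are assumptions for illustration, not values taken from the study's server.

```python
from datetime import datetime, timedelta


def confirm_seizure(diary_time, server_readings, window_minutes=2,
                    hr_threshold=110, movement_threshold=1.5):
    """Return True if a server reading near diary_time shows both a
    heart-rate peak and accelerometer movement (assumed thresholds)."""
    window = timedelta(minutes=window_minutes)
    for reading in server_readings:
        if abs(reading["timestamp"] - diary_time) <= window:
            if (reading["heart_rate"] >= hr_threshold
                    and reading["movement"] >= movement_threshold):
                return True
    return False


readings = [
    {"timestamp": datetime(2021, 1, 5, 10, 20), "heart_rate": 124, "movement": 2.3},
    {"timestamp": datetime(2021, 1, 5, 11, 0), "heart_rate": 78, "movement": 0.1},
]
print(confirm_seizure(datetime(2021, 1, 5, 10, 20), readings))  # True
```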
3.3 Experiment Results
The heart-rate and accelerometer sensors used to detect characteristics of seizure events can successfully record seizure data, without the need for participant cooperation beyond wearing the sensor-based device; even recharging the battery (battery life is 5 days when fully charged) was not required by the participants. Both sensors detected the ‘shaking’ seizures correctly. Both accelerometer and heart-rate sensors have been used to detect seizures in numerous previous studies [25], but in this study it was found that when used together in one device they did not always work in sync. This is because when the sensors were worn on the non-dominant side and a seizure occurred, only the heart-rate change
was indicated: the accelerometer showed no change. Yet when in the correct position on the body they work in unison as an excellent detection method, demonstrating that body placement, or position, is paramount. For example, one patient's dominant side was the right arm. This means seizures are known to occur on the right. The results from “Observation c HP4” (Fig. 2) can be seen below. During shaking from the right wrist at the recorded time of 00.23 am, during a GTCS seizure, all three measurements on the X, Y and Z axes showed sudden movements and the heart-rate increased to its highest peak at 128. Before the seizure the heart-rate was much lower at 80, then rose rapidly to 90 and up to 128. This suggests both sensors detected the seizure correctly.
Fig. 2. Observation c HP4
Yet the results from “Observation a HP4”, seen below in Fig. 3, indicate that during a GTCS at the recorded time of 12.44 pm the three measurements on the X, Y and Z axes did not show any sudden movement, in fact barely any movement at all, yet the heart-rate increased to its highest peak at 124, in keeping with the typical heart-rate increase measured during a GTCS for HP4. Since the accelerometer was positioned on the left wrist, this reveals the sensor did not detect movement, demonstrating that the sensor was in the wrong position.
Fig. 3. Observation a HP4
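The behaviour seen in Observations a and c suggests a simple rule: flag a seizure only when a heart-rate rise and sudden movement across the X, Y and Z axes occur together. The sketch below illustrates that rule with assumed thresholds; it is not the detection algorithm used in the study.

```python
import math


def detect_seizure(samples, hr_rise=30, accel_jump=1.0):
    """Flag sample times where heart rate jumps by hr_rise or more AND
    the accelerometer vector changes by accel_jump or more (assumed values)."""
    flagged = []
    for prev, curr in zip(samples, samples[1:]):
        hr_delta = curr["hr"] - prev["hr"]
        accel_delta = math.dist(
            (prev["x"], prev["y"], prev["z"]),
            (curr["x"], curr["y"], curr["z"]),
        )
        if hr_delta >= hr_rise and accel_delta >= accel_jump:
            flagged.append(curr["t"])
    return flagged


samples = [
    {"t": "00:22", "hr": 80, "x": 0.0, "y": 0.1, "z": 1.0},
    {"t": "00:23", "hr": 128, "x": 1.8, "y": -1.2, "z": 0.4},
]
print(detect_seizure(samples))  # ["00:23"]
```

With the sensor on the wrong side of the body the acceleration change stays small and the event is not flagged, which mirrors Observation a above.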
Detection of seizures using an everyday sensor-based device, with data transfer to an online database, was successful. The experiment presented evidence that remote monitoring of specific epilepsy patients’ profiles with known characteristics can be
improved. The comfortable sensor-based device with heart-rate and accelerometer provided accurate data and was a more dependable method than a patient’s paper diary.
4 Proposed IoT Based Epilepsy Monitoring Model
The purpose of the IoT based Epilepsy monitoring model is to support a ‘Personalised Monitoring Plan’ framework in collecting data from a variety of potential epilepsy device sensors, and also to provide optimal analysis tools to utilise the sensor data, thus supporting clinicians in monitoring epilepsy patients.
4.1 PMP Framework
This section proposes a Personalised Monitoring Plan (PMP) framework. In the previous section, experiments were performed to capture seizure data obtained from sensors positioned on different parts of the patient's body. The results from this experiment are used to inform a PMP (Personalised Monitoring Plan) which recommends which sensor-based device to use based on the very individual, personal characteristics of a given patient. The proposed ‘Personalised Monitoring Plan’ (PMP) framework is a model which doctors and healthcare professionals (HCPs) can use to assist in identifying which device they should recommend to the individual patient for remote monitoring. The PMP framework integrates two types of ‘personalisation’:
– The patient as the individual (derived from an ontology language)
– Patients in a category (using the K-means Clustering method)
Both these personalisation elements are described below in Sect. 4.2. The third tool of the PMP framework supports the decisions surrounding recommending the correct IoT sensor-based devices. The main purpose is to help HCPs decide which IoT sensors to recommend for monitoring and which position on the patient's body to use. The PMP framework ultimately allows users to provide a description of the ‘seizure condition’ of a single patient or a patient type, and to automatically obtain a PMP adjusted to the patient requirements. The proposed framework consists of two features: the first being ‘Personalisation’ (based on this study) and the second the anticipated ‘Remote Monitoring’, shown in pink and blue respectively in Fig. 4.
4.2 Personalisation Elements
The personalisation contributions are part of the preceding findings in this study. The ontology language was created to support the need of the healthcare process to transmit, re-use and share individual patient profile data related to their seizures. The ontology was achieved by the initial examination of 100 anonymous epilepsy patient medical records. The data was analysed to discover whether the values for each of the attributes differ for each patient.
Fig. 4. PMP framework
In addition, epilepsy ‘terminology’ and existing seizure type classifications/categories were investigated so that an ‘individual’ seizure type patient profile could be formed. A close collaboration with clinicians helped to build a data model fit for real-world adoption inside hospital settings, and thus an ontology was developed to model the concept of the epilepsy patient profile, namely ESO, the ‘Epilepsy Seizure Ontology’. This was a driving force for the PMP Framework and a critical aspect for this concept. In order to make ESO useable for HCPs (Health Care Professionals), the ontology was transformed into a language that is understandable by humans and machines; this was accomplished using XML and the outcome was PPDL (Patient Profile Description Language). The second personalisation element was achieved by using K-means Clustering analysis. Different clustering techniques were initially analysed to find the most appropriate approach for the acquired epilepsy data, and an in-depth focus upon ‘clustering considerations’ was undertaken to confirm validity. The outcome was a set of six distinct ‘clustering’ groups, shown in Table 1; these six cluster groups revealed six completely different categories of patients, each with their distinct seizure-related information. The results revealed, using Clustering Analysis, the distinct groups of epilepsy patients that share similar characteristics. This will enable the health carers to define a ‘type’ of epilepsy patient.
Table 1. Cluster groups

Attribute | Cluster 0 | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5
Seizure type | NMA | Un-classified | FAS | Gelastic | GTCS | MTC
Key sign/symptoms | LOC | LOC | Sensations | None | Myoclonus, bilateral clonus, LOC | Bilateral clonus, LOC
Common sign/symptoms | Urinary/In-continence, None | Cognitive automatism, Automatism | Automatism | Automatism | Sensory cognitive | Automatism
Arm/leg | Leg | Either | Either | Either | Leg | Either
Nocturnal/diurnal | Nocturnal and diurnal | Nocturnal and diurnal | Diurnal | Diurnal | Nocturnal and diurnal | Diurnal
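A minimal sketch of this clustering step is shown below: categorical seizure attributes are one-hot encoded and grouped into six clusters with K-means. The toy records and attribute names are illustrative assumptions, not the anonymised patient profiles used in the study.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy patient profiles with categorical seizure attributes (illustrative only).
records = pd.DataFrame({
    "seizure_type": ["GTCS", "FAS", "Gelastic", "GTCS", "MTC", "NMA"],
    "key_sign": ["LOC", "Sensations", "None", "LOC", "Bilateral clonus", "LOC"],
    "arm_leg": ["Leg", "Either", "Either", "Leg", "Either", "Either"],
    "nocturnal_diurnal": ["Both", "Diurnal", "Diurnal", "Both", "Diurnal", "Both"],
})

# One-hot encode the categorical attributes, then cluster into six groups.
pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), list(records.columns))]
    )),
    ("cluster", KMeans(n_clusters=6, n_init=10, random_state=0)),
])

labels = pipeline.fit_predict(records)
print(labels)  # cluster index assigned to each patient profile
```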
4.3 PMP Framework Loop and Maintenance With this PMP framework, the PPDL (Patient Profile Description Language) can be directly maintained (and extended) by HCP’s. The framework has the flexibility, (as the ontology grows with new seizure related concepts), to deal with the mounting diversity of seizure type patients. Therefore, the PMP framework is somewhat reliant on the integration of new knowledge in the PPDL. As the HCP of the PMP Framework approves the recommendations, the information about the patient and the advice may change and this new information can cause the PMP Framework to continue providing new suggestions to the HCP. This loop will stop either when the framework is not able to provide new recommendations or when the HCP considers that the current condition of the patient is correctly represented by the recommendation. At any time, the PMP for the patient is fundamentally controlled by the HCP who is using the framework. Consequently both the personalised ‘seizure related data’ and the ‘cluster classifier data’ of a patient may evolve as the patient disorder changes, for example when the information about the patient changes in the patient record of that patient or as a result of the application of the PMP Framework to find out new ‘seizure type’ knowledge about the current patient. The datasets are expected to evolve and are continuously stored as part of the record of that patient. 4.4 IoT Based Epilepsy Monitoring Model To achieve the type of monitoring described in the PMP framework, several IoT components can be deployed to retrieve sensor data from the epilepsy patient to be accessed remotely. These components include the integration of the personalisation components described in the PMP framework, those of an internet connection and protocols which form the ‘network layer’, a cloud platform to manage the data analysis and fundamentally the sensor-based devices forming the sensor layer. These components make the ingredients of an IoT solution, proposed in the IoT based Epilepsy monitoring model shown in Fig. 5. The sensor layer has the task of acquiring and sending the data from the different epilepsy devices involved in capturing seizure data, to the proposed cloud platform. The ‘IoT based Epilepsy monitoring model’ proposal in Fig. 5 shows areas on the body where parameters are measured, each area is indicated with a colour matching the parameter. 4.5 Cloud Platform The proposed cloud platform provides all the necessary services for the clinician to manage, process and visualise the seizure data. All the processes that involve the interaction between the personalisation layer and the sensor layer are carried out through the following modules: PMP data management, machine learning module and data analysis & visualisation. All these services are hosted in the cloud and clinicians are able to access them remotely from any location. The data analysis and visualisation module utilises the sensor data while the ‘PMP data management’ module pulls all the patient
Fig. 5. IoT based epilepsy monitoring model
records from the personalisation modules, and here the sensor data results are updated. Visualisation is a requirement for any such system, as it is important for clinicians to be provided with user-friendly GUIs so they can study the seizure data from the epilepsy sensor devices. The machine learning module is also proposed; this is a key aspect for future development, and the idea is that by using algorithms the module will ‘learn’ when a patient is about to have a seizure and warn them in advance. Pre-processing hardware and a platform are needed to communicate and transmit the sensor data, which is collected using wearable sensors positioned on a patient's body. The Microsoft Azure IoT platform [31] is proposed, since this cloud computing server is trusted and safe [32].
4.6 Network Layer
There are several ways the sensors can connect and send data to the cloud platform, and since most of the sensor devices connect to a mobile phone they are served by Bluetooth or Bluetooth Low Energy (BLE) and use very little power. Nevertheless, each sensor-based device is provided with its own protocol and connectivity options; hence the type of IoT connectivity is determined generally by the distance that the data must travel, either short-range or long-range [33]. IoT platforms such as Azure use gateways to connect IoT devices to the cloud. The data collected from the devices moves through this gateway, gets pre-processed using in-built modules (Edge) and then gets sent to the cloud. Data is protected by an additional layer of security provided by the Azure Application gateway, and in addition connection security is enabled as each connected IoT device is given a unique identity key [31].
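A device-side upload could look roughly like the sketch below, in which each reading is posted as JSON to a gateway endpoint under a unique device identity. The endpoint, device identifier and payload fields are placeholders standing in for the chosen platform's own device SDK; they are not Azure-specific APIs.

```python
import json
import time
import urllib.request

# Hypothetical gateway endpoint and device identity (placeholders only).
GATEWAY_URL = "https://example-gateway.invalid/telemetry"
DEVICE_ID = "patient-042-wrist"


def send_reading(heart_rate, accel):
    """Post one sensor reading to the gateway as a JSON payload."""
    payload = json.dumps({
        "device_id": DEVICE_ID,
        "timestamp": time.time(),
        "heart_rate": heart_rate,
        "accelerometer": accel,
    }).encode("utf-8")
    request = urllib.request.Request(
        GATEWAY_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```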
5 Evaluation
An evaluation was performed by taking two different epilepsy patients through the steps in the PMP framework. Two different ‘use case’ scenarios, each with a different patient profile, were tested by revealing their respective inputs and outputs. The results helped determine the effectiveness of the PMP framework and how it can be used as a tool for recommending the IoT device to an individual epilepsy patient. The principal contribution of this study is that, with the prior ‘knowledge’ of individual patient characteristics drawn from the PPDL repository and the ‘Cluster Groups’, together with the supplementary ‘proof of concept’ knowledge obtained in the experiments, each epilepsy patient can be treated distinctly and recommended an appropriate sensor-based device, thus forming a patient-specific, unique PMP (Personalised Monitoring Plan). Hence personalisation can be achieved.
6 Long Term Uses and Applicability in Other Domains
The methods used in this study for ontology development and clustering analysis can be applied to any disease whereby recognised symptoms per patient can be individualised and further put into sub-groups or categories. However, to fully utilise this personalised approach, the PMP framework is particularly applicable to patients who have symptoms that can be monitored with different IoT sensor-based devices and personalised further by wearing the device on different body positions. In the future, the following types of patients could be handled by the proposed PMP framework (shown below together with the latest progressive recommended sensor-based devices):
• Diabetes: i.e. one such recommended device could be the use of ‘flash glucose sensing’: a device which checks blood glucose levels by scanning a sensor worn on the arm (available on the NHS for people with type 1 diabetes) [34].
• Sick Infants: i.e. a recommended device could be the use of a miniaturised, wireless oxygen sensor wearable device the size of a Band-Aid, which would allow babies to be monitored from home and able to leave the hospital [35].
• Rehabilitation: i.e. the recommended device for rehabilitation could be a force-based sensor which can be integrated with footwear to measure the interaction of the body with the ground during walking [36]. Due to the possibility of detecting not only physiological but also movement data, wearable sensors have also acquired increasing importance in the field of rehabilitation [25].
7 Conclusion
The sensors and techniques used in the experiment enable some assurance in long-term remote monitoring. The use of the sensor-based device used in the experiment can reduce the frequency of visits to hospitals and improve the daily management of epilepsy; thus, these sensing techniques have shown that results can be achieved in the measurement of specific epileptic seizures based on observations.
As established through these experiments, timely detection along with known patient characteristics is one of the keys to monitoring epilepsy. The integration of the components and technologies in the framework depicted in Fig. 4 (PMP Framework) aims at providing HCPs dealing with epilepsy patients with an integrated tool that helps them in recommending the correct IoT sensor and its position on the patient's body. These decisions are made at the initial consultation and act as an ‘aid’ in personalising the condition of new incoming patients, and thus refine the predefined ‘patient record’ in order to obtain and validate a ‘Personalised Monitoring Plan’, which is in addition adapted to include the seizure monitoring of the patient during appointments. The PMP Framework is designed to provide patient-empowering support in such a way that the available knowledge is continuously personalised to the condition of the seizure type patient. The IoT based Epilepsy monitoring model has been proposed and can be adopted by the PMP framework in future developments.
References 1. Pediaditis, M., Tsiknakis, M., Kritsotakis, V., Góralczyk, M., Voutoufianakis, S., Vorgia, P.: Exploiting advanced video analysis technologies for a smart home monitoring platform for epileptic patients: Technological and legal preconditions, in Book Exploiting Advanced Video Analysis Technologies for a Smart Home Monitoring Platform for Epileptic Patients: Technological and Legal Preconditions, pp. 202–207 2. Moghim, N., Corne, D.W.: Predicting epileptic seizures in advance. PLoS ONE 9(6), e99334– e99334 (2014) 3. Bernard, S.C., Daniel, H.L.: Epilepsy. N. Engl. J. Med. 349(13), 1257–1266 (2003) 4. Hirtz, D., Thurman, D.J., Gwinn-Hardy, K., Mohamed, M., Chaudhuri, A.R., Zalutsky, R.: How common are the “common” neurologic disorders? Neurol. 68(5), 326–337 (2007) 5. Chen, L., et al.: OMDP: an ontology-based model for diagnosis and treatment of diabetes patients in remote healthcare systems. Int. J. Distrib. Sens. Netw. 15(5), 155014771984711 (2019) 6. Straten, A.F.V., Jobst, B.C.: Future of epilepsy treatment: integration of devices. Future Neurol. 9, 587–599 (2014) 7. Tentori, M., Escobedo, L., Balderas, G.: A smart environment for children with autism. IEEE Pervasive Comput. 14(2), 42–50 (2015) 8. Tamura, T., Chen, W.: Seamless healthcare monitoring. Springer, Berlin (2018) 9. Bonato, P.: Wearable sensors and systems. IEEE Eng. Med. Biol. Mag. 29(3), 25–36 (2010) 10. Magiorkinis, E., Diamantis, A., Sidiropoulou, K., Panteliadis, C.: Highights in the history of epilepsy: the last 200 years. Epilepsy Res. Treat. 2014, 1–13 (2014) 11. Cook, D.J., Schmitter-Edgecombe, M., Dawadi, P.: Analyzing activity behavior and movement in a naturalistic environment using smart home techniques. IEEE J. Biomed. Health Inform. 19(6), 1882–92 (2015) 12. Kane, R.L., Parsons T.D. (eds.) The role of technology in clinical neuropsychology. Oxford University Press (2017) 13. Aski, V.J., Sonawane, S.S., Soni, U.: IoT enabled ubiquitous healthcare data acquisition and monitoring system for personal and medical usage powered by cloud application: an architectural overview. In: Kalita, J., Balas, V.E., Borah, S., Pradhan, R. (eds.) Recent Developments in Machine Learning and Data Analytics. AISC, vol. 740, pp. 1–15. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-1280-9_1
14. Direito, B., Teixeira, C., Ribeiro, B., Castelo-Branco, M., Sales, F., Dourado, A.: Modeling epileptic brain states using EEG spectral analysis and topographic mapping. J. Neurosci. Methods 210(2), 220–229 (2012) 15. Xie, S., Krishnan, S.: Wavelet-based sparse functional linear model with applications to EEGs seizure detection and epilepsy diagnosis. Med. Biol. Eng. Comput. 51(1–2), 49–60 (2013) 16. Ulate-Campos, A., Coughlin, F., Gaínza-Lein, M., Fernández, I.S., Pearl, P.L., Loddenkemper, T.: Automated Seizure Detection Systems and Their Effectiveness for Each Type of Seizure. W.B. Saunders Ltd, pp. 88–101 (2016) 17. EpDetect is a mobile phone application. Website available at: http://www.epdetect.com. Last Accessed 15 Jun 2021 18. Marzuki, N.A., Husain, W., Shahiri, A.M.: MyEpiPal: Mobile application for managing, monitoring and predicting epilepsy patient. In: Akagi, M., Nguyen, T.-T., Duc-Thai, V., Phung, T.-N., Huynh, V.-N. (eds.) Advances in Information and Communication Technology, pp. 383–392. Springer International Publishing, Cham (2017). https://doi.org/10.1007/ 978-3-319-49073-1_42 19. Fisher, R.S., Bartfeld, E., Cramer, J.A.: Use of an online epilepsy diary to characterize repetitive seizures. Epilepsy & Behavior 47, 66–71 (2015) 20. Irody, L.: Mobile Patient Diaries: Epidiary (2007). http://www.irody.com/mobile-patient-dia ries/ 21. Rukasha, T., Woolley, S.I., Collins. T.: Wearable epilepsy seizure monitor user interface evaluation: an evaluation of the empatica’embrace’interface. In: Proceedings of the 2020 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2020 ACM International Symposium on Wearable Computers (2020) 22. Empatica Medical-Grade Wearable Patient Monitoring Solutions, Jul. 2020, [online] Available: http://www.empatica.com/en-eu/ 23. Van de Vel, A., et al.: Non-EEG seizure detection systems and potential SUDEP prevention: State of the art: review and update. Seizure 41, 141–153 (2016) 24. Bruno, E., et al.: Wearable technology in epilepsy: the views of patients, caregivers, and healthcare professionals. Epilepsy Behav. 85, 141–149 (2018) 25. Kos, A., Umek, A.: Wearable sensor devices for prevention and rehabilitation in healthcare: Swimming exercise with real-time therapist feedback. IEEE Internet Things J. 6(2), 1331– 1341 (2018) 26. Ghamari, M.: A review on wearable photoplethysmography sensors and their potential future applications in health care. Int. J. Biosens. Bioelectron. 4(4), 195 (2018) 27. Jallon, P., Bonnet, S., Antonakios, M., Guillemaud, R.: Detection System of Motor Epileptic Seizures Through Motion Analysis with 3D Accelerometers. IEEE Computer Society, pp. 2466–2469 (2019) 28. van Elmpt, W.J.C., Nijsen, T.M.E., Griep, P.A.M., Arends, J.B.A.M.: A model of heart rate changes to detect seizures in severe epilepsy. Seizure 15(6), 366–375 (2006) 29. Varela, H.L., Taylor, D.S., Benbadis, S.R.: Short-term outpatient EEG-video monitoring with induction in a veterans administration population. J. Clin. Neurophysiol. 24(5), 390–391 (2007) 30. Viboud, C., Santillana, M.: Fitbit-informed influenza forecasts. Lancet Digital Health 2(2), e54–e55 (2020) 31. Copeland, M., et al.: Microsoft Azure. Apress, New York, USA (2015) 32. Ko, R.K.L., Lee, B.S., Pearson, S.: Towards achieving accountability, auditability and trust in cloud computing. In: Abraham, A., Mauri, J.L., Buford, J.F., Suzuki, J., Thampi, S.M. (eds.) Advances in Computing and Communications, pp. 432–444. 
Springer Berlin Heidelberg, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22726-4_45
33. IoT Connectivity Options: Comparing Short-, Long-Range Tech https://www.iotworldt oday.com/2018/08/19/iot-connectivity-options-comparing-short-long-range-technologies/. Accessed 15 Jun 2021 34. Tyndall, V., et al.: Marked improvement in HbA 1c following commencement of flash glucose monitoring in people with type 1 diabetes. Diabetologia 62(8), 1349–1356 (2019) 35. Worcester Polytechnic Institute. "Engineers creating miniaturized, wireless oxygen sensor for sick infants: Mobile, wearable device the size of a Band-Aid could allow babies to leave the hospital and be monitored from home." ScienceDaily. ScienceDaily, 14 November 2019. https://www.sciencedaily.com/releases/2019/11/191114154454.htm 36. Porciuncula, F., et al.: Wearable Movement Sensors for Rehabilitation: A Focused Review of Technological and Clinical Advances. Elsevier Inc., pp. S220–S232 (2018)
Biostatistics in Biomedicine and Informatics
Esther Pearson
Lasell University, Auburndale, MA, USA
Abstract. Encounters with Biostatistics, Biomedicine, and Informatics occur in our daily lives due to the gathering of Big Data and its use in medical research. This intensive gathering of data, or big data, allows the use of statistics more than ever before as numbers are calculated, patterns are observed, and data becomes information as it is placed in the context of a problem to be solved. This information has given rise to "Health Informatics", which is a combination of "Health Information Technology" as computing components, "Health Information" as collected data, and "Health Information Management" as the organizing and summarizing of information. The combination of these is used to perform and improve decision-making and value-based delivery of healthcare. Hardware, software, data, procedures, and people are all key components of Health Informatics in the context of healthcare, with Biostatistics harnessing large amounts of health data to accelerate decision-making. This paper looks at the application of statistical methods to solution-finding and decision-making in biological healthcare problems.
Keywords: Biostatistics · Biomedicine · Informatics · Big Data
1 Introduction
The topic of Biostatistics in Biomedicine and Informatics is an area that is growing and receiving more recognition of its importance. With that in mind, the introduction begins with definitions and descriptions of the key components of Biostatistics, Biomedicine, and Informatics in health care:
Biostatistics: the statistical processes and methods applied to the collection, analysis, and interpretation of biological data, and especially data relating to human biology, health, and medicine.
Biomedicine (i.e., medical biology): a branch of medical science that applies biological and physiological principles to clinical practice. Biomedicine can also relate to many other categories in health- and biology-related fields.
Informatics: healthcare professionals use their knowledge of healthcare, information systems, databases and information technology security to gather, store, interpret and manage the massive amount of data generated when care is provided to patients.
The combination of these is used to perform and improve decision-making and value-based delivery of healthcare. Decision-making at its foundation is the perspective of Statistics. In defining Statistics, a simplistic definition is the art and science of gathering,
organizing, and manipulating data through calculations and visual representations in order to make informed and data-supported decisions. Thus, as healthcare explodes with data, or big data, the use of technology becomes more critical. It is not only critical for health-related practitioners; it is critical for everyone. This need for everyone to become more informed is based on the multitude of health-related information that is now available on television and on the Internet. This availability of information has become the impetus for the average person to attempt to self-diagnose their health, provide a self-prognosis, and even perform prescriptive reactions for self-care. So, with that in mind, Biostatistics knowledge can assist in leading to an understanding of healthcare and self-care.
2 Background in Health Informatics
Kaufman (2017) [5] stated in the article "Think You're Seeing More Drug Ads on TV? You Are, and Here's Why" that, while these kinds of ads have been running for 20 years, ever since the Food and Drug Administration approved them, it is not your imagination if you think you are seeing more of them these days. Lots more. This information is not coming to the average person through their physician or healthcare organization; it is being presented to the general public for consumption. The general public has for many years been exposed to over-the-counter medications, but now it is also being exposed to prescriptive medication information and data through the media. These exposures have turned the tables from a physician diagnosing an ailment to the general public diagnosing their own ailments and even recommending to their physician what medication should be prescribed to them. The drug manufacturers, who in the past sent salespeople to physicians' offices with samples and marketing materials, are now supplemented by the media and the general public helping to sell their drugs, or at least piquing the curiosity of the general public to begin questioning their personal health and their physician-prescribed treatments. While Biostatistics literacy is important in this media- and data-intensive global society, knowledge of Health Informatics is also an important literacy. Health Informatics consists of professionals using their knowledge of healthcare, information systems, databases, and information technology security to gather, store, interpret and manage the massive amount of data generated when care is provided to patients. Colopy (2020) [3], in Part 1 of a two-part episode, discusses personalized probabilistic patient monitoring using intelligent computational methods. The discussion comprises the use of Machine Learning (ML) and Artificial Intelligence (AI) to assist healthcare professionals in monitoring patient-collected data over a time series. This on-going monitoring supplements a physician's or healthcare worker's one-time snapshot of a patient, in which decisions are possibly made from a once-a-day blood pressure reading and a once-a-day pulse and oxygen-content reading of a patient's vitals. This snapshot can be misleading because it is episodic, whereas on-going data collection for an individual patient over time would provide a more specific and precise guide for decision-making. The collected data could also, through Machine Learning and Artificial Intelligence, provide calculated values to a physician. The physician, with
the use of ML and AI, would not then be challenged to perform software applications and statistical calculations based on their personal knowledge of both of these; ML and AI would take care of it. Likewise, a physician's time constraints based on patient load would not come into play. Therefore, Healthcare Informatics can provide continuity of analysis and continuity of monitoring of an individual patient, thus providing clinical value. These technology services and others are worth discussion and investigation. One such technology service is Infrastructure as a Service (IaaS). The use of IaaS is one possibility by which a large investment in utilizing machine learning and artificial intelligence could be mitigated. IaaS is offered on demand in a secure environment, such that the services are always available but only taken advantage of as needed. It is comparable to on-demand printing: you only print books as they are purchased; the books are not stock-piled or stored just in case a purchase is made. So, IaaS could be a time saver, money saver, and resource advantage for patient care. It must be said, though, that the use of technology does not and would not trump human cognitive interaction and patient exposure and experience in decision-making. In other words, information gained through the use of technology is just one part of the decision-making puzzle. Technology in the form of Health Informatics gathering and managing data, along with Biostatistics, are tools for the practice of Biomedicine. The use of human analysis for data interpretation and drawing conclusions is still needed. Only now the human analyst can participate in evidence-based practice (EBP), wherein decisions are based on current and relevant patient data. Evidence-based practice is defined as a combination of a practitioner's experience along with best practices. This combination is carried out in what Woodside (2018) [11] lists as a five-step process: 1) asking a searchable question, 2) searching for evidence, 3) critically examining the evidence, 4) changing practice as needed based on evidence, and 5) evaluating the effectiveness of the change (Hebda and Czar, 2013) [4]. The five-step process is listed with the thought that it could be carried out by a practitioner, but consideration must be given to the recognition that Machine Learning and Artificial Intelligence could carry out steps 1, 2, 3, and 5, if not all the steps, as part of a Decision Support System (DSS). The practitioner's role would then be to reject or fail to reject the conclusions drawn by the DSS technology. This could be done with the assistance of a Biostatistician. Thus, the human personal touch is not lost while providing speed, repetition, and recognition of patterns and practice. Machine Learning (ML) and Artificial Intelligence (AI) were intertwined beginning in the 1950s. Today that interaction is known as Anamatics, or Health Anamatics in the healthcare industry, wherein the intersection of data analytics and health informatics occurs and can be traced back to the history of ML and AI [1]. The electrical circuit model of how neurons work, by neurophysiologist Warren McCulloch and mathematician Walter Pitts, laid down the foundations for ML and AI.
This preceded Alan Turing's [10] explorations in the 1950s of logical frameworks comparing computer operations with an understanding of how humans solve problems and make decisions. Also in the 1950s came breakthroughs by Arthur Samuel, Frank Rosenblatt, and Bernard Widrow and Marcian Hoff in computer program technology for learning and binary pattern recognition. But these were limited until von Neumann's
computer memory architecture was viewed as useful for storing data and instructions in the same memory ("History of Machine Learning," n.d.). These discoveries and research culminate in what can be considered a history of Machine Learning (ML) and Artificial Intelligence (AI) [2]. The history of ML and AI continued with more practical uses being realized at the start of the 21st century. At that time Big Data began to take root everywhere, especially amongst the big-name software technology vendors (e.g., Google, Amazon, Facebook, IBM, and others), and practical uses were discovered, demanding visibility in everyday life.
3 Insights in Biostatistics
Examples of Biostatistics in Biomedicine and Informatics and their connection to decision support include an article by Mitchell, Gerdin, Lindberg, Lovis, Martin-Sanchez, Miller, Shortliffe, and Leong (2011) [8], "50 Years of Informatics Research on Decision Support: What's Next?", which focuses on Decision Support Systems (DSS). DSS technology in general has been around for many decades, with a history that begins in the 1940s and 1950s. Although this technology has longevity, it seems to have only come into practical use by businesses with the onset of Big Data and Data Warehouse technologies. It is in Data Warehouse technologies that many disparate data sources come together and are unified to become a source of practical information that can be used in business intelligence for decision making. But it is too narrow a lens to view the usefulness of DSS only in terms of business decisions. It is also a valuable tool in the arena of Biomedicine, defined as the application of biology and physiology in clinical practice. The formative article published by Ledley and Lusted in 1959 [9] was significant in introducing computational methods as having pioneering potential in patient diagnostics and therapeutics. This thought-leadership provided influence for Biostatistics in Biomedicine and Informatics (Matthews, 2016) [6]. The leadership of physician and pathologist Pierre-Charles-Alexandre Louis in the development of numerical methods in medicine, along with physician Jules Gavarret's advocacy of the use of statistics in medicine, provided the stepping-stones to today's potential of decision support systems in medicine. These, combined with the groundbreaking work of statistician Karl Pearson in the development of modern statistics, all come together from the 17th century to today as pathways for Biostatistics in Biomedicine with the use of Informatics (Mayo, Punchihewa, Emile, and Morrison, 2018) [7].
4 Conclusion
It has long been the desire and public expectation that a turnkey system taking disease symptoms as inputs and producing disease explanations with a diagnosis as output would be possible using computers and software applications (Mitchell et al., 2011) [8]. It is now, with the utilization of statistical foundations such as Biostatistics and the capture of Big Data along with the use of Machine Learning and Artificial Intelligence, that such proposals and insights are on the horizon of practical and economical possibilities.
Matthews (2016) [6] explained, "the history of biostatistics could be viewed as an ongoing dialectic between continuity and change. Although statistical methods are used in current clinical studies, there is still ambivalence towards its application when medical practitioners treat individual patients" ("History of Biostatistics," 2016) [6]. This ambivalence could be viewed as reluctance about the consideration and assistance of non-practitioners, such as statisticians, in decision-making that involves something other than number-crunching. The practitioner appreciates the human touch and individual clinical experience versus the statistician's inferences about populations based on samples. But, with advances in biostatistical processes and methods being applied to the collection, analysis, and interpretation of biological data as it relates to human biology, health, and medicine, the potential for great advances is on the horizon. Biostatistics in Biomedicine and Informatics applications is destined to become a larger part of biological and physiological principles in clinical practice. So, with this in mind, the health informatics professionals' knowledge of healthcare, information systems, databases and information technology security offers a triangulation of knowledge and resources that can be generated when care is provided to individual patients.
References 1. AI in Radiology: History of Machine Learning. https://www.doc.ic.ac.uk/~jce317/historymachine-learning.html (n.d.) 2. Anyoha, R.: The History of Artificial Intelligence. http://sitn.hms.harvard.edu/flash/2017/his tory-artificial-intelligence/ (2017, August 28) 3. Colopy, G.W.: Patient Vital Sign Monitoring with Gaussian Processes. [Video] YouTube. https://www.youtube.com/watch?v=9c5lBUCWAfQ (2020, January 8) 4. Hebda, T., Czar, P.: Handbook of Informatics for Nurses & Healthcare Professionals, 5th edn. Pearson Education, Boston (2013) 5. Kaufman, J.: Think You’re Seeing More Drug Ads on TV? You Are, and Here’s Why. The New York Times. https://www.nytimes.com/2017/12/24/business/media/prescription-drugs-advert ising-tv.html (2017, December 24) 6. Matthews J.R.: History of biostatistics. J. EMWA Med. Writing Stat., 25(3) (2016) 7. Mayo, H., Punchihewa, H., Emile, J., Morrison, J.: History of Machine Learning. https:// www.doc.ic.ac.uk/~jce317/history-machine-learning.html (2018) 8. Mitchell, J.A., et al.: 50 Years of Informatics Research on Decision Support: What’s Next. Methods of Information in Medicine 50(06), 525–535 (2018). https://doi.org/10.3414/ME1106-0004 9. Ledley, R.S., Lusted, L.B.: Reasoning foundations of medical diagnosis; symbolic logic, probability, and value theory aid our understanding of how physicians reason. Science 130(3366), 9–21 (1959) 10. Turing, A.M.: Computing machinery and intelligence. Mind LIX(236), 433–460 (1950). https://doi.org/10.1093/mind/LIX.236.433 11. Woodside, J.M.: Applied Health Analytics and Informatics Using SAS. SAS Institute Inc, Cary, NC (2018)
Neural Network Compression Framework for Fast Model Inference Alexander Kozlov(B) , Ivan Lazarevich, Vasily Shamporov, Nikolay Lyalyushkin, and Yury Gorbachev Intel Corporation, Nizhny Novgorod, Turgeneva Street 30, 603024 Nizhny Novgorod, Russia [email protected] http://intel.com
Abstract. We present a new PyTorch-based framework for neural network compression with fine-tuning named Neural Network Compression Framework (NNCF) (https://github.com/openvinotoolkit/nncf) . It leverages recent advances of various network compression methods and implements some of them, namely quantization, sparsity, filter pruning and binarization. These methods allow producing more hardwarefriendly models that can be efficiently run on general-purpose hardware computation units (CPU, GPU) or specialized deep learning accelerators. We show that the implemented methods and their combinations can be successfully applied to a wide range of architectures and tasks to accelerate inference while preserving the original model’s accuracy. The framework can be used in conjunction with the supplied training samples or as a standalone package that can be seamlessly integrated into the existing training code with minimal adaptations. Keywords: Deep learning · Neural networks · Filter pruning Quantization · Sparsity · Binary neural networks
1 Introduction
Deep neural networks (DNNs) have contributed to the most important breakthroughs in machine learning over the last ten years [12,18,24,28]. State-of-the-art accuracy in the majority of machine learning tasks was improved by introducing DNNs with millions of parameters trained in an end-to-end fashion. However, the usage of such models dramatically affected the performance of algorithms because of the billions of operations required to make accurate predictions. Model analysis [21,23,29] has shown that a majority of DNNs have a high level of redundancy, basically caused by the fact that most networks were designed to achieve the highest possible accuracy for a given task without inference runtime considerations. Model deployment, on the other hand, has a performance/accuracy trade-off as a guiding principle. This observation motivated the development of methods to train more computationally efficient deep
learning (DL) models, so that these models can be used in real-world applications with constrained resources, such as inference on edge devices. Most of these methods can be roughly divided into two categories. The first category contains the neural architecture search (NAS) algorithms [14,25,33], which allow constructing efficient neural networks for a particular dataset and specific hardware used for model inference. The second category of methods aims to improve the performance of existing and usually hand-crafted DL models without much impact on their architecture design. Moreover, as we show in this paper, these methods can be successfully applied to the models obtained using NAS algorithms. One example of such methods is quantization [4,11], which is used to transform the model from floating-point to fixed-point representation and allows using hardware supporting fixed-point arithmetic in an efficient way. The extreme case of quantized networks is binary networks [10,20,30], where the weights and/or activations are represented by one of two available values so that the original convolution/matrix multiplication can be equivalently replaced by XNOR and POPCOUNT operations, leading to a dramatic decrease in inference time on suitable hardware. Another method belonging to this group is based on introducing sparsity into the model weights [13,16,19], which can be further exploited to reduce the data transfer rate at inference time, or bring a performance speed-up via sparse arithmetic given that it is supported by the hardware. In general, any method from the second group can be applied either during or after the training, which adds a further distinction of these methods into post-training methods and methods which are applied in conjunction with fine-tuning. Our framework contains methods that use fine-tuning of the compressed model to minimize accuracy loss.
In short, our contribution is the new Neural Network Compression Framework (NNCF), which has the following important features:
– Support of quantization, binarization, sparsity and filter pruning algorithms with fine-tuning.
– Automatic model graph transformation in PyTorch – the model is wrapped and additional layers are inserted in the model graph.
– Ability to stack compression methods and apply several of them at the same time.
– Training samples for image classification, object detection and semantic segmentation tasks as well as configuration files to compress a range of models.
– Ability to integrate compression-aware training into third-party repositories with minimal modifications of the existing training pipelines, which allows integrating NNCF into large-scale model/pipeline aggregation repositories such as MMDetection [3] or Transformers [27].
– Hardware-accelerated layers for fast model fine-tuning and multi-GPU training support.
– Compatibility with the OpenVINO™ Toolkit [1] for model inference.
It is worth noting that we put an emphasis on production readiness in our work in order to provide a simple but powerful solution for the inference acceleration of neural
networks for various problem domains. Moreover, we should also note that we are not the authors of all the compression methods implemented in the framework. For example, binarization and RB sparsity algorithms described below were taken from third party projects by agreement with their authors and integrated into the framework.
2 Related Work
Currently, there are multiple efforts to bring compression algorithms not only into the research community but towards a wider range of users who are interested in real-world DL applications. Almost all DL frameworks, in one way or another, provide support of compression features. For example, quantizing a model into INT8 precision is now becoming a mainstream approach to accelerate inference with minimum effort. One of the influential works here is [11], which introduced the so-called Quantization-Aware Training (QAT) for TensorFlow. This work highlights problems of algorithmic aspects of uniform quantization for CNNs with fine-tuning, and also proposes an efficient inference pipeline based on the instructions available for specific hardware. The QAT is based on the Fake Quantization operation which, in turn, can be represented by a pair of Quantize/Dequantize operations. The important feature of the proposed software solution is the automatic insertion of the Fake Quantization operations, which makes model optimization more straightforward for the user. However, this approach has significant drawbacks, namely, increased training time and memory consumption. Another concern is that the quantization method of [11] is based on the naive min/max approach and potentially may achieve worse results than the more sophisticated quantization range selection strategies. The latter problem is solved by methods proposed in [4], where quantization parameters are learned using gradient descent. In our framework we use a similar quantization method, along with other quantization schemes, while also providing the ability to automatically insert Fake Quantization operations in the model graph. Another TensorFlow-based Graffitist framework, which also leverages the training of quantization thresholds [22], aims to improve upon the QAT techniques by providing range-precision balancing of the resultant per-tensor quantization parameters via training these jointly with network weights. This scheme is similar to ours but is limited to symmetric quantization, factor-of-2 quantization scales, and only allows for 4/8 bit quantization widths, while our framework imposes no such restrictions to be more flexible to the end users. Furthermore, NNCF does not perform additional network graph transformations during the quantization process, such as batch normalization folding which requires additional computations for each convolutional operation, and therefore is less demanding to memory and computational resources. From the PyTorch-based tools available for model compression, the Neural Network Distiller [32] is the well-known one. It contains an implementation of algorithms of various compression methods, such as quantization, binarization,
216
A. Kozlov et al.
filter pruning, and others. However, this solution mostly focuses on research tasks rather than the application of the methods to real use cases. The most critical drawback of Distiller is the lack of a ready-to-use pipeline from the model compression to the inference on the target hardware. The main feature of existing compression frameworks is usually the ability to quantize the weights and/or activations of the model from 32-bit floating point into lower bit-width representations without sacrificing much of the model's accuracy. However, as is now commonly known [8], deep neural networks can also typically tolerate high levels of sparsity, that is, a large proportion of weights or neurons in the network can be zeroed out without much harm to the model's accuracy. NNCF allows producing compressed models that are both quantized and sparsified. The sparsity algorithms implemented in NNCF constitute non-structured network sparsification approaches, i.e. methods that result in sparse weight matrices of convolutional and fully-connected layers with zeros randomly distributed inside the weight tensors. Another approach is the so-called structured sparsity, which aims to prune away whole neurons or convolutional filters [15]. NNCF implements a set of filter pruning algorithms for convolutional neural networks. The non-structured sparsity algorithms generally range from relatively straightforward magnitude-based weight pruning schemes [8,31] to more complex approaches such as variational and targeted dropout [7,17] and L0 regularization [16].
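As a simple illustration of the magnitude-based, non-structured sparsification mentioned above, the following sketch (written here purely for illustration, not NNCF code) zeroes out a given fraction of the smallest-magnitude weights in a tensor:

import torch

def magnitude_sparsity_mask(weight: torch.Tensor, sparsity_level: float) -> torch.Tensor:
    # Keep the (1 - sparsity_level) fraction of weights with the largest magnitude.
    num_zeros = int(weight.numel() * sparsity_level)
    if num_zeros == 0:
        return torch.ones_like(weight)
    # Threshold is the num_zeros-th smallest absolute value in the tensor.
    threshold = weight.abs().flatten().kthvalue(num_zeros).values
    return (weight.abs() > threshold).to(weight.dtype)

# Usage (hypothetical names): multiply the mask with the weights in the forward pass.
# mask = magnitude_sparsity_mask(conv.weight.data, sparsity_level=0.5)
# sparse_weight = conv.weight * mask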
3 Framework Architecture
NNCF is built on top of the popular PyTorch framework. Conceptually, NNCF consists of an integral core part with a set of compression methods which form the NNCF Python package, and of a set of training samples which demonstrate capabilities of the compression methods implemented in the package on several key machine learning tasks. To achieve the purposes of simulating compression during training, NNCF wraps the regular, base full-precision PyTorch model object into a transparent NNCFNetwork wrapper. Each compression method acts on this wrapper by defining the following basic components:
– Compression Algorithm Builder – the entity that specifies the changes that have to be made to the base model in order to simulate the compression specific to the current algorithm.
– Compression Algorithm Controller – the entity that provides access to the compression algorithm parameters and statistics during training (such as the exact quantization bit width of a certain layer in the model, or the level of sparsity in a certain layer).
– Compression Loss, representing an additional loss function introduced in the compression algorithm to facilitate compression.
– Compression Scheduler, which can be defined to automatically control the parameters of the compression method during the training process, with updates on a per-batch or per-epoch basis without explicitly using the Compression Algorithm Controller.
We assume that potentially any compression method can be implemented using these abstractions. For example, the Regularization-Based (RB) sparsity method implemented in NNCF introduces importance scores for convolutional and fully-connected layer weights which are additional trainable parameters. A weight binary mask based on an importance score threshold is added by a specialization of a Compression Algorithm Builder object acting on the NNCFNetwork object, modifying it in such a manner that during the forward pass the weights of an operation are multiplied by the mask before executing the operation itself. In order to effectively train these additional parameters, RB sparsity method defines an L0 -regularization loss which should be minimized jointly with the main task loss, and also specifies a scheduler to gradually increase the sparsity rate after each training epoch. As mentioned before, one of the important features of the framework is automatic model transformation, i.e. the insertion of the auxiliary layers and operations required for a particular compression algorithm. This requires access to the PyTorch model graph, which is actually not made available by the PyTorch framework. To overcome this problem we patch PyTorch module operations and wrap the basic operators such as torch.nn.functional.conv2d in order to be able to trace their calls during model execution and execute compressionenabling code before and/or after the operator calls. Another important novelty of NNCF is the support of algorithm stacking where the users can build custom compression pipelines by combining several compression methods. An example of that are the models which are trained to be sparse and quantized at the same time to efficiently utilize sparse fixed-point arithmetic of the target hardware. The stacking/mixing feature implemented inside the framework does not require any adaptations from the user’s side. To enable it one only needs to specify the set of compression methods to be applied in the configuration file. Figure 1 shows the common training pipeline for model compression. During the initial step the model is wrapped by the transparent NNCFNetwork wrapper, which keeps the original functionality of the model object unchanged so that it can be further used in the training pipeline as if it had not been modified at all. Next, one or more particular compression algorithm builders are instantiated and applied to the wrapped model. The application step produces one or more compression algorithm controllers (one for each compression algorithm) and also the final wrapped model object with necessary compression-related adjustments in place. The wrapped model can then be fine-tuned on the target dataset using either an original training pipeline, or a slightly modified pipeline in case the user decided to apply an algorithm that specifies an additional Compression Loss to be minimized or to use a Compression Scheduler for automatic compression parameter adjustment during training. The slight modifications comprise, respectively, of a call to the Compression Loss object to compute the value to be added to the main task loss (e.g. a cross-entropy loss in case of classification task), and calls to Compression Scheduler at regular times (for instance, once per training epoch) to signal that another step in adjusting the compression
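To make the operator-wrapping idea concrete, below is a minimal, illustrative sketch of how a functional operator can be patched so that compression-enabling code runs around every call to it. This is not the actual NNCF implementation; the hook contents are placeholders:

import torch
import torch.nn.functional as F

_original_conv2d = F.conv2d

def _traced_conv2d(input, weight, *args, **kwargs):
    # Pre-operator hook: e.g. multiply the weights by a binary sparsity mask
    # or pass them through a fake-quantization operation.
    processed_weight = weight  # placeholder for compression-enabling code
    output = _original_conv2d(input, processed_weight, *args, **kwargs)
    # A post-operator hook could process the produced activations here.
    return output

F.conv2d = _traced_conv2d  # subsequent conv2d calls can now be traced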
algorithm parameters should be taken (such as increasing the sparsity rate). As we show in Appendix 6 any existing training pipeline written with PyTorch can be easily adapted to support model compression using NNCF. After the compressed model is trained we can export it to ONNX format for further usage in the OpenVINOTM [1] inference toolkit.
Fig. 1. A common model compression pipeline with NNCF. The original full-precision model is wrapped and modified to inject compression algorithm functionality into the trained model while leaving the external interface unchanged; this makes it possible to fine-tune the modified model for a pre-defined number of epochs as if it were a regular PyTorch model. Compression algorithm functionality is handled by the compression algorithm builders, and command and control over the specific compression algorithm parameters during training is made available through CompressionAlgorithmController objects.
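Condensing the pipeline of Fig. 1 into code, a compression-aware fine-tuning loop might look like the sketch below. This is an illustration only; it assumes comp_ctrl and compressed_model were obtained from create_compressed_model as shown in the Appendix, and that train_loader, task_criterion, optimizer and num_epochs are already defined by the existing training pipeline:

for epoch in range(num_epochs):
    for images, targets in train_loader:
        optimizer.zero_grad()
        outputs = compressed_model(images)
        # Add the Compression Loss of the applied algorithm(s) to the task loss
        loss = task_criterion(outputs, targets) + comp_ctrl.loss()
        loss.backward()
        optimizer.step()
        comp_ctrl.scheduler.step()        # per-batch compression schedule update
    comp_ctrl.scheduler.epoch_step()      # per-epoch update, e.g. raising the sparsity rate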
4 Compression Methods Overview
In this section, we give an overview of the compression methods implemented in the NNCF framework.
4.1 Quantization
The first and most common DNN compression method is quantization. Our quantization approach combines the ideas of QAT [11] and PACT [4] and is very close to TQT [22]: we train quantization parameters jointly with network weights using the so-called "fake" quantization operations inside the model graph. But in contrast to TQT, NNCF supports symmetric and asymmetric schemes for activations and weights as well as per-channel quantization of weights, which helps quantize even lightweight models produced by NAS, such as EfficientNet-B0. For all supported schemes quantization is represented by the affine mapping of integers q to real numbers r:

q = r/s + z    (1)
where s and z are quantization parameters. The constant s ("scale factor") is a positive real number; z (zero-point) has the same type as the quantized value q and maps to the real value r = 0. The zero-point is used for asymmetric quantization and provides proper handling of zero paddings. For symmetric quantization it is equal to 0.

Symmetric Quantization. During the training we optimize the scale parameter that represents the range [rmin, rmax] of the original signal:

[rmin, rmax] = [−scale · qmin/qmax, scale]

where [qmin, qmax] defines the quantization range. The zero-point is always equal to zero in this case. Quantization ranges for activations and weights are tailored toward the hardware options available in the OpenVINO™ Toolkit (see Table 1). Three point-wise operations are sequentially applied to quantize r to q: scaling, clamping and rounding:

q = ⌊clamp(r/s; qmin, qmax)⌉    (2)
s = scale/qmax
clamp(x; a, b) = min(max(x, a), b)

where ⌊·⌉ denotes the "bankers" rounding operation.

Table 1. Integer quantization ranges for the symmetric mode at different bit-widths.
qmax
bits−1
Weights −2 + 1 2bits−1 − 1 bits−1 −2 2bits−1 − 1 Signed activation 2bits − 1 Unsigned activation 0
Asymmetric Quantization. Unlike symmetric quantization, for asymmetric we optimize boundaries of floating point range (rmin , rmax ) and use zero-point (z) from (1). clamp(r; rmin , rmax ) +z (3) q= s rmax − rmin 2bits − 1 −rmin z= s
s=
In addition we add a constraint to the quantization scheme: floating-point zero should be exactly mapped into an integer within the quantization range. This constraint allows efficient implementation of layers with padding. Therefore we tune the ranges before quantization with the following scheme:

l1 = min(rmin, 0)
h1 = max(rmax, 0)
z = ⌊−l1 · (2^bits − 1)/(h1 − l1)⌉
t = (z − 2^bits + 1)/z
h2 = t · l1
l2 = h1/t

[rmin, rmax] = [l1, h1] if z ∈ {0, 2^bits − 1}; [l1, h2] if h2 − l1 > h1 − l2; [l2, h1] otherwise.

zj = [σ(sj + log u − log(1 − u)) > 0.5], u ∼ U(0, 1), where zj is the stochastic binary gate, zj ∈ {0, 1}. It can be shown that the above formulation is equivalent to zj being sampled from the Bernoulli distribution B(pj) with probability parameter pj = sigmoid(sj). Hence, sj are the trainable parameters which control whether the weight is going to be zeroed out at test time (which is done for pj > 0.5). On each training iteration, the set of binary gate values zj is sampled once from the above distribution and multiplied with the network weights. In the Monte Carlo approximation of the loss function in [16], the mask of binary gates is generally sampled and applied several times per training iteration, but single mask sampling is sufficient in practice (as shown in [16]). The expected L0 loss term was shown to be proportional to the sum of probabilities of the gates zj being non-zero [16], which in our case results in the following expression
Lreg = ( Σ_{j=1}^{|θ|} sigmoid(sj)/|θ| − (1 − SL) )^2    (8)
To make the error loss term (e.g. cross-entropy for classification) differentiable w.r.t. sj, we treat the threshold function t(x) = [x > c] as a straight-through estimator (i.e. dt/dx = 1).

Table 6. Sparsification + quantization results measured in the training framework (PyTorch) – accuracy metrics on the validation set for the original and compressed models. "RB" stands for the regularization-based sparsity method and "Mag." denotes the simple magnitude-based pruning method.

| Model | Dataset | Metric type | FP32 | Compressed |
| ResNet-50 INT8 w/60% of sparsity (RB) | ImageNet | top-1 acc. | 76.13 | 75.2 |
| Inception-v3 INT8 w/60% of sparsity (RB) | ImageNet | top-1 acc. | 77.32 | 76.8 |
| MobileNet-v2 INT8 w/51% of sparsity (RB) | ImageNet | top-1 acc. | 71.8 | 70.9 |
| MobileNet-v2 INT8 w/70% of sparsity (RB) | ImageNet | top-1 acc. | 71.8 | 70.1 |
| SSD300-BN INT8 w/70% of sparsity (Mag.) | VOC07+12 | mAP | 78.28 | 77.94 |
| SSD512-BN INT8 w/70% of sparsity (Mag.) | VOC07+12 | mAP | 80.26 | 80.11 |
| UNet INT8 w/60% of sparsity (Mag.) | CamVid | mIoU | 72.5 | 73.27 |
| UNet INT8 w/60% of sparsity (Mag.) | Mapillary | mIoU | 56.23 | 54.30 |
| ICNet INT8 w/60% of sparsity (Mag.) | CamVid | mIoU | 67.89 | 67.53 |

4.4 Filter Pruning
NNCF also supports structured pruning for convolutional neural networks in the form of filter pruning. The filter pruning algorithm zeroes out the output filters in convolutional layers based on a certain filter importance criterion [26]. NNCF implements three different criteria for filter importance: i) L1-norm, ii) L2-norm and iii) geometric median. The geometric median criterion is based on the finding that if a certain filter is close to the geometric median of all the filters in the convolution, it can be well approximated by a linear combination of other filters, hence it should be removed. To avoid the expensive computation of the geometric median value for all the filters, we use the following approximation for the filter importance metric [9]:

G(Fi) = Σ_{j ∈ {1,…,n}, j ≠ i} ||Fi − Fj||_2    (9)
where Fi is the i-th filter in a convolutional layer with n output filters. That is, filters with the lowest average L2 distance to all the other filters in the convolutional layer are regarded as less important and eventually pruned away. The same proportion of filters is pruned for each layer based on the global pruning
rate value set by the user. We found that the geometric median criterion gives a slightly better accuracy for the same fine-tuning pipeline compared to the magnitude-based pruning approach (see Table 7). We were able to produce a set of ResNet models with 30% of filters pruned in each layer and a top-1 accuracy drop lower than 1% on the ImageNet dataset. The fine-tuning pipeline for the pruned model is determined by the user-configurable pruning scheduler. First, the original model is fine-tuned for a specified number of epochs, then a certain target percentage of filters with the lowest importance scores is pruned (zeroed out). At that point in the training pipeline, the pruned filters might be frozen for the rest of the model fine-tuning procedure (this approach is implemented in the baseline pruning scheduler). An alternative tuning pipeline is implemented by the exponential scheduler, which does not freeze the zeroed-out filters until the final target pruning rate is achieved in the model. The initial pruning rate is set to a low value and is increased each epoch according to an exponentially increasing profile. The subset of the filters to be pruned away in each layer is determined every time the pruning rate is changed by the scheduler.

Table 7. Filter pruning results measured in the training framework – accuracy metrics on the validation set for the original and compressed models. The pruning rate for all the reported models was set to 30%.

| Model | Criterion | Dataset | Metric type | FP32 | Compressed |
| ResNet-50 | Magnitude | ImageNet | top-1 acc. | 76.13 | 75.7 |
| ResNet-50 | Geometric median | ImageNet | top-1 acc. | 76.13 | 75.7 |
| ResNet-34 | Magnitude | ImageNet | top-1 acc. | 73.31 | 75.54 |
| ResNet-34 | Geometric median | ImageNet | top-1 acc. | 73.31 | 72.62 |
| ResNet-18 | Magnitude | ImageNet | top-1 acc. | 69.76 | 68.73 |
| ResNet-18 | Geometric median | ImageNet | top-1 acc. | 69.76 | 68.97 |
Importantly, the pruned filters are removed from the model (not only zeroed out) when it is being exported to the ONNX format, so that the resulting model actually has fewer FLOPS and inference speedup can be achieved. This is done using the following algorithm: first, channel-wise masks from the pruned layers are propagated through the model graph. Each layer has a corresponding function that can compute the output channel-wise masks given layer attributes and input masks (in case mask propagation is possible). After that, the decision on whether to prune a certain layer is made based on whether further operations in the graph can accept the pruned input. Finally, filters are removed according to the computed masks for the operations that can be pruned.
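To make the criterion in Eq. (9) concrete, a straightforward (unoptimized) way to compute it for the output filters of a convolutional layer is sketched below; this is an illustration, not the NNCF implementation:

import torch

def geometric_median_importance(conv_weight: torch.Tensor) -> torch.Tensor:
    # conv_weight has shape (n_out, n_in, kH, kW); one flattened row per output filter.
    filters = conv_weight.reshape(conv_weight.shape[0], -1)
    # Pairwise L2 distances between filters; G(Fi) is the sum over j != i
    # (the diagonal contributes zero, so it can simply be included in the sum).
    distances = torch.cdist(filters, filters, p=2)
    return distances.sum(dim=1)

# Filters with the smallest importance score would be pruned first, e.g. (hypothetical usage):
# scores = geometric_median_importance(conv.weight.data)
# prune_idx = torch.argsort(scores)[: int(0.3 * scores.numel())]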
5 Results
We show the INT8 quantization results for a range of models and tasks including image classification, object detection, semantic segmentation and several Natural Language Processing (NLP) tasks in Table 3. The original Transformer-based models for the NLP tasks were taken from the HuggingFace Transformers repository [27] and NNCF was integrated into the corresponding training pipelines as an external package. Table 8 reports compression results for EfficientNet-B0, which gives the best combination of accuracy and performance on the ImageNet dataset. We compare the accuracy values between the original floating-point model (76.84% top-1) and a compressed one for different compression configurations; a top-1 accuracy drop lower than 0.5% can be observed in most cases. To go beyond the single-precision INT8 quantization, we trained a range of models quantized to lower bit-widths (see Table 4). We were able to train several models (ResNet-50, MobileNet-v2 and SqueezeNet1.1) with approximately half of the model weights (and corresponding input activations) quantized to 4 bits (while the rest of the weights and activations were quantized to 8 bits) and a top-1 accuracy drop of less than 1% on the ImageNet dataset. We also trained a binarized ResNet-18 model according to the pipeline described in Sect. 4.2. Table 5 presents the results of binarizing ResNet-18 with either XNOR or DoReFa weight binarization and scale-threshold activation binarization (see Eq. 6). We also trained a set of convolutional neural network models with both weight sparsity and quantization algorithms in the compression pipeline (see Table 6). To extend the scope of trainable models and to validate that NNCF can be easily combined with existing PyTorch-based training pipelines, we also integrated NNCF with the popular mmdetection object detection toolbox [3]. As a result, we were able to train INT8-quantized and INT8-quantized+sparse object detection models available in mmdetection on the challenging COCO dataset and achieve a less than 1 mAP point drop for the COCO-based mAP evaluation metric. Specific results for compressed RetinaNet and Mask-RCNN models are shown in Table 10.

Table 8. Top-1 accuracy results for INT8 quantization of the EfficientNet-B0 model on ImageNet measured in the training framework.

| Model configuration | Accuracy drop |
| All per-tensor symmetric | 0.75 |
| All per-tensor asymmetric | 0.21 |
| Per-channel weights asymmetric | 0.17 |
| All per-tensor asymmetric w/31% of sparsity | 0.35 |
The compressed models were further exported to ONNX format suitable for inference with the OpenVINO™ toolkit. The performance results for the original and compressed models as measured in OpenVINO™ are shown in Table 9.

Table 9. Relative performance/accuracy results with OpenVINO™ 2020.1 on an Intel® Xeon® Gold 6230 processor.

| Model | Accuracy drop (%) | Speed up |
| MobileNet v2 INT8 | 0.44 | 1.82x |
| ResNet-50 v1 INT8 | −0.34 | 3.05x |
| Inception v3 INT8 | −0.62 | 3.11x |
| SSD-300 INT8 | −0.12 | 3.31x |
| UNet INT8 | −0.5 | 3.14x |
| ResNet-18 XNOR | 7.25 | 2.56x |
Table 10. Validation set metrics for original and compressed mmdetection models. Shown are bounding box mAP values for models trained and tested on the COCO dataset.
| Model | FP32 | Compressed |
| RetinaNet-ResNet50-FPN INT8 | 35.6 | 35.3 |
| RetinaNet-ResNeXt101-64x4d-FPN INT8 | 39.6 | 39.1 |
| RetinaNet-ResNet50-FPN INT8+50% sparsity | 35.6 | 34.7 |
| Mask-RCNN-ResNet50-FPN INT8 | 37.2 | 37.9 |

6 Conclusions
In this work we presented the new NNCF framework for model compression with fine-tuning. It supports various compression methods and allows combining them to obtain more lightweight neural networks. We paid special attention to usability aspects, simplified the compression process setup, and validated the framework on a wide range of models and tasks. Models obtained with NNCF show state-of-the-art results in terms of the accuracy-performance trade-off. The framework is compatible with the OpenVINO™ inference toolkit, which makes it attractive for applying compression in real-world applications. We are constantly working on developing new features and improving the current ones, as well as adding support for new models.
Appendix
Described below are the steps required to modify an existing PyTorch training pipeline in order for it to be integrated with NNCF. The described use case implies there exists a PyTorch pipeline that reproduces model training in floating point precision and a pre-trained model snapshot. The objective of NNCF is to simulate model compression at inference time in order to allow the trainable parameters to adjust to the compressed inference conditions, and then export the compressed version of the model to a format suitable for compressed inference. Once the NNCF package is installed, the user needs to introduce minor changes to the training code to enable model compression. Below are the steps needed to modify the training pipeline code in PyTorch:
– Add the following imports in the beginning of the training sample right after importing PyTorch:

import nncf  # Important - should be imported directly after torch
from nncf import create_compressed_model, NNCFConfig, \
    register_default_init_args
– Once a model instance is created and the pre-trained weights are loaded, the model can be compressed using the helper methods. Some compression algorithms (e.g. quantization) require arguments (e.g. the train loader for your training dataset) to be supplied to the initialize() method at this stage as well, in order to properly initialize compression module parameters related to its compression (e.g. scale values for FakeQuantize layers):

# Instantiate your uncompressed model
from torchvision.models.resnet import resnet50
model = resnet50()

# Load a configuration file to specify compression
nncf_config = NNCFConfig.from_json("resnet50_int8.json")

# Provide data loaders for compression
# algorithm initialization, if necessary
nncf_config = register_default_init_args(nncf_config, train_loader, loss_criterion)

# Apply the specified compression algorithms to the model
comp_ctrl, compressed_model = create_compressed_model(model, nncf_config)
where resnet50_int8.json in this case is a JSON-formatted file containing all the options and hyperparameters of the compression methods (the format of the options is imposed by NNCF).
– At this stage the model can optionally be wrapped with the DataParallel or DistributedDataParallel classes for multi-GPU training. In case distributed training is used, call the compression controller's distributed() method
after wrapping the model with DistributedDataParallel to signal the compression algorithms that special distributed-specific internal handling of compression parameters is required.
– The model can now be trained as a usual torch.nn.Module to fine-tune compression parameters along with the model weights. To completely utilize NNCF functionality, you may introduce the following changes to the training loop code: 1) after model inference is done on the current training iteration, the compression loss should be added to the main task loss, such as the cross-entropy loss:

compression_loss = comp_ctrl.loss()
loss = cross_entropy_loss + compression_loss
2) the compression algorithm schedulers should be made aware of the batch/epoch steps, so add comp_ctrl.scheduler.step() calls after each training batch iteration and comp_ctrl.scheduler.epoch_step() calls after each training epoch iteration.
– When done fine-tuning, export the model to ONNX by calling a compression controller's dedicated method, or to PyTorch's .pth format by using the regular torch.save functionality:

# Export to ONNX or .pth when done fine-tuning
comp_ctrl.export_model("compressed_model.onnx")
torch.save(compressed_model.state_dict(), "compressed_model.pth")
References 1. OpenVINO Toolkit. https://software.intel.com/en-us/openvino-toolkit 2. Avron, H., Toledo, S.: Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM (JACM) 58(2), 1–34 (2011) 3. Chen, K., et al.: MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019) 4. Choi, J., Wang, Z., Venkataramani, S., Chuang, P.I.-J., Srinivasan, V., Gopalakrishnan, K.: Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085 (2018) 5. Dong, Z., Yao, Z., Cai, Y., Arfeen, D., Gholami, A., Mahoney, M.W., Keutzer, K.: Hawq-v2: Hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852 (2019) 6. Gale, T., Elsen, E., Hooker, S.: The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574 (2019) 7. Gomez, A.N., Zhang, I., Swersky, K., Gal, Y., Hinton, G.E.: Learning sparse networks using targeted dropout. arXiv preprint arXiv:1905.13678 (2019) 8. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, pp. 1135–1143 (2015)
9. He, Y., Liu, P., Wang, Z., Hu, Z., Yang, Y.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349 (2019) 10. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Advances in Neural Information Processing Systems, pp. 4107–4115 (2016) 11. Krishnamoorthi, R.: Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342 (2018) 12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097– 1105. Curran Associates, Inc. (2012) 13. Liu, B., Wang, M., Foroosh, H., Tappen, M., Pensky, M.: Sparse convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 806–814 (2015) 14. Liu, C., et al.: Progressive neural architecture search. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34 (2018) 15. Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., Zhang, C.: Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744 (2017) 16. Louizos, C., Welling, M., Kingma, D.P.: Learning sparse neural networks through l 0 regularization. arXiv preprint arXiv:1712.01312 (2017) 17. Molchanov, D., Ashukha, A., Vetrov, D.: Variational dropout sparsifies deep neural networks. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 2498–2507. JMLR.org (2017) 18. van den Oord, A., et al.: WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016) 19. Park, J., et al.: Faster CNNs with direct sparse convolutions and guided pruning. arXiv preprint arXiv:1608.01409 (2016) 20. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-net: Imagenet classification using binary convolutional neural networks. In: European Conference on Computer Vision, pp. 525–542. Springer, Heidelberg (2016). https://doi.org/10. 1007/978-3-319-46493-0 32 21. Rodr´ıguez, P., Gonzalez, J., Cucurull, G., Gonfaus, J.M., Roca, X.: Regularizing CNNs with locally constrained decorrelations. arXiv preprint arXiv:1611.01967 (2016) 22. Wu, M., Jain, S.R., Gural, A., Dick, C.H.: Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks (2019) 23. Shang, W., Sohn, K., Almeida, D., Lee, H.: Understanding and improving convolutional neural networks via concatenated rectified linear units. In: International Conference on Machine Learning, pp. 2217–2225 (2016) 24. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014) 25. Tan, M., et al.: MnasNet: platform-aware neural architecture search for mobile. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828 (2019) 26. Wen, W., Wu, C., Wang, Y., Chen, Y., Li, H.: Learning structured sparsity in deep neural networks. In: Advances in Neural Information Processing Systems, pp. 2074–2082 (2016) 27. Wolf, T., et al.: Huggingface’s transformers: state-of-the-art natural language processing. ArXiv, arXiv-1910, (2019)
28. Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016) 29. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-319-10590-1 53 30. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016) 31. Zhu, M., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 (2017) 32. Zmora, N., Jacob, G., Zlotnik, L., Elharar, B., Novik, G.: Neural network distiller, June 2018 33. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710 (2018)
Small-World Propensity in Developmental Dyslexia After Visual Training Intervention Tihomir Taskov and Juliana Dushanova(B) Institute of Neurobiology, Bulgarian Academy of Sciences, 23 Acad G. Bonchev Str, 1113 Sofia, Bulgaria [email protected]
Abstract. Altered functional networks in children with developmental dyslexia have been found in previous electroencephalographic studies using graph analysis. Whether learning with visual tasks can change the semantic network in this childhood disorder is still unclear. This study of the local and global topological properties of functional networks in multiple EEG frequency bands applies the method of small-world propensity in visual word/pseudo-word processing. The effect of visual training on the brain functional networks of dyslexics compared with controls was also studied. Results indicated that the network topology in dyslexics before the training is more integrated, compared to controls, and after training it becomes more segregated and similar to that of the controls in the theta, alpha, beta, and gamma bands for three graph measures. The pre-training dyslexics exhibited a reduced strength and betweenness centrality of the left anterior temporal and parietal regions in the θ, α, β1, and γ1-frequency bands, compared to the controls. In the brain networks of dyslexics, hubs have not appeared at the left-hemispheric/or both hemispheric temporal and parietal (α-word/γ-pseudoword discrimination), temporal and middle frontal cortex (θ, α-word), parietal and middle frontal cortex (β1-word), parietal and occipitotemporal cortices (θ-pseudoword), identified simultaneously in the networks of normally developing children. After remediation training, the hub distributions in theta, alpha, and beta1-frequency networks in dyslexics became similar to control ones, which more optimal global organization was compared to the less efficient network configuration in dyslexics for the first time. Keywords: EEG · Functional connectivity · Developmental dyslexia · Frequency oscillations
1 Introduction
Developmental dyslexia is a disorder of children with normal intellectual abilities that is related to difficulties in learning reading, writing, and spelling skills. Its early diagnosis [1] can prevent problems in their social and mental development and help to apply appropriate actions to protect their lives from such a negative outcome. Despite extensive research on dyslexia at the behavioral level [1–7], there is no clarity about the reasons for this developmental disorder. In the “double route” model [8], familiar words can be
automatically recognized as members of the lexicon through the lexical route by their visual form and connected with verbal semantic representations [8]. In the sub-lexical route, unfamiliar words are broken down into their constituent letters and correspondent phonemes by grapheme-phoneme correspondence rules [8]. Reading difficulty patterns can be explained by impairment in either of these routes [9]. In phonological dyslexia [10], the children with deficits in phonological skills use the lexical pathway, compensating the sub-lexical pathway [11], as name irregular words but not pseudo-words. In surface dyslexia [13, 14], the children can process pseudo-words but not irregular words through the sub-lexical route, compensating the problems in the lexical route [12]. Various perception deficits of coherent motion [6, 7], speed discrimination [2, 15], direction coding [1, 3], and contrast stimulus sensitivity to low-/high-spatial frequency sinusoidal gratings in noise environment [4, 5] are selectively associated with low accuracy or slow presentation in reading sub-skills [16], with vision problems of letters and their order, orientation and focusing of visual-spatial attention [1, 17]. The inability to decode letters [3, 18, 19] is related to the established deficits to coherent motion perception in the magnocellular pathway, mediated the motion perception and object location [20] through projections to the motion-sensitive visual area and posterior parietal cortex. Visual input to the dorsal pathway, which guides visual attention and eye movements [1], is the magnocellular system. The efficacy of visual intervention efforts [21–26] on the magnocellular system of children with reading difficulties has been proved in coherent motion detection, saccadic eye movements, reading accuracy, and visual errors [12], also in the lexical decision and reading accuracy at it’s higher visual levels [22], detecting progressively faster movements in the coherent motion discrimination. The dyslexic readers also decreased the phonological errors after magnocellular intervention using figure-ground discrimination [24, 25]. The improvement in functioning levels of the dorsal stream leads to less phonological errors, visual timing deficits, and better phonological processing, good reading fluency, attention, and working memory. The localization of abnormally functioning regions and the deterioration of connections with other brain regions in dyslexia is the basis for the creation of new methods for early diagnosis [1]. Children with severe phonological and those with less pronounced phonological deficits rely more on the ventral (occipital-temporal lexical [10]) brain regions and more on the dorsal (occipitoparietal sublexical) pathway, respectively. The dorsal stream has forwards connections through three pathways to the prefrontal and premotor cortex, to the medial temporal lobe with direct and indirect courses through the posterior cingulate and retrosplenial cortices, projected feedback to the visual cortex [27]. The multisynaptic ventral pathway has projections from the striate cortex to the anterior temporal cortex and feedforward from the rostral inferior temporal to the ventral prefrontal cortex.The dyslexics with fewer phonological deficits over recruit the slower attention-dependent sublexical route to compensate partially the deficits in lexical one. Partially distinct neural substrates underlie sublexical and lexical processes as a dorsal and a ventral route, respectively [1, 12]. 
The posterior middle temporal gyrus, left temporal language area, and inferior frontal gyrus are involved in the lexicosemantic route, while the left superior temporal area, supramarginal gyrus, and the opercular part of the inferior frontal gyrus are involved in the sublexical route [28], but controversies about their anatomical substrates still exist [28, 29]. Typically developing maturation [30] and
multisensory development occurring quite late with age [31] can be stimulated by visual rehabilitation [32] mediated by neuroplastic mechanisms and synaptic reorganization among unisensory areas based on learning mechanisms. Not only stimulation of visual and oculomotor functioning [32], but further, the training contributes to the multisensory integration in semantic memory link the semantic content with lexical features [33]. Network integration and segregation are often used to study brain networks in terms of their functioning, characterized by graph measures. Locally processed information within interconnected adjacent brain regions describes their functional segregation ability, whereas combined information from distinct brain regions characterizes their functional integration ability [34]. Node’s set, node’s connections (links, edges), connectivity strength’s calculations and node-node neighboring describe the networks in the Graph theory. Measures as the clustering coefficient, the global efficiency, and the characteristic path length are involved to calculate the small world of the network [35]. The betweennetwork comparisons are hampered by the sensitivity to many parameters as the connection’s number or their density in the network, the node’s number, or their weight’s distribution. These problems are solved by a novel approach called the small-world propensity method. A Graph theoretical metric as a small-world propensity (SWP) provides an assessment of small-world networks structures with different densities, conformed to density’s variations [36], quantifies the degree of small-world structures between the networks, and compares the topological network structure between specific groups by network density effects described in the method section. The small-world propensity method [36] was proposed for the assessment of real unbiased network structures with varying densities to avoid the methodological limitations in the comparison of different brain networks, [37]. At different stages in development, the comparison of small-world structure in different brain networks is limited from statistical disadvantages of the other graph theory methods related to the network-density dependence, neglected connection strengths. The contribution of strong and weak connections in weighted age-related maturation processes [38, 39] is that they help to differentiate the overall function of the network [40]. Recently identified potential biomarkers in pathologies are the weak connections [40] ignored due to commonly applied threshold techniques. Density and connection strength sensitivities are the main advantages of the network statistic as SWP. Different brain regions in a heteromodal semantic network support human semantic knowledge coordinated by semantic hubs as the anterior temporal lobe (ATL), the posterior inferior parietal lobe (especially the angular gyrus, AG), the middle temporal gyrus, and the inferior frontal gyrus (IFG) [41, 42], which roles have been studied by neuro-computational modeling [43] as well as experimentally [42]. In the present study, further, the SWP algorithm, applied to the groups (healthy and dyslexic) will be focused on the differences between their EEG functional connectivity networks. Can screening of developmental dyslexia using graph theoretical methods clarify the neurophysiological mechanisms for the effectiveness of visual training intervention [12, 16, 22, 26]? 
The hypothesis is that the visual training intervention of children with dyslexia can lead to brain functional networks similar to those of typical reading children. Their network changes may be related mainly to the dorsal pathway, which in turn will affect the functioning of the semantic network. The aim is to determine: (1) the difference in the hub distribution in functional neural networks between controls and children with
DD during a visual lexical-semantic task, (2) the reorganization of neural networks in children with DD after remedial training. After a brief overview of the SWP advantages to the other graph approaches in a method section, we describe its application, visual non-verbal training effects of semantic integration/segregation. Valuable insight into the semantic integration stage provides important information for multisensory integration. The clusters of connected regions give also insight into functional brain segregation. This description serves as a motivation for the remainder of the paper, divided into three sections: (i) global SWP metrics for studies of between-groups and within-groups comparisons of the variables; (ii) local SWP characteristics for studies within-participant and between-participant manipulations of the variables; (iii) permutation statistical tests identify the regions important for comparison of the semantic integration in pre-training and post-training dyslexics, as well as in post-training experimental and control groups. A functional brain network is studied during semantically congruent visual stimuli presented in a central visual field. Nodal measurements such as strength and betweenness centrality are analyzed, identifying important network nodes for the integration and transfer of information.
2 Materials and Methods 2.1 Participants Children with developmental dyslexia were involved in a longitudinal study. In a study with visual non-verbal intervention, the dyslexics were followed out over time for the outcome from the training and its contribution to the compensation of this childhood disorder. After study approval by the Ethics Committee of the Institute of Neurobiology, BAS, and signed informed consent of the parents and children following the Helsinki Declaration, reliable electrophysiological (EEG) experiments were conducted in schools for forty-three age-matched children (12 boys and 10 girls) with dyslexia and typically reading children (11 boys and 10 girls; 8–9 years old). The groups were with the same socio-demographical background as dyslexics and there was no report of co-occurring language disorders and dyslexia for the controls. Bulgarian is the first language of the children. They were with right-hand preference [44], higher of 98 non-verbal intelligence scores [45], and corrected-to-normal or/a normal vision. It was applied the next assessment tests: 1) neuropsychological [46]; for developmental dyslexia: a DDE-2 test battery [47, 48]; for phonological awareness [49]; for reading and writing skills: psychometric tests [49]; for nonverbal perception: Girolami-Boulinier’s “Different Oriented Marks” test [50, 51]; and for nonverbal intelligence: Raven’s Progressive Matrices test [45]. Children with reading difficulties were with below-norm of speed or accuracy performance. The norm was defined as one standard deviation from standardized control DDE-2 battery results and “Reading abilities” psychometric results in within-norm of speed and accuracy reading. 2.2 Experimental Paradigm The visual stimuli were presented on a white screen for 800 ms in a pseudo-random sequence of 2/3-syllable words/pseudo-words (5.4 ± 0.7 black letters, Microsoft Sans
Serif font, an angular size ~ 1° of each letter) at a viewing distance of 57 cm and an interstimulus interval of 1.5–2.5 s. The screen resolution and the refresh rate were 1920 × 1080 pixels and 60 Hz. The mean duration of aloud-pronounced words with the same content as the visual stimuli defined its duration on the screen as well as this of the pseudo-word stimuli formed by the replacement of all vowels in the same words. The age-appropriate different speech parts with different frequency use and a balance of their frequency characteristics (low and high frequency, [52]) were selected as nouns (10), verbs (12), adjectives (2), adverbs (8), numerals (2), pronouns (4) and prepositions (2) without phonological and orthographic neighborhoods. The EEG experimental sessions included daily two to four blocks, contained 40 words/40 pseudo-words in each block. The instructions were not to blink during the word/pseudo-word stimuli to reduce the EEG contamination with the blinked artifacts, a button press with a right hand for a word, and a different button press with a left hand for a pseudoword. The percentage of corrected identified words/pseudowords and the reaction time were assessed for the groups. The training with five visual programs was examined by statistical assessment of graph metric changes in the neural semantic network for dyslexic children. For this purpose, an EEG session was recorded during a visual word/pseudoword task one month later after the visual training. An intensive three-month training procedure with an arbitrary order of the visual non-verbal tasks was composed by individual sessions presented twice a week lasting 45 min. The visual non-verbal interventions on the dyslexics included training with five programs, stimulated the magnocellular function (coherent motion, low-spatial frequency double illusion) parvocellular function (high-spatial frequency double illusion), visualspatial memory (cue color square), visual area MT/V5 (low-/high speed discrimination) without including any phonological content in the stimuli. The program for discrimination of up and downwards vertical coherent motion of white dots (a size of 0.1° and a speed of 4.4°/s of each dot) in a random-motion cloud with a 20° eccentricity and time life of 200 ms was presented on a black screen on a distance of 57 cm [6, 7]. The threshold of coherent motion dots was 50% with an interval between changes of the directions 1.5–2.5 s (ISI). The instruction included a left-side button press for the upwards dots motion and a right-side button press for in a downwards coherent motion. Pairs of radial expanded optic flows with 14° eccentricity appeared for 300 ms sequentially one after other on a screen a cue flow with a speed of 4.5°/s and after 500 ms, a second item - with a speed of 5.0°/s or 5.5°/s [53] with an interval between pairs of 1.5 and 3.5 s. The instructions of this training task included speed discrimination and a right-side keypress for a slow speed of the second flow in the pairs or left-side keypress for a higher speed in the pair, comparing to the constant speed of the cue flow. In the magnocellular task, the contrasts of low-spatial frequency sinusoidal gratings (2 cpd) were discriminated with contrast levels ~ 6% and 12% of the contrast threshold, defined in previously psychophysical experiments [17]. In the parvocellular task, the contrast discrimination was between high-spatial frequency sinusoidal gratings (10 cpd) with contrast levels ~ 3% and 6% of the threshold. 
In a center on a grey screen with a fixation cross for both tasks, a pseudo-randomized sequence of gratings (ISI = 1.5– 3.5 s) were presented for 200 ms and vertically flicking with 15 reversals/s in external noise [4, 5] with a visual angle of 2.7° × 2.7° on a distance of 210 cm from the child.
The instructions for both tasks were a left-side button press for low-contrast stimuli and a right-side button press – for high contrast. The visual-spatial task included an array with four color squares (each with size 3° × 3°), which changed the color or preserved the color of a square [54] in a black frame, which remained on the screen during the array presentation, but appeared as a cue 300 ms before the color array. A color square has been appeared in the cue for 200 ms horizontally or vertically in either left or right visual field on a white screen. The instructions were pressed a left-side keypress for two consecutive colors in the cue preserve, or a right-side keypress for different colors in the sequence with ISIs of 1.5–2.5 s. The 80 trials included 40 trials for each condition and task. Previous works described the design of experiments and parameters [17, 26]. The training programs reported the percentage of correct responses and reaction time. 2.3 EEG Recording and Signal Pre-processing The electroencephalogram was recorded by sensors (Brain Rhythm Inc., Taiwan; [55]) with head positions according to the 10–20 system: F3–4, C3–4, T7–8, P3–4, O1–2; Fz, Cz, Pz, Oz, and the 10–10 system: AF3–4, F7–8, FT9–10, FC3–4, FC5–6, C1–2, C5–6, CP1–2, CP3–4, TP7–8, P7–8, PO3–04, PO7–08, as well as on both processi mastoidei and the forehead with a skin impedance < 5 k. The sampling rate of the in-house developed Wi-Fi EEG system was 250 Hz. The band-pass filters for the data were in the frequency ranges: δ = [1.5 ÷ 4]; θ = [4 ÷ 8]; α = [8 ÷ 13]; β1 = [13 ÷ 20]; β2 = [20 ÷ 30]; γ1 = [30 ÷ 48]; γ2 = [52 ÷ 70] Hz, where 50 Hz was removed by a notch filter. The EEG trials were synchronized to stimulus onset and with a duration corresponding to word/pseudoword presentation. Trials, containing artifacts > ±200 μV of the EEG amplitude, were rejected. The rest of the trails were verified for the signal to noise level [56]. The average number was 30 of correct/artifact-free trials with high SNR per subject/condition, whereas the smallest number was 20 trials provided per condition. 2.4 Functional Connectivity The phase Lag index (PLI), characterizing the phase synchronization between a pair of time series, is applied separately for each frequency band to determine the functional connectivity [57, 58]. The asymmetry of the distribution of the instantaneous phase differences of two signals with instantaneous phases, estimated by the Hilbert transformation of each signal, determines the lag of one signal from the other in the range between 0 and 1. A phase difference of two signals, centered at 0 mod π, shows that the value of PLI is 0 and both signals are not phase-locked. When the phase difference is different from 0 mod π, the signals are perfectly locked and the PLI is 1. The independence of the PLI from the signal amplitude and its low sensitivity to brain volume conduction and false correlations due to common sources are the main advantages of the Phase Lag Index [57, 58]. 2.5 Small-World Propensity An adjacency matrix is constructed by a PLI between all pairs of sensors, represented in the rows and columns of the graph, in which all nodes are connected, to determine the
functional network connectivity. The small-world properties of the weighted networks are characterized by the small-world propensity (SWP: φ), comparing the clustering coefficient and characteristic path length with those of regular lattice and random graphs with the same number of nodes and probability power distribution of degrees of all nodes [36]. The deviations (ΔL, ΔC) of the observed characteristics (Lobs, Cobs) from the regular lattice model and the random model determine the SWP: φ = 1 − ((ΔC^2 + ΔL^2)/2)^1/2, with ΔC = (Clatt − Cobs)/(Clatt − Crand), where the observed clustering coefficient is determined by the number of triangles around a node relative to the number of neighbors of all nodes [59, 60], and ΔL = (Lobs − Lrand)/(Llatt − Lrand), where Lobs, related to the segregation of the network [35, 61], is the average minimum number of edges between all pairs of nodes. The appropriate reference random network (Crand, Lrand) and lattice network (Clatt, Llatt) are needed to determine φ in a real brain network. The assumption in these models is that near nodes in brain networks have greater connection strength and stronger edges than distant nodes. In the weighted lattice model, the weights of the observed edges are arranged by assigning the largest weights to the edges with the smallest Euclidean distance between the nodes. The connections with the N highest weights are randomly distributed between the edges in a one-dimensional lattice with a single distance, i.e. the observed edge weights are arranged according to decreasing strength. Among the edges of the lattice, the next edges with the largest weights are distributed in a 2-dimensional lattice. This arrangement continues until the total number of edges of the real network is placed in the lattice. After this process, the near edges have larger strength than the remote edges, and the diagonal of the adjacency matrix carries the largest edge strengths. The lattice network has a short characteristic path length and a large clustering coefficient. The network is rewired at a maximum deviation of the clustering coefficient from the null model. The shortest path between randomly distributed weighted edges between N nodes in the random lattice is small, and the distribution of connections is random throughout the adjacency matrix. The values of the clustering coefficient of the random network are small and those of the characteristic path length large. Due to the lack of locally significant clustering between the nodes in the random network, it is highly integrated and not segregated. The observed network SWP is calculated based on these reference networks. Real-world networks can have clustering coefficients and path lengths exceeding those of the lattice or random networks. In these cases, the measures are set to 1. When the measures are very small, they are set to 0. The φ of a small-world network is around 1 when its characteristics are high but with small deviations from its null models (small ΔC, ΔL). The φ is small when the deviations of the characteristics from the null models are larger; then the network has a less small-world structure with the largest clustering and path length. The φ is a network statistic that quantifies the small-worldness of the network, sensitive to the strength of node connections. The φ also quantifies the degree of small-world structure among networks. Brain neural networks with large φ exhibit small-world properties.
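Once the observed clustering coefficient and path length and their lattice/random null-model counterparts are available, the SWP statistic reduces to a few lines. The Python sketch below only illustrates the formula φ = 1 − √((ΔC² + ΔL²)/2) from [36]; building the weighted lattice and random reference networks is assumed to be done elsewhere (e.g., by the toolbox scripts mentioned below), and the numerical values in the example are made up.

```python
import numpy as np

def small_world_propensity(C_obs, L_obs, C_latt, C_rand, L_latt, L_rand):
    """phi = 1 - sqrt((dC^2 + dL^2) / 2), the small-world propensity [36].

    C_* are clustering coefficients and L_* characteristic path lengths of
    the observed network and of its lattice/random null models built with
    the same number of nodes, edges and weight distribution.
    """
    # Deviation of the observed clustering from the lattice model,
    # scaled by the lattice-random range.
    dC = (C_latt - C_obs) / (C_latt - C_rand)
    # Deviation of the observed path length from the random model.
    dL = (L_obs - L_rand) / (L_latt - L_rand)
    # Ratios falling outside [0, 1] are clipped, as described in the text.
    dC, dL = float(np.clip(dC, 0.0, 1.0)), float(np.clip(dL, 0.0, 1.0))
    phi = 1.0 - np.sqrt((dC ** 2 + dL ** 2) / 2.0)
    return phi, dC, dL

# A network whose clustering is close to the lattice null and whose path
# length is close to the random null has phi near 1 (strong small-world).
phi, dC, dL = small_world_propensity(C_obs=0.42, L_obs=2.1,
                                     C_latt=0.45, C_rand=0.10,
                                     L_latt=4.0, L_rand=2.0)
print(round(phi, 2), round(dC, 2), round(dL, 2))   # ~0.93 0.09 0.05
```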
Networks have a high φ (close to 1) when their small-world characteristics are high
with equally contributed short path length and large clustering (short ΔL, small ΔC). Networks with high φ can have moderate path length and high clustering (moderate ΔL, small ΔC), or low path length and moderate clustering (short ΔL, moderate ΔC). A network with less small-world structure (low φ) has larger deviations of the clustering and path length from the null models [36]. The φ, ΔL, ΔC were estimated by 40 × 40-weighted adjacency matrices in the δ, θ, α, β1, β2, γ1 and γ2 frequency bands for dyslexic and control groups on MATLAB scripts (Brain Connectivity toolbox [36]). The averaged φ, ΔL, ΔC across the trials for each child and frequency band and condition were included in the group comparisons. Local measures of the nodes in the network are strength and betweenness centrality BC [35, 61, 62]. The strength, node BC, edge BC are estimated on Matlab scripts (Brain connectivity toolbox [35]). By converting the weights into distances in the adjacency matrix is defined the betweenness centrality (BC) of the nodes represents the fraction of all shortest paths that pass through a given node in the network. The fraction of all shortest paths, which contains a given edge in the network, defines the BC of the edge. The sum of the weights of connections for a given node defines the strength. Their values are normalized towards the mean local characteristics across all network nodes. An important role in processing information has nodes located on many shortest paths that have high BC or strength in a graph, which is more integrated [60–62]. Hubs are the most important nodes in the network. The hub characteristics are at least one standard deviation above the mean BC/strength. The most important BC edges are at least one standard deviation above the mean BC edge. The figures are presented by BrainNet Viewer version 1.63 [63]. 2.6 Statistical Analysis The behavior parameters (the reaction times and accuracy) were compared between pairs of groups (pre-D and post-D; pre-D and controls; post-D and controls) for word and pseudoword discrimination by nonparametric test (a Kruskal Wallis test, [KW]). The global characteristics φ, ΔL, ΔC between pairs of groups were compared by non-parametric procedure bootstrapping 1000 permutations for all frequency ranges, separately [64, 65]. The significance level was corrected by a Bonferroni adjustment p (α/3) = 0.017 for all statistics. The local node’s measures between pair of groups were also compared for the frequency ranges, separately, by cluster-based non-parametric permutations with multiple comparison corrections, depending on the node-selected threshold [64]. The threshold criteria (one std of mean node strength/BC on a group level) selected the significant nodes by their cluster statistics, which will be localized and later defined as cluster-hubs. The significant clusters were identified by a critical value for the max-cluster statistics with multiple comparison corrections. The cluster indices in the histograms were with brain hemispheric-sensitive median differences. The Bonferroni adjustment of these statistics was p (α/2) = 0.025. All statistic procedures were on MATLAB scripts. The same procedure was applied to the edge BC. The significant hubs/links were presented in the figures.
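To make the local measures concrete, the sketch below flags hub nodes using the criterion stated above (normalized strength or betweenness centrality at least one standard deviation above the network mean). NetworkX is used here as a stand-in for the Brain Connectivity Toolbox MATLAB scripts used in the study, and the weight-to-distance conversion and the random toy PLI matrix are assumptions for demonstration only.

```python
import numpy as np
import networkx as nx

def find_hubs(W, labels):
    """Flag hubs: normalized strength or betweenness centrality (BC) at
    least one standard deviation above the mean across all nodes."""
    G = nx.from_numpy_array(W)                   # W: weighted PLI adjacency matrix
    strength = np.array([s for _, s in G.degree(weight="weight")])
    # Convert weights to distances (1/w) so that stronger links count as
    # shorter paths before computing BC.
    dist = {(u, v): 1.0 / d["weight"] for u, v, d in G.edges(data=True)}
    nx.set_edge_attributes(G, dist, "dist")
    bc = np.array(list(nx.betweenness_centrality(G, weight="dist").values()))
    # Normalize towards the mean local characteristic across all nodes.
    strength_n, bc_n = strength / strength.mean(), bc / bc.mean()
    is_hub = (strength_n >= 1 + strength_n.std()) | (bc_n >= 1 + bc_n.std())
    return [lab for lab, h in zip(labels, is_hub) if h]

# Toy example: a random symmetric 40-sensor PLI matrix with values in (0, 1).
rng = np.random.default_rng(0)
W = rng.uniform(0.05, 0.6, size=(40, 40))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
print(find_hubs(W, labels=[f"ch{i}" for i in range(40)]))
```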
3 Results
3.1 Behavioural Parameters
The correct responses and reaction times were compared between pairs of groups for each condition by the KW test (Table 1). In both conditions, the success rate of the dyslexic groups was significantly lower than that of the controls (p < 0.0001). Their performance was also slower than that of the controls (p < 0.0001). The post-training dyslexics improved their achievement compared to the pre-training group in both task conditions (p < 0.05), but their reaction time did not change significantly (p > 0.05).
Table 1. Statistics (p of Kruskal-Wallis Test) of the Behavior Parameters (Correct Responses, Mean ± s.e., %) and Reaction Time (Mean ± s.e., s) for 1st Condition (Word Discrimination) and 2nd Condition (Pseudo-Word Discrimination) between Group Pairs: 1st (Controls vs Pre-Training Dyslexics), 2nd (Controls vs Post-Training Dyslexics), 3rd (Pre-Training vs Post-Training Dyslexics).

| Task discrimination | Controls | Pre-training dyslexics | Post-training dyslexics | 1st p | 2nd p | 3rd p |
| 1st condition, % | 94.7 ± 1.15 | 69.8 ± 2.75 | 78.6 ± 2.99 | | | |
|Q|−1 or ck / Σ_{k=1}^{|Q|} ck > 2|Q|−1, then the budget allocated to query qk in the MWEM algorithm would always be smaller than that in CIPHER. In other words, the selection probability for a query needs to be at least double the average selection probability (1/|Q|) in order for that query to receive more privacy budget in MWEM than the amount of budget it receives in CIPHER. In addition, our own experience from running the MWEM algorithm suggests that choosing the “right” number of iterations T for MWEM can be challenging. T too small is not sufficient to allow the empirical distribution to fully capture the signals summarized in the queries, and T too large would lead to a large amount of noise being injected, as the privacy budget has to be distributed across the T iterations, eventually leading to a useless synthetic data set since each iteration costs privacy. PrivBayes offers benefits on data storage and memory similar to CIPHER, given that PrivBayes is built in the framework of Bayesian networks, which are known for their ability to save considerable amounts of memory over full-dimensional tables if the dependencies in the joint distribution are sparse. On the other hand, PrivBayes starts with model building that costs privacy budget. It is also well known
that the approximate structure learning of a Bayesian network is NP-complete. In addition, Bayesian networks would force attributes in a data set into a causal relationship. Finally, PrivBayes proposes a surrogate function for mutual information, on which the quality of the released data relies, and this requires some effort for efficient computation. In comparison, the underlying analytical and computational techniques for CIPHER are standard and require nothing more than joint probability decomposition and solving linear equations.
2.3 Example: CIPHER for the 3-Variable Case
We illustrate the CIPHER procedure with a simple example. Suppose the original data contain 3 variables (p = 3); denote the 3 variables by V1, V2, V3 with K1, K2 and K3 levels, respectively. Let Q = {T(V1, V2), T(V2, V3), T(V1, V3)}, which contains all the 2-way contingency tables. Therefore, p0 = 2 in Algorithm 1. WLOG, suppose V3 is X0 in Algorithm 1. We first write down the relationships among the probabilities, which are
Pr(V3 | V1) = Σ_{V2} Pr(V3, V2 | V1) = Σ_{V2} Pr(V3 | V1, V2) Pr(V2 | V1),
Pr(V3 | V2) = Σ_{V1} Pr(V3, V1 | V2) = Σ_{V1} Pr(V3 | V1, V2) Pr(V1 | V2).
We now convert the above relationships into the equation set b = Az. Specifically, b = (Pr(V3 | V1) \ Pr(V3 = K3 | V1), and Pr(V3 | V2) \ Pr(V3 = K3 | V2))^T is a known vector of dimension (K1 + K2)(K3 − 1), z = Pr(V3 | V1, V2) \ Pr(V3 = K3 | V1, V2) is of dimension K1 K2 (K3 − 1), and A is a known diagonal matrix with K3 − 1 identical blocks, where each block is a (K1 + K2) × (K1 K2) matrix comprising the coefficients (i.e. Pr(V1 | V2), Pr(V2 | V1), or 0) associated with z. After z is solved from b = Az, the joint distribution Pr(V1, V2, V3) is calculated by z · Pr(V1, V2). The experiments in Sect. 3 contain more complicated applications of CIPHER.
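A minimal sketch of this three-variable case in Python is given below: it forms b and A from the three 2-way tables, solves for z, and recovers Pr(V1, V2, V3). In a real CIPHER run the input tables would already carry Laplace noise, and the small ridge (l2) penalty used here merely stands in for the regularization mentioned in the discussion; the function name, the penalty value, and the toy counts are illustrative assumptions.

```python
import numpy as np

def cipher_3var(T12, T23, T13, lam=1e-6):
    """Reconstruct Pr(V1, V2, V3) from the three 2-way tables of the example.

    T12[i, j] = count(V1=i, V2=j), T23[j, k] = count(V2=j, V3=k),
    T13[i, k] = count(V1=i, V3=k).  A small ridge penalty `lam` keeps the
    system solvable when (sanitized) tables are mutually inconsistent.
    """
    K1, K2 = T12.shape
    K3 = T23.shape[1]
    p12 = T12 / T12.sum()                                  # Pr(V1, V2)
    p2_g1 = T12 / T12.sum(axis=1, keepdims=True)           # Pr(V2 | V1)
    p1_g2 = (T12 / T12.sum(axis=0, keepdims=True)).T       # [j, i] = Pr(V1=i | V2=j)
    p3_g1 = T13 / T13.sum(axis=1, keepdims=True)           # Pr(V3 | V1)
    p3_g2 = T23 / T23.sum(axis=1, keepdims=True)           # Pr(V3 | V2)

    n_z = K1 * K2 * (K3 - 1)
    idx = lambda i, j, k: (i * K2 + j) * (K3 - 1) + k      # position of z_ijk
    A_rows, b = [], []
    for k in range(K3 - 1):
        for i in range(K1):        # Pr(V3=k | V1=i) = sum_j z_ijk Pr(V2=j | V1=i)
            row = np.zeros(n_z)
            for j in range(K2):
                row[idx(i, j, k)] = p2_g1[i, j]
            A_rows.append(row); b.append(p3_g1[i, k])
        for j in range(K2):        # Pr(V3=k | V2=j) = sum_i z_ijk Pr(V1=i | V2=j)
            row = np.zeros(n_z)
            for i in range(K1):
                row[idx(i, j, k)] = p1_g2[j, i]
            A_rows.append(row); b.append(p3_g2[j, k])
    A, b = np.array(A_rows), np.array(b)
    # Ridge-regularized least squares: z = (A'A + lam*I)^{-1} A'b.
    z = np.linalg.solve(A.T @ A + lam * np.eye(n_z), A.T @ b)

    joint = np.zeros((K1, K2, K3))
    for i in range(K1):
        for j in range(K2):
            zk = np.clip(z[[idx(i, j, k) for k in range(K3 - 1)]], 0.0, 1.0)
            joint[i, j, :K3 - 1] = zk * p12[i, j]
            joint[i, j, K3 - 1] = max(0.0, 1.0 - zk.sum()) * p12[i, j]
    return joint / joint.sum()

# Toy demo with three binary attributes and mutually consistent counts.
T12 = np.array([[40.0, 10.0], [20.0, 30.0]])
T23 = np.array([[35.0, 25.0], [25.0, 15.0]])
T13 = np.array([[30.0, 20.0], [30.0, 20.0]])
print(cipher_3var(T12, T23, T13).round(3))
```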
3 Experiments
We run experiments with simulated and real-life data to evaluate CIPHER, and benchmark its performance against MWEM and the FDH sanitization in this paper. Both MWEM and FDH inspired our work, the former conceptually as stated in Sect. 2.2, while the latter motivated us to develop a solution with decreased storage cost for sanitized queries. In addition, more complex algorithms are unlikely to beat a simpler and easier-to-deploy flat algorithm such as FDH when n, ε, or p is large, per conclusions from previous studies [2, 10]. In addition, both MWEM and FDH are straightforward to program and implement. Our goal is to demonstrate in the experiments that CIPHER performs better than MWEM in terms of the utility of sanitized empirical distributions and synthetic data, and delivers non-inferior performance compared to FDH with significantly decreased storage costs. When comparing the utility of synthetic data generated by different procedures, we not only examine the degree to which the original information is
preserved on descriptive statistics such as mean and lq (q > 0) distance, we also examine the information preservation in statistical inferences on population parameters. Toward that end, we propose the SSS assessment. The first S refers to the Sign of a parameter estimate, and the second and third S’ refer to the Statistical Significance of the estimate against the null value in hypothesis testing. Whether the sign and statistical significance in the estimate between the original and synthetic data are consistent leads to 7 possible scenarios (Table 1). Between the best and the worst scenarios, there are 5 other possibilities. II+ and I+ indicate an increase in Type II (false negative) and Type I (false positive) error rates, respectively, from the original to the sanitized inferences, so do II− and I−, but the latter two also involve a sign change from the original to the sanitized inferences.
Table 1. Preservation of Signs and Statistical Significance on an estimated parameter (the SSS assessment)

| Parameter estimate | Best | Neutral | II+ | I+ | II− | I− | Worst |
| matching Signs between non-private and sanitized? | Y | Y or N | Y | Y | N | N | N |
| non-private Statistical Significance? | Y | N | Y | N | Y | N | Y |
| sanitized Statistical Significance? | Y | N | N | Y | N | Y | Y |
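A small helper makes the classification rule behind Table 1 explicit; it mirrors the category definitions given in the text, with “Neutral” covering both sign outcomes when neither estimate is significant. The function name and inputs are illustrative.

```python
def sss_category(orig_est, orig_sig, san_est, san_sig):
    """Assign an (original, sanitized) estimate pair to an SSS category.

    orig_est / san_est: signed point estimates; orig_sig / san_sig: booleans
    for statistical significance against the null value."""
    signs_match = (orig_est >= 0) == (san_est >= 0)
    if not orig_sig and not san_sig:
        return "Neutral"                    # neither inference is significant
    if signs_match:
        if orig_sig and san_sig:
            return "Best"
        return "II+" if orig_sig else "I+"  # significance lost vs. gained
    if orig_sig and san_sig:
        return "Worst"                      # both significant, sign flipped
    return "II-" if orig_sig else "I-"      # sign flip plus a Type II / I error

print(sss_category(1.8, True, 1.2, True))    # Best
print(sss_category(1.8, True, -0.3, False))  # II-
```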
3.1 Experiment 1: Simulated Data
The data were simulated via a sequence of multinomial logistic regression models with four categorical variables and two sample size scenarios at n = 200 and n = 500, respectively. For the FDH sanitization, there are 36 (2 × 2 × 3 × 3) cells in the 4-way marginals. For the CIPHER and MWEM algorithms, we consider query sets Q of different orders: (1) Q3 contains all 3-way marginals, leading to 32 cells (88.9% of the 4-way); (2) Q2 contains all six 2-way marginals, leading to 20 cells (55.6% of the 4-way). Five privacy budget scenarios ε = (e−2, e−1, 1, e1, e2) were examined. To account for the uncertainty of the sanitization and synthesis in the subsequent statistical inferences, m = 5 synthetic data sets were generated. We run 1,000 repetitions for each n and ε scenario to investigate the stability of each method. When examining the utility of the differentially private data, we present the cost-normalized metrics wherever it makes sense. The cost is defined as the number of cells used to generate a differentially private empirical distribution (36 for FDH; 32 for CIPHER 3-way and MWEM 3-way; and 20 for CIPHER 2-way and MWEM 2-way). In the first analysis, the average total variation distance (TVD) between the original and synthetic data sets was calculated for the 3-way, 2-way and 1-way marginals, respectively. Figure 2 presents the results. After the cost normalization, CIPHER 2-way performs the best overall, especially for small ε, and delivers
similar performances as the FDH sanitization at large ε. CIPHER 3-way is inferior to CIPHER 2-way. There is minimal change in the performance of MWEM across ε.
Fig. 2. Cost-normalized total variation distance (Mean ± SD over 1,000 repeats) (top: n = 200; bottom: n = 500)
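As a reference for the utility metric used in this analysis, the sketch below computes the average total variation distance over all marginals of a given order and then applies the cost normalization; the function and argument names are illustrative, and integer-coded categorical attributes are assumed.

```python
import numpy as np
from itertools import combinations

def avg_tvd(orig, synth, order, n_levels):
    """Average total variation distance (TVD) between original and synthetic
    empirical distributions over all `order`-way marginals.

    orig, synth: (n, p) arrays of integer-coded categories;
    n_levels[j]: number of levels of attribute j."""
    p = orig.shape[1]
    tvds = []
    for cols in combinations(range(p), order):
        edges = [np.arange(n_levels[c] + 1) - 0.5 for c in cols]  # one bin per level
        f_o = np.histogramdd(orig[:, list(cols)], bins=edges)[0].ravel()
        f_s = np.histogramdd(synth[:, list(cols)], bins=edges)[0].ravel()
        tvds.append(0.5 * np.abs(f_o / f_o.sum() - f_s / f_s.sum()).sum())
    return float(np.mean(tvds))

# Cost normalization as in Fig. 2: divide by the number of sanitized cells,
# e.g. 36 (FDH), 32 (3-way query set) or 20 (2-way query set).
# tvd_per_cell = avg_tvd(orig, synth, order=2, n_levels=[2, 2, 3, 3]) / 20
```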
In the second analysis, we examine the l∞ error for Q2 and Q3, the results of which are given in Fig. 3 at n = 200 (the findings are similar at n = 500 and are available in the supplementary materials). CIPHER 2-way and the FDH sanitization are similar, with the former slightly better at ε = e−2. CIPHER 3-way is second-tier behind CIPHER 2-way and the FDH sanitization. The performance of MWEM does not seem to live up to the claim that it yields the optimal l∞ error for the set of queries that are fed to the algorithm [9]. This could be due to the fact that T, which is not an easy hyperparameter to tune, was not optimized in a precise way (though roughly, using independent data) in our implementation of MWEM. In the third analysis, we fitted the multinomial logit model with a binary attribute as the outcome and the others as predictors. The inferences from the m = 5 synthetic data sets were combined using the rule in [13]. The bias, root mean square error (RMSE), and coverage probability (CP) of the 95% confidence interval (CI) were determined for each regression coefficient in the model. We present the results at n = 200 and ε = e−2, 1, e2 in Fig. 4, and those at n = 500 and for ε = e−1 and e at n = 200 are listed in the supplementary materials. The observations at n = 500 are consistent with n = 200, and those at ε = e−1 and
Fig. 3. Cost-normalized l∞ (Mean ± SD over 1,000 repeats) at n = 200
e when n = 200 are in between those at e−2 and 1, and between 1 and e2, respectively, in Fig. 4. The 10 parameters from the model are listed on the x-axis. There is not much difference among CIPHER 2-way, CIPHER 3-way, and the FDH sanitization in cost-normalized bias or CP, but CIPHER 2-way delivers better performance in terms of cost-normalized RMSE (smaller) at all ε compared to CIPHER 3-way, and at small ε compared to the FDH sanitization. CIPHER and the FDH sanitization deliver near-nominal CP (95%) across all the examined ε and both n scenarios, while MWEM suffers severe under-coverage on some parameters, especially at small ε. MWEM has the smallest cost-normalized RMSE for ε ≤ 1, but the RMSE values for CIPHER and the FDH sanitization catch up quickly and approach the original values as ε increases.
Fig. 4. Cost-normalized bias and root mean square error (rmse), and coverage probability (un-normalized) at n = 200
these parameters. Overall, the odds are similar for CIPHER 3-way, CIPHER 2-way, and the FDH sanitization, with CIPHER 3-way being the best at small ε. MWEM performs similarly to the other methods at small ε, but does not improve as ε increases.
3.2 Experiment 2: Qualitative Bankruptcy Data
The experiment runs on a real-life qualitative bankruptcy data set. The data were collected to help identify the qualitative risk factors associated with bankruptcy and are downloadable from the UCI Machine Learning repository [5]. The data set contains n = 250 businesses and 7 variables. Though the data set does not contain any identifiers, sensitive information (such as credibility or bankruptcy status) can still be disclosed using the pseudo-identifiers left in the data (such as industrial risk level or competitiveness level), or the data can be linked to other public data to trigger other types of information disclosure. The supplementary materials provide a listing of the attributes in the data.
Fig. 5. The SSS (Signs and Statistical Significance) assessment on the estimated regression coefficients for n = 200 and n = 500, un-normalized for storage cost
Fig. 6. Cost-normalized log(odds of the “Best” category) in the SSS assessment on the estimated regression coefficients
Q employed by the CIPHER and MWEM procedures contains one 4-way marginal, six 3-way marginals, and three 2-way marginals, that were selected based on the domain knowledge, and computational and analytical considerations when solving the linear equations in CIPHER, without referring to the actual values in the data. More details are provided in the supplementary materials on how Q was chosen. The size of Q (the number of cell counts) is 149, which is about 10% of the number of cells counts (1,458 cells) in the FDH sanitization. On the synthetic data generated by the three procedures, we ran a logistic regression model with “Class” as the outcome variable (bankruptcy vs nonbankruptcy) and examined its relationship with the other six qualitative categorical predictors [12]. We applied the SSS assessment to the estimated parameters from the logistic regression and the results are presented in Fig. 7.
Fig. 7. The SSS assessment on the logistic regression coefficients in experiment 2
The figure suggests that all three methods perform well in the sense that the probability that they produce a “bad” estimate (the worst, II−, and I− categories) is close to 0, and the estimates are most likely to land in the “best” or the “neutral” categories. The FDH sanitization has the largest chance to produce estimates in the “best” category for ε ≥ e−1, at a much higher storage cost (∼8-fold higher) than CIPHER and MWEM. MWEM has a slightly better chance (50%) of landing in the “best” category when ε ≤ e−1 but does not improve as ε increases. We also performed an SVM analysis to predict “Class” given the other attributes on a testing data set (a random set of 50 cases from the original). Per Table 2, CIPHER is the obvious winner at ε ≤ 1 with significantly better prediction accuracy than the other two, and its accuracy is roughly constant. FDH is better than CIPHER at ε > e, but with a ∼8-fold increase in the storage cost. The prediction accuracy remains ∼50% for MWEM across all examined ε values, basically not much better than a random guess on the outcome. Table 2. Prediction accuracy (%) on “Class” via SVM in experiment 2
| ε | CIPHER | MWEM | FDH sanitization |
| e−2 | 67.8 | 50.0 | 41.1 |
| e−1 | 64.7 | 51.3 | 55.5 |
| 1 | 68.5 | 51.0 | 63.8 |
| e | 77.8 | 47.2 | 85.7 |
| e2 | 90.3 | 47.3 | 98.8 |
4 Discussion
We propose the CIPHER procedure to generate differentially private empirical distributions from a set of low-order marginals. Once the empirical distributions are obtained, individual-level synthetic data can be generated. The experiment results implies that CIPHER delivers similar or superior performances to the FDH sanitization, especially at low privacy cost after taking into account the storage cost. CIPHER in general delivers significantly better results on all the examined metrics than MWEM. Both the CIPHER and MWEM procedures have multiple sources of errors in addition the sanitation randomness injected to ensure DP. For CIPHER, it is the shrinkage bias brought by the l2 regularization; and for MWEM, it is the numerical errors introduced through the iterative procedure with a hard-to-choose T . The asymptotic version of both CIPHER and MWEM is the FDH sanitization when the low-order marginals set contains only one query – the full-dimensional table. We demonstrated the implementation of CIPHER for categorical data. The procedure also applies to data with numerical attributes, where the input would be a set of low-dimensional histograms. This implies the numerical attributes will need to be cut into bins first before the application of CIPHER. After the sanitized empirical joint distribution is generated, values of the numerical attributes can be uniformly sampled from the sanitized bins. For future work, we plan to investigate the theoretical accuracy for CIPHER using some common utility measures (e.g. l∞ or l1 errors); to apply CIPHER to data of higher dimensions in terms of both p and the number of levels per attribute to examine the scalability of CIPHER; and to compare CIPHER with more methods that may also generate differentially private empirical distributions from a set of low-dimensional statistics, such as PrivBayes and the Fourier transform based method, in both data utility and computational costs. Supplementary Materials. The supplementary materials are posted at https://arxiv.org/abs/1812.05671. The materials contains additional results from experiment 1, more details on the data used in experiment 2 and how Q is chosen, and the mathematical derivation of the linear equations sets Ax = b for the three-variable and four-variable cases. Acknowledgments. The authors would like to thank two anonymous reviewers for their comments and suggestions that helped improve the quality of the manuscript.
References 1. Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., Talwar, K.: Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In: Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 273–282. ACM (2007) 2. Bowen, C.M., Liu, F.: Comparative study of differentially private data synthesis methods. Stat. Sci. 35(2), 280–307 (2020)
3. Chen, R., Xiao, Q., Zhang, Y., Xu, J.: Differentially private high-dimensional data publication via sampling-based inference. In: Proceedings of the 21th ACM SIGKDD, pp. 129–138. ACM (2015) 4. Culnane, C., Rubinstein, B.I.P., Teague, V.: Health data in an open world (2017). arXiv preprint arXiv:1712.05627v1 5. Dheeru, D., Taniskidou, E.F.: UCI machine learning repository (2017) 6. Dwork, C.: Differential privacy: a survey of results. In: Agrawal, M., Du, D., Duan, Z., Li, A. (eds.) TAMC 2008. LNCS, vol. 4978, pp. 1–19. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79228-4 1 7. Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Halevi, S., Rabin, T. (eds.) TCC 2006. LNCS, vol. 3876, pp. 265–284. Springer, Heidelberg (2006). https://doi.org/10.1007/11681878 14 8. Dwork, C., Roth, A.: The algorithmic foundations of differential privacy. Found. Trends® Theor. Comput. Sci. 9(3–4), 211–407 (2014) 9. Hardt, M., Ligett, K., McSherry, F.: A simple and practical algorithm for differentially private data release. In: Advances in Neural Information Processing Systems, pp. 2339–2347 (2012) 10. Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., Zhang, D.: Principled evaluation of differentially private algorithms using dpbench. In: Proceedings of the 2016 International Conference on Management of Data, pp. 139–154. ACM (2016) 11. Hay, M., Rastogi, V., Miklau, G., Suciu, D.: Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endowment 3(1–2), 1021– 1032 (2010) 12. Kim, M.-J., Han, I.: The discovery of experts’ decision rules from qualitative bankruptcy data using genetic algorithms. Expert Syst. Appl. 25(4), 637–646 (2003) 13. Liu, F.: Model-based differential private data synthesis (2016). arXiv preprint arXiv:1606.08052 14. Liu, F.: Generalized gaussian mechanism for differential privacy. IEEE Trans. Knowl. Data Eng. 31(4), 747–756 (2019) 15. Liu, F., Zhao, X., Zhang, G.: Disclosure risk from homogeneity attack in differentially private frequency distribution (2021). arXiv preprint arXiv:2101.00311v2 16. Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., Vilhuber, L.: Privacy: theory meets practice on the map. In: IEEE ICDE IEEE 24th International Conference, pp. 277–286 (2008) 17. McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: 48th Annual IEEE Symposium on Foundations of Computer Science, 2007. FOCS 200, pp. 94–103. IEEE (2007) 18. McSherry, F.D.: Privacy integrated queries: an extensible platform for privacypreserving data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 19–30. ACM (2009) 19. Narayanan, A., Shmatikov, V.: How to break anonymity of the netflix prize dataset (2006). CoRR, abs/cs/0610105 20. Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: IEEE Symposium on Security and Privacy, 2008. SP 2008, pp. 111–125. IEEE (2008) 21. Sweeney, L.: Matching known patients to health records in washington state data (2013). CoRR, abs/1307.1370 22. Tikhonov, A.N.: On the solution of ill-posed problems and the method of regularization. Doklady Akademii Nauk 151(3), 501–504 (1963). Russian Academy of Sciences
23. Tikhonov, A.N., Goncharsky, A., Stepanov, V.V., Yagola, A.G.: Numerical Methods for the Solution of Ill-Posed Problems, vol. 328. Springer, New York (2013). https://doi.org/10.1007/978-94-015-8480-7 24. Tockar, A.: Riding with the stars: Passenger privacy in the nyc taxicab dataset (2014). https://research.neustar.biz/author/atockar/ 25. Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: Privbayes: private data release via bayesian networks. In: Proceedings of the 2014 ACM SIGMOD, SIGMOD 2014, pp. 1423–1434 (2014)
Secondary Use Prevention in Large-Scale Data Lakes Shizra Sultan(B) and Christian D. Jensen Department of Mathematics and Computer Science, Technical University of Denmark, Kgs. Lyngby, Denmark {shsu,cdje}@dtu.dk
Abstract. Large-scale infrastructures acquire data from integrated data lakes and warehouses managed by diverse data owners and controllers, which is then offered to a large variety of users or data processors. This data might contain personal information about individuals, which if not used according to the data collection purposes can lead to secondary use that may result in legal ramifications. The significance of data often increases with different transformations and aggregations when new linkages and correlations are revealed, making it valuable for users. However, with continuous transformation and new emerging data requirements of different users, often it is difficult for controllers to monitor resource usage closely, and collection purposes are overlooked. Hence, in order to limit secondary use in large-scale distributed environments, the collection purposes for the resources need to be preserved for data through the different transformations that they may undergo. We, therefore, propose to record the collection purposes as part of the resource metadata or provenance. This way it can be preserved and maintained through different data changes and can be used as a deciding factor in limiting the exposure of personal information for different users or data processors. This paper offers insight into how collection purposes can be described as a provenance property, and how is it used in an access control mechanism to limit secondary use. Keywords: Purpose · Secondary use · Privacy · Data lakes · Provenance · Access control
1 Introduction
The concept of Data Lakes (DL) embodies collections of data from one or multiple sources (data owners or data controllers) that are kept in a shared repository based on the expectation that they will eventually be utilized in some way. Unlike traditional data warehouses, DL stores data in different forms and formats (raw, semi-structured, and structured) without a priori objectives regarding their usage, and data are not constrained by predefined schemas when stored [2]. This makes DL both valuable and flexible because they offer great possibilities of transforming and aggregating data from several sources in different formats to derive a lot of new information. However, if any of the involved sources contain personal information in any form then it also carries an inherent
risk of data misuse, which may lead to privacy violations like secondary use of data, i.e., data is used for a different purpose than it is actually intended for [4]. Any data or resource that may contain personal information is ideally collected for a certain purpose, i.e., the ‘collection purpose’, which is understood and respected by all the involved entities, such as the Data Controllers (DC), Data Processors (DP), and the Data Subjects (DS). However, due to the data-driven economy, data collection, transformation, and analysis is a continuous and persistent process, which causes data to repeatedly change form. Often data is collected by a DC under certain circumstances at some point in time and is then processed or transformed under different conditions by several different DP per their authorized requirements or ‘access purpose’, thus altering the form of data. Such data aggregation and transformations may hide the original ‘collection purpose’ of the involved resources, which, if not preserved, can potentially lead to secondary use. Data used in any way other than the ‘collection purposes’ is considered a privacy violation, which according to data protection legislation, like GDPR or CCPA, may lead to legal ramifications [6, 7]. Hence, to prevent secondary use of data, a DL needs to preserve the ‘collection purpose’ of the resources (with personal information), because it will otherwise transform the data lake into a data swamp, with lots of valuable data yet missing the information that is required to legally/ethically use it [5]. Generally, the DC collecting such data (with personal information) is also responsible for documenting the associated ‘collection purposes’ (recorded separately from the resource) and then (preferably) instructs the DP about them, i.e., assigns a set of permissions for different resources per ‘collection purpose’ against a set of DP’s requirements or access purpose’. However, in large-scale DL-based infrastructures, there may be multiple DC managing various resources, which are often aggregated to generate new resources. Similarly, the DPs may also be managed by different DC, and they may request any transformed resource from the distributed-DL for their own ‘access purpose’. Hence, different DL-contributing DC rarely have sufficient knowledge about data requirements or ‘access purpose’ of all the existing DP (managed by all the existing DC) and any new DP that may emerge over time. This makes it difficult for any DC to strictly inform the existing DP about all the data changes and to ensure that a resource is only used according to its ‘collection purpose’ [2]. Therefore, if rather than recording ‘collection purpose’ separately from the resource, they are recorded and preserved as part of the resource (i.e. via its metadata), then it will become easy for the DC to inform DP about the changes in data and its ‘collection purposes’. Furthermore, to preserve the ‘collection purpose’ of a data resource through different transformations and aggregations, it is necessary to catalog all the changes that the data goes through since its origin (or inductance in the DL) until it is requested by any DP. Lastly, it is also crucial that the preserved ‘collection purpose’ reflects and accommodates the impact of those transformations in a way that does not render the usefulness of data yet maintains data privacy. 
The notion of DL to store diverse data without predefined schemas makes it vital for data sources to have a comprehensive set of metadata describing the content, characteristics, and origin of the different resources [8, 9]. One such type of metadata is known as data provenance, which records the origin and transformation history of the resource. It usually contains details like the resource’s origin, characteristics, transformation or augmentation changes, compliance rationales, etc. [23]. Hence, if the DL or involved
DC can ensure that every resource preserves a consistent and credible provenance from when the resource was originally stored in the DL and keep it accurate through different transformations, then the resource provenance can also be used to store ‘collection purpose’. This way, the ‘collection purpose’ becomes part of the resource’s metadata and will be available whenever the particular resource is requested for access. The resource’s ‘collection purpose’ may record the necessary directives such as why the data was collected and how it may be used, the legal base supporting it, conditions/requirements necessary for sharing, and disclosure of personal information with different transformations, etc. [6, 7]. Similarly, if any DP wants to access any resource associated with a specific ‘collection purpose’, then it must present a compatible ‘access purpose’, i.e., permissions or authorizations that allow the DP to access resources with a given ‘collection purpose’. In the case of transformed or aggregated resources, a partial ordering for the ‘collection purpose’ of all the involved resources can be created, which will allow these to be compared against the DP’s ‘access purpose’. This way, DC will not be required to actively manage every DP, but the provenance property ‘collection purpose’ can be evaluated against the DP’s ‘access purpose’ in making access control decisions that decide whether the resource should be accessible by the DP or not, and if so, then to what extent [1, 5]. It is also convenient for DL as DC does not know all DPs in advance, and cannot create pre-defined authorizations for all the future DPs. Hence, collection purpose’ recorded as a resource provenance property will eliminate the need for actively creating and managing DP’s authorizations while providing pointers to the resource usage limitation and constraining secondary use. The rest of this paper is organized in two sections: Sect. 2 describes how ‘collection purpose’ can be described as a provenance property, while Sect. 3 introduces an access control framework describing how ‘collection purpose’ as a provenance property can be used to restrict secondary use in DL-based infrastructure. The paper is concluded with the evaluation of the proposed framework with an example of a DL-based smart city.
2 Collection Purpose and Provenance

Any piece of data with personally identifiable information (PII) that can be bound to a natural entity or DS is 'personal information'. Any resource with personal information will also have a 'collection purpose' based on the terms of the agreement (ideally explicit) between the DC and DS about why the personal data is being collected and how it will be used [16]. The 'collection purpose' is usually stored separately from the data itself, e.g., in a legal contract or a privacy policy on a webpage, but it is a determining factor in how much of the personal information is made accessible to a DP. The DC is the regulating authority here, and it is primarily her responsibility to ensure that data is not misused and that the 'collection purpose' is observed. In large-scale or DL-based integrated infrastructures, the significance of data often increases with different transformations, forming new linkages and correlations and making it valuable for various DP [10]. Data undergoes different transformations, expands and takes different forms, and can exist in multiple forms for different purposes. Hence, once data has changed many times against different requests, the 'collection purpose' is often lost, overlooked, or loosely tracked by the DC, and may be misunderstood by the DP, leading to secondary use
[5]. Therefore, we propose to add and preserve the 'collection purpose' as a part of the data provenance (i.e., the resource metadata), so that it can be traced and respected even when the data is transformed, and the DC can always track its usage.
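As a rough illustration of this proposal (not part of any standard provenance vocabulary), the following Python sketch extends the usual agent/activity/entity triple with a 'collection purpose' field carried in the resource metadata; all class and field names here are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

# Minimal, illustrative provenance model: the classic agent/activity/entity
# triple extended with 'collection purpose' as a fourth concept.
@dataclass
class CollectionPurpose:
    name: str                       # e.g. "traffic operation management" (assumed example)
    legal_base: str                 # e.g. "public interest"
    personal_properties: List[str]  # data properties holding personal information

@dataclass
class ProvenanceRecord:
    agent: str                      # DC or DP that triggered the activity
    activity: str                   # e.g. "ingest", "transform", "aggregate"
    entity: str                     # identifier of the data resource
    collection_purposes: List[CollectionPurpose] = field(default_factory=list)

# A resource carries its provenance trace as metadata, so every transformation
# appends a record and the collection purposes travel with the resource.
@dataclass
class Resource:
    resource_id: str
    provenance: List[ProvenanceRecord] = field(default_factory=list)

    def current_purposes(self) -> List[CollectionPurpose]:
        # The most recent record holds the purposes that currently apply.
        return self.provenance[-1].collection_purposes if self.provenance else []
```

Because every transformation appends a new record, the latest record always carries the purposes that currently govern the resource, which is the property the rest of this section relies on.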
Fig. 1. Provenance and key concepts
Data provenance mainly represents a lineage of the different activities that data goes through and the information about the involved entities since the data is first inducted into a DL, in order to ensure that the data is reliable [24]. Provenance helps a DP understand the type of data that the resource contains, along with the ownership and retention information about the resource. Provenance originally has three key concepts: Agent, Activity, and Entity, as shown in Fig. 1. The activity is an action or any type of processing activity that creates, modifies, or deletes an entity. An entity is a data resource or object in any format, i.e., structured (tables), unstructured (pictures, videos), or semi-structured (files, social media feeds). Lastly, the agent here is a DC or DP, which initiates or triggers an activity to be performed on the entities [23]. In the case where different resources are aggregated together, their provenances can (ideally) be stitched together to create a deep provenance trace, so no record is missing about how the entity was used by the agents [27]. Put simply, the provenance keeps records of which agent performed a particular activity over a given entity. In this paper, we propose to add 'collection purpose' as a fourth key concept in provenance, in order to also record the reason why an agent is allowed to perform a certain activity over an entity that contains personal information. The 'collection purpose' records the terms of the agreement under which personal information enclosed in data (entity) is to be used (activity) by an authorized DP (agent). The DC, with the explicit (in some cases implicit1) agreement of the relevant DS, initially constructs the 'collection purpose', which can later be modified as the data transforms or aggregates with other sources. (Footnote 1: An example of an implicit agreement is the notice that CCTV is in operation, which shops must display at the entrance so that people who do not wish to be recorded on video can avoid entering the shop; anyone entering is assumed to have accepted being recorded.) After reviewing the basic requirements of different data-protection legislation and the obligations of the DC to instruct the DP about data usage without
being continuously involved (via provenance), we conclude that 'collection purpose' should have the following characteristics defining it, as shown in Fig. 1 and described below:

2.1 Data Description (Structure and Properties)

Every resource (entity) has a data description or data definition, which defines the structure of the data resource without revealing the actual values, similar to a class or type definition. It shows a set of data properties that represent the different types of information stored in the content of the data, whose type and structure can vary based on the data format [3]. Some of them may store or represent personal information (entity attributes), i.e., information that can be linked to a unique natural individual. For instance, a structured resource holding financial transactions has distinct data properties, such as account name and number, depositor, receiver, location, and the date or time when a transaction was made, that describe personal information. Alternatively, an unstructured resource, e.g., an image or video, may represent data identified as different objects (humans, vehicles, buildings) that can be linked directly to a DS. Here, we suggest that the DC explicitly mentions the data properties that contain personal information in the said resource (entity), to indicate that these data properties need to be handled cautiously, as agreed upon.

2.2 Purpose-Property Matching

A resource can have more than one purpose, and each purpose may refer to a different subset of personal data properties. For example, if the resource (entity) is a healthcare record, it can have multiple personal data properties (entity attributes) such as patient name, patient social security number, age, gender, disease, diagnosis, doctor's notes, prescription, etc. There can be different purposes that are bound to a different number or combination of these data properties. For example, if the purpose is 'doctor appointment', then it only requires (name, address), or if the purpose is 'research', then it may only require (age, gender, disease), etc. Hence, not every purpose requires all personal data properties to be revealed, so it is a reasonable approach to bind or map agreed-upon purposes against a specific or required set of personal data properties. Ideally, to ensure data minimization, the DC should define a purpose for all possible subsets of personal data properties (entity attributes), but some combinations may be left undefined for future use, or may need to be redefined or remapped against different data transformations or aggregations with other data sources.

2.3 Compliance Policy

Once a purpose is matched to a specific subset of properties (entity attributes), it can be further mapped to a specific (agent) or (activity), or a combination of (agent and activity), indicating its usage pattern to describe a compliance policy. For example, a purpose 'research' bound to a subset (age, gender, disease) can be further mapped to an activity (read-only) or an agent (Researcher) to indicate that only a certain activity can be performed for the given purpose on the bound properties subset. There can be more than
one attribute, role/permission, or specific requirement needed to describe the (agent) or (activity), and these can be mentioned here. For instance, in a large-scale video surveillance system, where a video recording is an entity, one of the purposes can be to 'detect traffic violation', which is mapped to a subset of entity attributes (vehicle license plate, vehicle owner's name, social security number). A DP (agent) with the certain requirements (entity attributes) of a 'traffic officer' can be mapped to the purpose 'detect traffic violation', which will allow her to access the properties subset (entity attributes) (vehicle license plate, vehicle owner's name, social security number). A purpose can also be bound to an activity, i.e., a 'traffic officer' (agent) can only view (activity) the personal data properties (entity attributes), and not edit, modify, or aggregate (activity attributes) them with another resource (entity). The compliance policy is the key characteristic of the 'collection purpose', as it can assist DL-based systems with access control decisions to limit secondary use. It can be described in terms of conditions, permissions, usage policies, context parameters, or assertions about how a resource should or should not be used, depending upon the nature of the resource, the DC's requirements, and the DS's preferences.

2.4 Aggregation Limitations

If a resource (entity) can create new or enhance existing personal information (entity attributes) when aggregated with another resource (entity) with some particular attributes or data properties, then this should also be recorded here. Different resources may have their own subsets of properties which, when combined, generate new personal information requiring new 'collection purposes' to be defined before the aggregated result can be accessed. For example, a dataset with information about facial identities, if aggregated with video surveillance data, can be used to track the activities of an individual. These two datasets have different types of personal information, the former containing (name, social security number, age, facial mapping), while the latter contains (objects (individuals), associated actions), which when combined reveal current and contextual personal information about the identified individuals. Thus, any (activity) that requires more than one resource, or complements the resource in question, should be mentioned explicitly, so the exposure of new and more revealing (entity attributes) can be managed. Moreover, if the aggregations or their limitations are not mentioned here, then in the case of any new or undefined aggregation, the DC and DS responsible for the existing personal information should be notified, so appropriate measures can be taken to avoid any privacy breach or legal ramification. The aggregation limitations can further be bound to different DP (agents), i.e., whether a particular agent is allowed the specific aggregation (activity) on a given resource (entities), or whether there are rules to limit how particular agents can use the aggregated data properties (entities and activities). For example, a DP (agent) such as a 'police officer' may be allowed to aggregate resources (video surveillance data with a facial recognition database), while a DP (agent) such as a 'traffic monitoring officer' may not. Alternatively, in the case of aggregation, if the target resource also has its own set of 'collection purposes', how will they be combined, and will their provenances be stitched together? Or, a DP requiring access to the aggregated subset must have the authorization to access the source and target subsets individually, i.e., any agent requiring access to the (entity attributes) of
the aggregated resource may be required to present the DL with an authorization for either the combined purposes or distinct authorizations for both purposes separately. A partial-order hierarchy can also be created here with different agents and their authorized activities over different entities to derive new purposes or relationships, in order to evaluate whether agents are allowed certain activities on the aggregated entities or not. Hence, describing how different resources can be aggregated with the given resource is important to control the exposure of newly generated information or entities.

2.5 Legal Base

The 'legal base' is the foundation for lawful (personal) data processing required by different data protection legislation. It means that whenever a DC collects and processes personal data for whatsoever 'purpose/s', there should be specific legal grounds to support it. A legal base is a set of laws/rules that grants the DS rights about how their personal information should be managed. The legal base also binds the DS, DC, and DP, with their respective rights and obligations towards each other. Some examples of valid legal bases supported by the GDPR are Consent (explicit permission to use data), Contract (a formal contract to which the DS is a party), Legitimate Interest (often followed by consent or contract, in which the DC already has the data), Public Interest (processing data in an official capacity for the public interest), Legal Obligation (data processing that complies with law (local, federal, global)), and Vital Interest (data processing in order to save someone's life) [22]. Not all legal bases grant the same rights to the DS; the rights differ by situation, as shown in Table 1.

Table 1. Legal base and DS rights

Legal base / right to | Consent | Contract | Legitimate interest | Public interest | Vital interest | Legal obligation
Access                | ✓       | ✓        | ✓                   | ✓               | ✓              | ✓
Erasure               | ✓       | ✓        | ✓                   | ×               | ✓              | ×
Withdraw              | ✓       | ×        | ×                   | ✓               | ×              | ×
Object                | ×       | ×        | ✓                   | ✓               | ✓              | ×
Informed              | ✓       | ✓        | ✓                   | ✓               | ✓              | ×
Portability           | ✓       | ✓        | ×                   | ×               | ×              | ×
Human intervention    | ✓       | ✓        | ✓                   | ✓               | ✓              | ✓
Restrict handling     | ✓       | ✓        | ×                   | ×               | ×              | ×
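As a minimal sketch of how Table 1 might be operationalized, the mapping below simply transcribes the table so that a DC or an ACM could check whether a DS right applies under a given legal base; the helper function and its name are assumptions for illustration only.

```python
# Transcription of Table 1: which DS rights each legal base supports.
RIGHTS_BY_LEGAL_BASE = {
    "consent":             {"access", "erasure", "withdraw", "informed",
                            "portability", "human intervention", "restrict handling"},
    "contract":            {"access", "erasure", "informed", "portability",
                            "human intervention", "restrict handling"},
    "legitimate interest": {"access", "erasure", "object", "informed", "human intervention"},
    "public interest":     {"access", "withdraw", "object", "informed", "human intervention"},
    "vital interest":      {"access", "erasure", "object", "informed", "human intervention"},
    "legal obligation":    {"access", "human intervention"},
}

def right_supported(legal_base: str, right: str) -> bool:
    """Check whether a DS right applies under a given legal base (per Table 1)."""
    return right in RIGHTS_BY_LEGAL_BASE.get(legal_base.lower(), set())

# Examples: erasure is not supported under public interest, per Table 1.
assert not right_supported("public interest", "erasure")
assert right_supported("consent", "portability")
```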
Rights

The DS is entitled to a set of rights over their personal information, with which the DC is legally obligated to comply. These are the right to be informed and the rights of access, erasure, withdrawal, objection, rectification, restriction of processing, portability, and human intervention, as shown in Fig. 2.

Obligation

The rights of the DS are obligations for the DC, as shown in Fig. 2. Primarily the obligation falls on the DC; however, the DP is obligated to the DC to process such requests from the DS and abide by their rights. The DC and DP must always anticipate which rights might be applicable for the DS considering their data-protection legislation and the supported legal base. In case joint controllers or multiple DC manage resources with personal information and the DS wants to exercise one of her rights, the obligation falls onto the DC who collected the data from the DS in the first place, unless agreed otherwise. The legal base must be part of the 'collection purpose' property, as it plays a significant role in deciding what the DS is entitled to; even when the DC is not directly regulating the DP, this property can help the DP (agent) process due requests from the DS over their personal information. Moreover, it is common for one resource (entity) to have different legal bases for different purposes or activities.
Fig. 2. Rights, obligations and legal base
This section described the different properties that should be specified when designing a 'collection purpose' for resources with personal information. Moreover, the 'collection purpose' should be added as part of the resource metadata, i.e., as a provenance property, and should be followed and preserved through different transformations and aggregations. The 'collection purpose' as a provenance property can be used as a deciding factor in limiting the exposure of personal information per different purposes for different DPs, which will eventually help to limit secondary use and preserve privacy through an access control mechanism. The next section will discuss how 'collection purpose' as a provenance property can be used to preserve privacy in DL-based systems.
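Pulling the five characteristics of Sect. 2 together, a 'collection purpose' entry stored as a provenance property could be sketched roughly as follows; the field names and the healthcare example values are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CollectionPurposeEntry:
    """One 'collection purpose' recorded as a provenance property (Sect. 2.1-2.5)."""
    data_description: List[str]                 # personal data properties of the entity
    purpose_property_map: Dict[str, List[str]]  # purpose -> required personal properties
    compliance_policy: Dict[str, List[str]]     # purpose -> allowed agent/activity descriptors
    aggregation_limits: List[str]               # aggregations that require new purposes/consent
    legal_base: str                             # e.g. "consent", "public interest"

# Illustrative healthcare example following Sect. 2.2 (all values assumed).
health_record_purpose = CollectionPurposeEntry(
    data_description=["name", "address", "age", "gender", "disease", "diagnosis"],
    purpose_property_map={
        "doctor appointment": ["name", "address"],
        "research": ["age", "gender", "disease"],
    },
    compliance_policy={
        "research": ["agent:Researcher", "activity:read-only"],
    },
    aggregation_limits=["must not be linked to billing records without new consent"],
    legal_base="consent",
)
```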
3 Privacy and Access Control in Data Lakes

Privacy in large-scale distributed systems is concerned with how a DP is using a particular resource, and an access control mechanism (ACM) helps in regulating access to those resources (per defined policies) for different users (DP). The ACM takes into account the properties of users (agents) and resources (entities) and authorizes actions (activities) over the resources in any given system. For every resource or type of resource, there is a defined resource policy, i.e., a set of conditions that the DP must meet to obtain access to a resource. Moreover, once the user is authorized to access a certain resource, the 'collection purpose' of that resource is followed to limit the exposure of the resource to any DP according to the shared agreement between the DC and DS. Hence, privacy is defined here as using an authorized resource solely for the agreed-upon 'collection purpose/s'. Here, we will differentiate between the 'collection purpose' of the resource and the 'access purpose' of the DP. The former describes the terms of the agreement between DS and DC, while the latter describes the terms of resource usage between DC and DP. Mostly, the 'access purpose' should be a subset of the 'collection purpose', as the latter may cover a broader scope of all the agreed activities or transformations allowed to be performed on the resource, while the former is more DP-specific. Hence, privacy-preserving ACMs need to ensure that the DP's authorization or 'access purpose' complies with the resource's 'collection purpose' when making an access control decision. The next subsection summarizes some of the contemporary purpose-based ACMs proposed for large-scale infrastructures as well as ACMs designed for data lakes.

3.1 Related Work

The tremendous increase in data sharing and the rise of distributed systems in the last decade have put a lot of focus on privacy. Data is stored, processed, and shared through large-scale infrastructures like relational databases, unstructured data repositories, and DL that accommodate the different forms of data in one place. All these infrastructures have different requirements for access control mechanisms to regulate access to their stored data for different DPs and different purposes. Here, we briefly summarize how different large-scale data infrastructures containing resources with personal information are managed through different ACMs. To preserve privacy in relational databases, a hierarchical purpose-tree-based ACM has been proposed [11–15]. The system or DC maintains a purpose-tree, where each node represents a 'collection purpose' while each edge
represents a hierarchal relationship between the parent and the child node ‘collection purpose’. ‘Collection purpose’ is bound to a set of data elements or columns in a relational database. In parallel, there is also a role hierarchy, and roles are assigned different ‘access purpose’ (a subset of purpose-tree nodes) based on their authorized data requirements based on role-based access control (RBAC). More than one ‘access purpose’ can be assigned to one role, or more than one role can have the same purpose. A DP or authorized user is to present an ‘access purpose’ as a role attribute when she requests a certain resource. The system then evaluates the ‘access purpose’ against the central purposetree and makes an access decision [11]. Another solution based on conditional purpose and dynamic roles is proposed in [12]. The purposes are categorized into three groups: conditional purpose, allowed purpose, and prohibited purpose, which are assigned to different conditions of dynamic roles based on different user and contextual attributes. In this paper, resources are dynamically assigned different ‘collection purposes’ during the access decision, and if the ‘access purpose’ has the authorized values of contextual or dynamic attributes then the relevant resource against verified ‘collection purpose’ is allowed [13]. Most of the state-of-the-art purpose-based ACMs know about the structure and nature of the data and policies are designed for specific users as their data requirements are known in advance, so it is easy to associate a ‘collection purpose’ to the entire table, or a few column, or tuples. However, this is difficult in DL due to the lack of a priori knowledge about a large number of dynamic DP and their ‘access purposes’ as well as the ‘collection purposes’ of transformed resources. Moreover, purposes defined in such methods rely heavily on DC’s knowledge about resources and their usage, while if multiple DCs regulate access to distributed resources, then it requires ‘collection purposes’ to be described consistently with characteristics that are accepted by different DCs, so they can design ‘access purposes’ of their DPs accordingly. Diversity in data formats and management approaches by different DC along with the dynamic access to that data makes traditional access control methods difficult to implement in data lakes. To address the diversity data challenge and provide uniform access across a DL, a concept of Semantic Data Lake has been proposed [27]. It presents a middleware framework that requires data sources to prepare data on certain criteria before injecting data in data lakes, and then the middleware can derive mapping between resource’s data attribute and semantic DL’s ontology (i.e. structure of entity attributes) of different data concepts for better access control. This can provide a sense of homogeneity, and data in different formats can be queried based on the formal concepts designed by the semantic data lake middleware. An attribute-based ACM is also proposed for the commonly used Hadoop framework to implement distributed data lakes [26]. Users, resources, and the Hadoop environment have their defined attributes, which are considered when making an access control decision. Resources with similar attributes are grouped in a cluster, which is then assigned a set of permissions based on the operations that can be performed on these resource clusters. 
These permissions or policies include different (uniform) tags that are associated with these resource clusters as part of a distributed data lake. These tag-based policies are then assigned to different roles. Users are assigned different roles and based on similar roles they are assigned to different groups and these groups are then defined in a hierarchy to manage a large number of users or DPs in a DL. Users assigned to a
parent or higher group gets all the roles of its junior groups, which in turn also get the assigned tag-based permissions to resource clusters. This group hierarchy is beneficial for efficient role management in large-scale DL, but it is not very useful in limiting secondary use, because policy-based access is inherited without considering the individual access control requirements of individual users or DPs. In another approach, the authors have proposed a purpose-based auditing solution, where 'collection purpose' is bound to different business processes, and formal methods of inter-process communication are then used to verify those purposes against different policies, to show that they comply with certain data protection legislation like the GDPR [25]. Hence, ACMs designed for large-scale DL are generally based on deriving homogeneous semantic concepts for different data formats and then using these mappings to assign authorizations to different DPs. Other solutions use hierarchical authorizations where permissions or 'access purposes' are assigned to a large set of users against a cluster of resources, and DPs (with similar requirements) inherit permissions to these resources. These approaches do provide efficient management of DPs and resources, but they leave a potential gap for secondary use because they do not consider 'collection purposes' at a fine-grained or individual resource level while making an access control decision [26]. One of the reasons is that it is hard for the DC or the system to maintain the 'collection purpose' for all resources and then communicate it to different DPs in the case of data transformations. However, if the DC can ensure that updated 'collection purposes' of the resources are available at the time of making an access control decision, then they can be evaluated against the 'access purpose' of the DP. Therefore, we propose that the DC add the 'collection purpose' as part of the resource provenance, so it will be available along with the resource, and ACMs can consider the 'collection purpose' from the resource while making an access control decision.

3.2 Provenance-Purpose-Based ACM

The Data Controller (DC) is the key entity with authority over 'why' and 'how' the resource is processed. It defines the 'collection purpose' for which the data is processed and may further bind different subsets of distinct responsibilities, or the 'access purpose', to one or many DP. The DC records the 'collection purpose' as a provenance property and adds it to the resource metadata. The 'collection purpose' has different characteristics that define how the data can be used, as discussed in Sect. 2, such as data properties, purpose-property mapping, aggregation limitations, compliance policy, etc. These resources are then introduced into distributed DLs, where DPs managed by different DC can request access to them. Here, we assume that even though a DC does not delegate access to all the existing or future DP, every DP will be authorized by one of the associated and verified DC of the distributed DL; this DC has the authority to define and authorize the basic DP requirements or their 'access purpose'. Yet, DPs may have different dynamic and contextual factors affecting their data requirements, along with many cross-data transformations [19]. Therefore, it is often hard to impose direct and static permissions, especially when the DP is not aware of the usage limitations or 'collection purpose' of the transformed data.
In that case, if the resource carries information that can validate how it can be used for a certain 'collection purpose', and the DP has authorized requirements
to use the said type of resource for the same 'access purpose' (or a subset of it), then this can act as indirect permission from the DC. To cater for this, the ACM can use the provenance of the requested resource to extract the resource's 'collection purpose' and evaluate it against the 'access purpose' of the DP. Therefore, any authorized DP with a set of approved requirements (ideally, a subset of the collection purpose) can be allowed to access the resource or its personal information per the given access purpose, as shown in Fig. 3.
Fig. 3. Purpose and access control mechanism
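A minimal sketch of the decision step in Fig. 3 follows, under the assumption that the ACM can read the resource's provenance metadata as a plain dictionary: the 'collection purposes' are extracted from the provenance and the DP's 'access purpose' is matched against them, exposing only the personal properties that purpose allows. The function names and metadata layout are illustrative, not a prescribed interface.

```python
from typing import Dict, Set

def extract_collection_purposes(resource_metadata: Dict) -> Dict[str, Set[str]]:
    """Read the purpose -> personal-property mapping out of the resource provenance."""
    return {p["name"]: set(p["properties"])
            for p in resource_metadata.get("provenance", {}).get("collection_purposes", [])}

def decide_access(resource_metadata: Dict, access_purpose: str,
                  requested_properties: Set[str]) -> Set[str]:
    """Return the subset of requested personal properties the DP may see (empty = deny)."""
    purposes = extract_collection_purposes(resource_metadata)
    if access_purpose not in purposes:
        return set()                              # no compatible collection purpose: deny
    return requested_properties & purposes[access_purpose]   # expose only what is allowed

# Assumed example: a DP authorized for 'traffic operation management' asks for two properties.
resource = {"provenance": {"collection_purposes": [
    {"name": "traffic operation management",
     "properties": ["vehicle-license-plate", "driver-face", "timestamp"]},
]}}
granted = decide_access(resource, "traffic operation management",
                        {"vehicle-license-plate", "pedestrian-identity"})
print(granted)   # {'vehicle-license-plate'} -- the unrelated property is filtered out
```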
A distributed DL will maintain an integrated or central purpose-tree hierarchy to list and order the 'collection purposes' of the different resources injected into the DL. The DC that is responsible for a resource will only be allowed to add and modify the 'collection purpose' nodes relevant to that resource in the tree. Other DCs can design 'collection purposes' for their authorized DPs in case an aggregation is required. This is similar to role authorization concepts, but instead of assigning permissions for either specific resources or purposes, permissions are now bound to both purpose and resources. This means that, generally, permissions against a resource are bound to a 'collection purpose', and a DP is assigned a role authorized for that 'collection purpose'. Following the presented approach, a DP who is authorized to access a resource also needs to have a compatible 'access purpose' against the 'collection purposes' of the resource. This provides a layered approach: the DP is first evaluated on whether she is authorized to access the resource or not, and then access to personal information is controlled based on the compatibility between the collection purpose and the access purpose, as shown in Fig. 4. The next section presents a case study describing how the presented approach can be applied.

Fig. 4. Exposure of personal information per collection and access purpose

Evaluation: DL-Based Smart City

Let us take the example of a DL-based smart city infrastructure, as shown in Fig. 5. A smart city infrastructure represents a distributed processing and storage infrastructure, which accumulates data/resources from multiple (public-authority) DC and then offers them to thousands of DP in the form of different services or applications [18]. The DL stores data in different formats from various resources such as Video Surveillance Systems
(unstructured), public transportation data (traffic signals, parking enforcement sensors), weather monitoring data, open navigation data (semi-structured), public administrative databases (structured), etc. [16]. Many of these resources contain personal information, which is collected under different legal bases and for various ‘collection purposes’. The typical smart city has thousands of DP, such as traffic law-enforcement systems, infrastructure and planning department, law enforcement officers, emergency services personnel, etc., who all need different types of information from the above-mentioned resources per their authorized requirements or ‘access purposes’ [17]. In order to access a certain resource or an aggregation of more than one resources, the DP sends requests to the ACM, which then evaluates their ‘access purposes’ against the ‘collection purpose’ of the resource/s (as one of the access control parameters) and if verified, the DP is granted access to that resource. It is important to note here that there are a lot of different types of user, resource, and contextual attributes involved in access control decisions in large-scale infrastructures [20]. However, here the paper aims to only show how ‘collection purpose’ as a provenance property can influence access control decisions. There are various DC that are responsible for contributing and managing different resources in a typical smart city, as shown in Fig. 5. For instance, the DC-A is responsible for resource video-surveillance data, and it has three ‘collection purposes’: public safety, traffic operation management, and infrastructure management [21]. There are other DC, which are collecting different resources for different ‘collection purposes’, but they can also have similar purposes such as resource vehicle- registration data also has traffic-operation management as one of its ‘collection purposes’. We will discuss traffic-operation management in detail and Table 2 shows the defined characteristics for the given ‘collection purpose’, as discussed in Sect. 2. The described ‘collection purpose’ will become part of the resource provenance, as well as a node in the central purpose-tree hierarchy, and will be updated if any transformation or aggregation modifies the content of the resources, as discussed in Sect. 2.4. There can be different DPs who are authorized to access the smart-city resources (original, transformed, or aggregated with other resources) for different ‘access purpose’ like Traffic-Law-Enforcement System DP, whose requirements are shown in Table 3. When DP (Traffic-Law-Enforcement System) requests a resource (video surveillance
Fig. 5. Purpose and access control mechanism in smart city
Table 2. Collection purpose – traffic operations management

Purpose: Traffic operation management.

Resource description: Mass video surveillance data collected at public locations by video cameras.

Personal data properties: As videos contain unstructured data, there are no predefined data characteristics; based on the processing capability for extracting personal information, the relevant properties can be identified as: object type (humans, vehicles); object-descriptive features (gender, color, estimated age and height); object-identification features (face, gait, license plate); geo-location data (spatio-temporal position of any object at a specific time); locations (highways and other cameras along the road capturing traffic only); devices (video camera types and unique IDs); and a timestamp of the recording.

Personal data property–purpose mapping: Vehicle license plate, driver's face -> {traffic light violation, speeding vehicle, wrong parking, wrong turn, driving in a bus lane, junction-box violation}. Vehicle license plate, human face -> {accident/vehicle collision, seat belt, child detected without a child seat, etc.}. *An exhaustive list should be defined for all the properties against their usage requirements.

Compliance policy: The data will be used for public-interest reasons:
1. To record, process, and store any event or object that demonstrates a traffic operation or violation (traffic light violation, speeding vehicle, wrong parking, wrong turn, driving in a bus lane, junction-box violation, accident/vehicle collision, seat belt, child detected without a child seat, etc.).
2. To record, process, and store any event or object that demonstrates passenger handling, non-compliance with traffic regulations, or anything that hinders or stops routine or smooth traffic operations.
3. To record, process, and store events and objects involved in routine traffic operations.
4. To record, process, and store events and objects involved in parking management.
*An exhaustive list should be defined for all the applied 'purposes'. **The data cannot be used for tracking any event or object that is not mentioned in the 'purpose' unless otherwise authorized by another legal base or a higher authorized DC.

Aggregation limitations: Aggregating the said resource with any other resource that can (a) link a license plate to a unique DS, (b) link the descriptive features of a human to the identification features of a unique DS, or (c) link the descriptive features of a human to the geo-location features of a unique DS, requires specific authorization from the public-authority DC supported by a legal base of Consent, Legal Obligation, or Vital Interest.

Legal base: Public interest.
recording), it sends a request (including its 'access purpose') to the ACM, indicating that it is authorized to use the requested resource for the given 'access purpose'. The ACM then compares the 'access purpose' with the extracted 'collection purpose' of the video surveillance recordings, which in this case is traffic-operations management. One of the stipulations of the 'traffic-operations management' purpose states that this resource can be used for the requested 'access purpose', i.e., it can be used for a public-interest reason to record, process, and store any event or object that demonstrates a traffic operation or violation. If the same DP had requested the resource with an 'access purpose' of 'detect or identify a pedestrian', then the ACM would have denied the request, as this is not stated in the purpose, thereby restricting secondary usage. This example shows only the basic conceptual framework of how 'collection purpose' as a resource provenance property can be used to affect access control decisions.
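The grant and deny outcomes of this example can be reproduced with a small purpose-tree check, where an 'access purpose' is compatible if it equals, or descends from, one of the resource's recorded 'collection purposes'; the tree fragment below is an assumption sketched from the scenario, not an actual smart-city policy.

```python
# Assumed fragment of a central purpose-tree: child purpose -> parent purpose.
PURPOSE_TREE = {
    "issue a fine on traffic-violation": "traffic-operations management",
    "detect traffic violation": "traffic-operations management",
    "detect or identify a pedestrian": "public safety",
    "traffic-operations management": None,
    "public safety": None,
}

def is_compatible(access_purpose: str, collection_purposes: set) -> bool:
    """Walk up the purpose-tree; compatible if some ancestor is a collection purpose."""
    node = access_purpose
    while node is not None:
        if node in collection_purposes:
            return True
        node = PURPOSE_TREE.get(node)
    return False

# The requested recording is evaluated against 'traffic-operations management'.
video_surveillance_purposes = {"traffic-operations management"}
print(is_compatible("issue a fine on traffic-violation", video_surveillance_purposes))  # True
print(is_compatible("detect or identify a pedestrian", video_surveillance_purposes))    # False
```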
Table 3. DP's access purpose

DP: Traffic-law-enforcement system.

Authorized resources: Video surveillance recordings, vehicle registration database, real-time updates from traffic-related sensors.

Aim: To enforce traffic laws by capturing and processing any incident that results in a traffic violation, and to issue fines and penalty points on a license based on identification from the intended resources, where applicable.

Authorized resource requirements:
1. Detect and identify a traffic event that is considered a violation, either via video recording or sensor reading (e.g., speeding).
2. Identify the object-type vehicle (through its license plate or the driver's identification information) from video data, and in case of a detected traffic violation issue a fine and penalty points to the object-type driver, if applicable.
*Mention an exhaustive list of all the traffic operations and violations that are supported by the given legal base and obtainable from the available resource objects.

DP authority period: Jan 1, 2020 to Jan 1, 2021.
A DP can be authorized to access different resources managed by different DCs for similar or different 'access purposes'. These 'access purposes' may allow the DP to aggregate different resources to perform their tasks. For instance, the DP (Traffic-Law-Enforcement System), with an 'access purpose' of 'issue a fine on traffic-violation', is allowed to aggregate a video surveillance recording of a detected traffic violation (to view the license plate of the involved vehicle) with the vehicle registration database, which, as shown in Fig. 6, is allowed under the 'collection purpose' of 'traffic-operations management'.
Fig. 6. Collection purposes and access purposes
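In the same spirit, the aggregation case of Fig. 6 could be screened before any join is executed; the rule contents, agent names, and resource identifiers below are assumptions used only to illustrate how recorded aggregation limitations might be consulted.

```python
from typing import Dict, FrozenSet, Set

# Assumed aggregation rules recorded with the source resource's 'collection purpose':
# which agents may combine which pairs of resources, and under which access purpose.
AGGREGATION_RULES: Dict[FrozenSet[str], Dict[str, Set[str]]] = {
    frozenset({"video-surveillance", "vehicle-registration"}): {
        "traffic-law-enforcement system": {"issue a fine on traffic-violation"},
    },
    frozenset({"video-surveillance", "facial-recognition-db"}): {
        "police officer": {"criminal investigation"},
    },
}

def may_aggregate(agent: str, access_purpose: str, resources: Set[str]) -> bool:
    """Allow the join only if an explicit rule names this agent and purpose."""
    allowed = AGGREGATION_RULES.get(frozenset(resources), {})
    return access_purpose in allowed.get(agent, set())

print(may_aggregate("traffic-law-enforcement system", "issue a fine on traffic-violation",
                    {"video-surveillance", "vehicle-registration"}))          # True
print(may_aggregate("traffic monitoring officer", "criminal investigation",
                    {"video-surveillance", "facial-recognition-db"}))          # False
```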
To sum up, in large-scale DL-based infrastructures there are many resources with several 'collection purposes', and thousands of DPs with different requirements or 'access purposes' for one or more of these resources. Therefore, instead of directly mapping resources' 'collection purposes' to the 'access purposes' of a large number of known and unknown DPs, the 'collection purpose' can be described as a resource provenance property. The DC can then design access control policies based on the resource's 'collection purpose', rather than binding them to a fixed set of DP requirements. The 'collection purpose' is described with the broader scope of what is 'possible' and legally 'allowed' for the resource, while a DP's 'access purposes' are more targeted and specific to her authorized tasks, restricting her to access data only if she meets the requirements of the 'collection purpose'. This also leaves room for future DPs whose 'access purposes' fall under an already defined 'collection purpose', so their roles or authorizations do not need to be updated explicitly while secondary use remains restricted.
4 Conclusion

Integrated data lakes (DL) store data in various formats from different controllers (DC) and owners without pre-defined schemas and without data processor (DP)-oriented policies. If the data contains personal information, then data protection regulation requires the DC to ensure that data access respects the 'collection purpose', i.e., to limit how the data can or cannot be used. In distributed DL-based infrastructures, such as smart cities, it is difficult for multiple DC to manage duteous resource authorizations for thousands of DP with diverse requirements, so often a DP is not explicitly aware of resource or data usage limitations. This may lead to privacy violations like secondary use; therefore, the DC should ensure that the DP is unable to access or use a resource other than as agreed upon by the data subject (DS) or permitted by law. Thus, the 'collection purpose' of the resource should be available whenever the resource is requested, so it can be evaluated when an access control decision is made. To ensure the transparency and availability of the 'collection purpose', it can be appended as a resource provenance property so that it becomes part of the metadata, while being constructed judiciously to contain all the requisite characteristics required to express data usage limitations to a DP. This way, when a DP requests a resource, the access control mechanism can compare its authorized 'access purpose' with the recorded 'collection purpose' of the resource (even when the resource is transformed or aggregated with other resources), and only grant access if the purposes comply. This paper introduced a framework for a provenance-purpose-based ACM suitable for large-scale and shared DL infrastructures to demonstrate that the 'collection purpose' as a resource provenance property can be used to regulate usage limitations for resources with personal information, thus preventing secondary use.
References 1. Derakhshannia, M., Gervet, C., Hajj-Hassan, H., Laurent, A., Martin, A.: Life and death of data in data lakes: preserving data usability and responsible governance. In: El. Yacoubi, S., Bagnoli, F., Pacini, G. (eds.) INSCI 2019. LNCS, vol. 11938, pp. 302–309. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-34770-3_24
2. Hai, R., Geisler, S., Quix, C.: Constance: an intelligent data lake system. In: Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16) (2016) 3. Christoph, Q., Hai, R., Vatov, I.: GEMMS: A Generic and Extensible Metadata Management System for Data Lakes. CAiSE Forum (2016) 4. Van den Hoven, J., Blaauw, M., Pieters, W., Warnier, M.: Privacy and information technology. In: Zalta, E.N. (ED.) The Stanford Encyclopaedia of Philosophy, Winter 2019 Edition (2019) 5. Wenning, R., Kirrane, S.: Compliance using metadata. In: Hoppe, T., Humm, B., Reibold, A. (eds.) Semantic Applications, pp. 31–45. Springer Berlin Heidelberg, Berlin, Heidelberg (2018). https://doi.org/10.1007/978-3-662-55433-3_3 6. Responsibility of the controller (2018). https://gdpr-info.eu/chapter-4/ 7. Simon, et al.: Summary of Key Findings from California Privacy Survey. Goodwin Simon Strategic Research, CCPA (2019) 8. Nogueira, I.D., Romdhane, M., Darmont, J.: Modeling data lake metadata with a data vault. In: Proceedings of the 22nd International Database Engineering & Applications Symposium (IDEAS 2018) (201) 9. Baum, D.: Cloud Data Lakes For Dummies®, Snowflake Special Edition. John Wiley & Sons, Hoboken (2020). http://staging.itupdate.com.au/assets/snowflake/snowflake-clouddata-lakes-for-dummies-special-edition.pdf 10. He Li, L., He, W.: The impact of GDPR on global technology development. J. Glob. Inf. Technol. Manag. 22(1), 1–6 (2019). https://doi.org/10.1080/1097198X.2019.1569186 11. Bertino, E., Zhou, L., Ooi, B.C., Meng, X.: Purpose based access control for privacy protection in database systems. In: Database Systems for Advanced Applications. DASFAA (2005) 12. Kabir, M.E., Wang, H., Bertino, E.: A conditional purpose-based access control model with dynamic roles. Expert Syst. Appl. 38(3) (2011) 13. Kabir, M.E., Wang, H.: Conditional purpose-based access control model for privacy protection. In: Proceedings of the Twentieth Australasian Conference on Australasian Database , vol. 92 (ADC ’09) (2009) 14. Colombo, P., Ferrari, E.: Enhancing MongoDB with purpose-based access control. IEEE Trans. Depend. Secure Comput. 14(6), 591–604 (2017). https://doi.org/10.1109/TDSC.2015. 2497680 15. Wang, H., Sun, L., Bertino, E.: Building access control policy model for privacy preserving and testing policy conflicting problems. J. Comput. Syst. Sci. 80(8), 1493–1503 (2014). https:// doi.org/10.1016/j.jcss.2014.04.017 16. World Wide Web Consortium (W3C). A platform for Privacy Preferences (P3P). http://www. w3.org/P3P/ 17. Miller, J.A.: Smart Cities Are Harnessing the Power of Data Lakes for Social Good. Accessed 16 Aug 2019. www.nutanix.com/theforecastbynutanix/technology/smart-cities-harnessingpower-of-data-lakes-for-social-good 18. Li, W., Batty, M., Goodchild, M.F.: ’Real-time GIS for smart cities’. Int. J. Geogr. Inf. Sci. 34(2), 311–324 (2020) 19. Strohbach, M., Ziekow, H., Gazis, V., Akiva, N.: Towards a big data analytics framework for IoT and smart city applications. In: Xhafa, F., Barolli, L., Barolli, A., Papajorgji, P. (eds.) Modeling and Processing for Next-Generation Big-Data Technologies. MOST, vol. 4, pp. 257–282. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-09177-8_11 20. Moustaka, V., Vakali, A., Anthopoulos, L.: A systematic review for smart city data analytics. ACM Comput. Surv. 51(5), 1–41 (2019). https://doi.org/10.1145/3239566 21. Guidelines 1/2020 on processing personal data in the context of connected vehicles and mobility-related applications (2020). 
https://edpb.europa.eu/sites/edpb/files/consultation/edpb_guidelines_202001_connectedvehicles.pdf 22. Lawfulness of processing (2018). https://gdpr-info.eu/art-6-gdpr/
23. Suriarachchi, I., Beth, P.: A case for integrated provenance in data lakes. In: Crossing Analytics Systems (2016) 24. Suriarachchi, I., Beth, P.: Crossing analytics systems: a case for integrated provenance in data lakes. In: 2016 IEEE 12th International Conference on e-Science (e-Science), Baltimore, MD, pp. 349–354 (2016) 25. Basin, D., Debois, S., Hildebrandt, T.: On purpose and by necessity: compliance under the GDPR. In: Meiklejohn, S., Sako, K. (eds.) FC 2018. LNCS, vol. 10957, pp. 20–37. Springer, Heidelberg (2018). https://doi.org/10.1007/978-3-662-58387-6_2 26. Gupta, M., Patwa, F., Sandhu, R.: An Attribute-Based Access Control Model for Secure Big Data Processing in Hadoop Ecosystem (2018) 27. Oliveira, W., de Oliveira, D., Braganholo, V.: Experiencing PROV-Wf for provenance interoperability in SWfMSs. In: Ludäscher, B., Plale, B. (eds.) IPAW 2014. LNCS, vol. 8628, pp. 294–296. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-16462-5_38
Cybercrime in the Context of COVID-19 Mohamed Chawki(B) International Association of Cybercrime Prevention (AILCC), 18 rue Ramus, 75020 Paris, France [email protected]
Abstract. The ongoing COVID-19 pandemic has captured the attention of the entire world. Every aspect of life in our current world has been affected by the advancement of information and communication technologies: governments, businesses, and most other sectors now carry out their transactions through online communication and networking. This increased reliance on online activities has triggered a surge in various kinds of illicit activities targeting online users. In cyberspace, offenders use the Internet as a tool to reveal and access vulnerabilities in the security systems of Internet users, which enables them to inflict harm through theft and other illegal acts. Various new techniques and regulations have been implemented in order to counteract COVID-19 online scams, but the rise of cybercrime has still given offenders room to carry out their illegal activities. The main objective of this article is to address the causes of COVID-19 online scams and to analyze them accordingly. Secondly, it surveys the contemporary techniques used by scammers to harvest information from victims. Finally, taking the USA and the European Union as a case study, the article highlights possible and efficient ways to tackle COVID-19 scams. The article concludes with a discussion of the plausible strategies and steps that are available to protect Internet users from such scams and mitigate the associated risks.

Keywords: Coronavirus pandemic · COVID-19 · Cybercrime · Online scams · Cybersecurity
1 Introduction

In December 2019, a new virus was discovered in China that caused atypical pneumonia [8]. A few days later, it manifested as a disease that would threaten the population of the entire world. Within three months the pathogen brought global society to a virtual standstill with an unprecedented reduction in activity (ibid). The virus, which was eventually designated SARS-CoV-2 (commonly, COVID-19), triggered a pandemic (ibid). It has already taken tens of thousands of lives [27]. More deaths have been reported since the pandemic began, and as of April 6, 2021, infection cases had reached 132,542,680 and deaths exceeded 2,876,200 (ibid). The effects of the pandemic on the current world range from widespread job losses to a downward spiral in the economy; public
health issues have also been a major consequence of the pandemic [18]. Some cybercriminals have used COVID-19 to their advantage, imposing their illegal schemes on members of the community. Software as a Service (SaaS) providers and other tech vendors have been unable to respond effectively (ibid). According to information from CNBC, 36% of surveyed managers reported that cyberattacks had increased rapidly since the beginning of the pandemic forced employees to work from home (ibid). Since the COVID-19 pandemic began, airline organizations and authorities like the WHO have frequently reported criminal impersonations and cases such as fraud involving personal protective equipment and offers of fake COVID-19 cures [14]. The main targets of the criminals conducting these cybercrimes are people working at home and the general public at large. Work-from-home arrangements have caused a rise in cybersecurity incidents like never seen before. Cybercriminals have used the opportunity to grow their attacks using classical deceit that preys on the worries, stress, and heightened anxiety that people face (ibid). Working from home has also exposed the vulnerabilities of software vendors, especially with regard to security (ibid). Information from Recorded Future, a threat-intelligence firm, found that most online offenders operated by creating fake websites, sending fake emails, and using logos and brands from reputable public health organizations [9]. Notably, the most impersonated organizations are healthcare bodies like the CDC and the WHO departments based in the U.S.; government offices and reputable authorities responsible for public health have also been impersonated. For instance, in countries like Italy, where the pandemic hit hardest, people stated that they received emails with coronavirus-related subject lines purportedly sent by WHO employees. The emails had a Word document attached that contained a Trojan horse whose main purpose was to harvest people's personal information (ibid). This new form of cybercrime has accentuated the contrast between real-world crime and electronic cyber-fraud. There are some noticeable challenges that distinguish real-world fraud from cyber-fraud: (a) there are complications in identifying cybercriminals [24]; and (b) even though research has been conducted over the last twenty years, the scale and instances of cybercrime have proven to be far greater than what is currently observed and reported (ibid). Computer science is continually trying to identify the possible threats and the forms by which cybercrimes are conducted in order to formulate solutions that eliminate those threats. On the other hand, sociological and criminological studies have been trying to determine the main causes of the victimization risks that arise in online society at a micro or individual level. Despite all the efforts that are put in place to determine the causes of and possible solutions to cybercrime, there is still a big gap to be filled in the field, especially regarding the illegal behavior associated with cybercrime. Considering that the information above is generalized, and that the statistical data are based on individuals scammed in 2020, it is difficult to determine effective ways of eliminating the ongoing cybercrime issues.
In order to address these limitations, this article seeks to analyze the following issues. Firstly, the problem of COVID-19 online scams and the factors facilitating them. Secondly, it provides an analysis of the recent techniques used by scammers to
harvest the personal information of victims. Finally, an analysis of the existing legislative and regulatory framework and their efficiency in combating this form of organized crime will be provided, taking the European Union and the United States as a case study. The article concludes with a discussion on the plausible strategies and steps that are available to protect Internet users from such scams and mitigate the associated risks.
2 The Problem of COVID-19-Related Online Scams

COVID-19-related online scams pose an enormous threat to an already stressed socio-economic scenario across the globe. Such scams employ a diverse set of methods and operate at different scales, spanning from the illegal use of a credit card to online identity theft [2]. Law enforcement agencies find it hard to arrest cybercriminals [12]; the use of information and communication technologies (ICTs) by cybercriminals to mask their identities and their true locations has been identified as a major challenge faced by law enforcement agencies [10]. Despite the apparent relief from office life that working from home offers, the need to remain inside means Internet users will spend more time reviewing emails, reading news, and shopping [3]. Regardless of the means of accessing the Internet, whether through a personal device or a company laptop, more time spent online means more exposure to scams, malware, and phishing. Cybercriminals attempt to exploit our curiosity and basic necessities, such as the need to buy medical or personal hygiene equipment and other goods (ibid). They prey on the common desire to learn more about the symptoms and how to survive potential exposure to COVID-19 by sending phishing emails that promise exclusive information in the form of attachments and links to protective gear at deeply discounted prices (ibid). Many even ask for Bitcoin donations and claim they will support research in the search for a COVID-19 vaccine (ibid). As the WHO, the CDC, and other reputable agencies related to health command high authenticity and authority among the general public, impersonation of offices and officials from these agencies has been a focus of cybercriminals in the recent past [25]. To lure the victims (ibid), downloadable documents or website URLs disguised as a COVID-19 safety document released by a reputable institute, such as the Johns Hopkins Center for Systems Science and Engineering (CSSE), have been reported in recent cyberattacks. The activities of cybercriminal forums also support the assertion that reputable healthcare agencies are the prime target in COVID-19-related scams. In February 2020, when the outbreak was not as global as it is now, a thread was initiated in the Russian-language cybercriminal forum XSS to advertise a piece of COVID-19 outbreak-themed phishing software. The advertiser claimed that the software could be used for malware distribution via an email attachment disguised as an outbreak-related map with real-time information from the WHO. The map used for this purpose was an impersonation of an authentic map from the CSSE (ibid). In another scam, an official WHO email was impersonated for phishing attacks. The email contained a link to a document providing information about the methods to prevent the spread of the outbreak. However, on clicking the link, the victims were redirected to a malicious domain for credential harvesting (ibid). The risk of unlawful impersonation is not restricted to the WHO and other similar organizations. There has been a significant increase in the number of COVID-19-related
domain registrations since January 2020. According to Digital Shadows, more than 1,400 domains had been registered by March 9, 2020 (ibid). According to that blog, some of these domains can be used for cybercrimes such as hosting phishing pages and fraudulent impersonation of brands and items (ibid).
3 Factors Facilitating COVID-19 Online Scams

Governments throughout the world are putting extraordinary efforts into containing and controlling the spread of COVID-19. Boosting public health systems, maintaining public order and safety, and minimizing the adverse impacts on the economy are the major aspects of global efforts in this direction [23]. However, some of these measures have presented an opportunity for the people associated with organized crime and other illicit activities (ibid). Criminals have quickly exploited the crisis by adapting their operational modes and expanding their ranges of illicit activity (ibid). Some of the critical factors that have contributed to the increase in cybercrime during the outbreak are as follows (ibid).

• The demand for protective clothing, ventilators, and other medical-related products has risen sharply.
• Movement and travel have been reduced significantly among people in all sectors.
• Dependency on digital solutions has increased, given that people have to stay at home for an extended period and telecommute.
• People have become psychologically vulnerable to exploitation due to pandemic-related anxiety, stress, and fear.
• The supply of illicit goods has dropped.
4 The Scope of the Problem

Having examined the various factors that contribute to the increase in online scams due to the pandemic, it is important to consider recent statistics on COVID-19 online scams. According to Forbes, victims in the United Kingdom had lost around $1 million to scammers [15]. Reports from the national reporting centres in the United Kingdom indicate an increase in cybercrime and fraud, with reported scams in March 2020 approaching £970,000 (ibid). RiskIQ, a threat-assessment firm, indicated that the number of COVID-19-related domains spiked to more than 34,000 in a single day (ibid). Amazon, for instance, removed more than a million fake COVID-19-related products from its marketplace (ibid). Domain registrations have increased considerably since the beginning of the COVID-19 pandemic (ibid). The chart below (Fig. 1) indicates the increase in domain registrations related to COVID-19.
Fig. 1. Graphical representation of the increase in COVID-19 related domain registrations [12]
Since March 1, 2020 several reports have indicated an increase in phishing emails (ibid). One report indicated a 667% increase in phishing attacks since March 1, 2020 as cybercriminals tried to cash in on the COVID-19 outbreak (ibid). F-Secure also stated that most of the phishing emails were sent with doc, pdf, or zip attachments, which accounted for 85% of the attachment types used in spam (ibid). The threat-intelligence firm Mimecast stated that its email security systems had detected more than 24 million phishing emails targeting victims looking for coronavirus information [4]. That was slightly more than 16% of the more than 150 million emails the company had scanned during the same period of the previous year (ibid). During the five-day workweek ending on Friday, March 20, 2020, COVID-19 phishing was found to constitute roughly the same share as other spam, about 15% of the total, even though COVID-19 email phishing appeared to account for a higher percentage than all other phishing in the United States. The rate kept increasing: it rose to 18% by March 18, 2020 and to 22% by March 20, 2020 (Fig. 2).
Fig. 2. Percentage of total spam rejection containing COVID [3]
Where do all those email phishing clicks lead to? According to Mimecast, there was a 234% increase in the daily registrations of new COVID-19-related web domains and subdomains from March 9 to March 20, 2020; that is, more than 6,100 per day. Although a relatively small number of the 60,000+ sites registered are legitimate, the vast majority are not (ibid). In March 2020, 21 cases of COVID-19 outbreak-related frauds, which involved an amount of more than £800,000, were reported in the United Kingdom [13]. These cases were reported by the National Fraud Intelligence Bureau (NFIB) of the United Kingdom; there was a specific mention of the sale of face masks and COVID-19 infection maps by cybercriminals using the bitcoin payment network (ibid). According to Trend Micro [26], there have been more than 900,000 threats across files, URLs, and emails. In the chart below (Fig. 3), the author lists the top ten countries where victims have accessed websites with references to COVID-19 (ibid).
Fig. 3. Countries with the highest number of victims that visited fraud URLs due to COVID-19 [19]
5 Information Privacy and COVID-19 Online Scams

With the advent of social networking sites and smartphones, the world is more connected than ever. Consequently, a huge amount of personal information is available online. This information can be harvested by cybercriminals and disclosed to third parties without the consent of the victim [11]. Indeed, Internet users are often not aware of such activities. Cybercriminals mostly look for sensitive personal information, but other types of data are also harvested. The most commonly targeted data types are as follows (ibid).
• Social Security number (SSN): a number used mainly by government agencies to record information such as earnings, tax payments, and benefits received; in the hands of cybercriminals it can cause serious damage to the user.
• Date of birth (DOB): the date of birth seems like basic information, but when it lands in the hands of unauthorized users such as cybercriminals, it can be used for identity theft and other offences.
• Addresses and contact numbers: contact information plays a vital role in transaction verification; in the wrong hands, it can completely compromise the victim's security information.
• Information on current and previous employment: criminals can use this information to gauge the victim's social and economic position.
• Financial account information: this includes all of a user's financial details, such as bank account numbers, and is a prime target for any cybercriminal.
• Mother's surname: many people use their mother's surname as a way of retrieving their passwords, or even as the password itself; the availability of such information makes the cybercriminal's work easy.
• Other personal information: email credentials such as passwords play a crucial part in protecting users' information and should not be disclosed at any cost.
6 How COVID-19 Impacts the Spread of Cybercrime

Groups engaged in unlawful practices such as money laundering and online deceit have expanded rapidly since the lockdown began, as criminals adapt their methods to COVID-19. Since the selling and buying of fake medical products is common, it is worth noting that the OECD estimated the trade in counterfeit pharmaceuticals at $4.4 billion in 2016, with India and China as the main source countries of the counterfeit products [5]. The counterfeit health products were headed mainly to Africa, the U.S., and Europe, while countries like Hong Kong and Singapore acted as transit economies. Fraud involving pharmaceutical products has increased mainly due to the COVID-19 pandemic. The rise in demand for personal protective products has allowed criminals to earn even more money in this fraudulent business (ibid).

In recent years, most businesses have made sure to update their security measures in the IT sector. Still, criminals have kept changing their tactics, moving to the social rather than the technical side [7]. Today criminals steal data by exploiting human vulnerabilities. This type of fraud has been increasing since the outbreak of the virus, with cybercriminals using humans as the weak link. Some organizations' security protocols have also been relaxed, giving criminals room to operate. The following examples resemble the types of fraud used before (ibid):

• Spyware: this malware is created to exfiltrate the victims' information, such as financial and medical data.
• Social engineering: this tactic is used when the attacker sends an email to the victim carrying malicious content crafted so that the victim gives back vital personal information. Since the pandemic began, there has been an increase in this form of attack. Such attacks include social, physical, technical, and socio-technical variants, and typically involve sending an email attachment about coronavirus to victims in order to harvest their personal information. Two types of methods define this kind of attack: computer-based and social-based social engineering attacks [1]. Social engineering attacks that involve humans are carried out with the purpose of drawing crucial information from the victims at hand [16].
• Ransomware: in this type of attack the attacker sends some form of code to the victim's computer that withholds their personal and vital information. The attackers usually demand a ransom to remove the code so that the victim can access the data again.
• Just as COVID-19 has impacted the whole world, so has cybercrime, and most cyberattacks are focused mainly on the countries hit hardest by the pandemic, because this increases the efficiency of the cybercrime (ibid).
7 Mechanisms of COVID-19 Online Scams

The Internet makes it possible for COVID-19 scammers to gather personal information they can use against unwary consumers. The information thus obtained is available online for those who are willing to pay the price. Accordingly, this section offers an overview of the most recent online COVID-19 scams.

7.1 Phishing Scams

Cybercriminals frequently use phishing to target victims [13]. Reports of email phishing campaigns that relied on "COVID-19" to lure victims surfaced in January 2020, almost immediately after the number of confirmed infections began to increase (ibid). As internationally acclaimed health organizations, the WHO and the CDC have been the main targets of impersonation by offenders. These organizations and their officials enjoy high perceptions of authority and authenticity among the general public. They have therefore been impersonated by cybercriminals to lure victims with fraudulent website links or file downloads related to the outbreak (ibid). Due to the huge amount of information circulating each day, companies and individuals cannot easily detect phishing emails and fraudulent messages [19]. Some offenders promised not to attack health organizations, but others have continued the illegal practice, which has affected many organizations [22].

The most popular topic of discussion among cybercriminals on online forums has been COVID-19. In February 2020 a hacker put COVID-19-related phishing software up for sale on an online forum. The software was claimed to be a malware email attachment disguised as a real-time map of the outbreak based on the latest information from the WHO (ibid); the offender offered the malware at $200 for a “private
build,” and the version with a Java code signing certificate incurred an additional cost of $500 (ibid). An additional method described by Sophos involved the impersonation of email correspondence from the WHO. The email contained a link directing the victims to a malicious domain through which personal information was harvested. This email contained several formatting errors, which cybercriminals can use to bypass spam filters and more specifically target their victims (ibid).

7.2 Online Fraud and Counterfeit Goods

The rise of the pandemic has given cybercriminals ample room to take advantage of the market demand for health products by selling fake and counterfeit products [23]. The European Anti-Fraud Office (OLAF) has opened an investigation in collaboration with customs administrations to prevent the illegal importation of counterfeit health or COVID-19 products and items entering the European Union (ibid). Since the epidemic struck the globe, OLAF has been working hard to prevent the illegal importation of fake products made in the name of combating the COVID-19 pandemic, such as PPE, sanitizers, and other health products. In April 2020, the authorities were able to capture a 39-year-old man suspected of laundering money by posing online as a clinical firm that would supply FFP2 surgical masks and hand sanitizers [23]. The French police informed Europol that the fraudsters were presumed to have cheated the complainant company of €6.64 million. After the pharmaceutical firm wired the funds to a Singapore bank, the supplier became uncontactable and the items were never shipped. On March 25, 2020 the authorities in Singapore opened an investigation into the case, successfully blocking a portion of the payment and identifying and capturing the culprit.

7.3 COVID-19 Misinformation

Cyberspace is filled with a large amount of incorrect information on the nature, extent, and severity of the outbreak. Social media and private messaging platforms are the main sources of the rapid and wide-scale transmission of misinformation on COVID-19. Although the spread of misinformation on online platforms does not yield tangible financial benefits, its adverse impact on the social and personal space can be severe. Such misinformation can increase the levels of anxiety, stress, and xenophobia in the community, as well as at the personal level, thereby putting additional stress on the available medical and community support systems [13]. The WHO has recognized the harmful impact of the proliferation of COVID-19-related misinformation in the online space and has responded by putting dedicated efforts into countering its spread, labeling such activity an "infodemic" (ibid). Social media companies are also working on reducing the spread of misinformation on COVID-19. Labeling posts as illegal, imposing forwarding restrictions, and hiring organizations to check such online activities are some of the major steps in that direction (ibid). The search engine giant Google is also putting in efforts to alert users to false information on COVID-19. The "SOS alert" feature of Google is aimed at making authentic information accessible during times of crisis. The "SOS alert" on COVID-19
provides authentic, vetted, and useful information and links, which include the official information from the WHO (ibid).

7.4 Malicious Applications

Malicious software (malware) is software created and transmitted to alter, delete, or spy on information from the user without his or her consent. Malware can spread from one user's computer system to another via an Internet connection or an external drive such as a USB flash disk; examples of malware include Trojans, adware, bots, viruses, worms, ransomware, rootkits, and backdoors [6]. In this context, a Trojan horse is considered software that hides inside seemingly legitimate software [11]. A Trojan horse mainly arrives as a popular game, a downloadable file attachment, or even a downloadable pop-up window on the user's screen; when the user downloads the Trojan, it automatically installs itself and runs every time the user starts the computer (ibid). Getting someone to install a Trojan horse is not that simple, and it requires the offender to have a basic knowledge of social engineering. The Trojan is very harmful, and to succeed offenders exploit the victim's weaknesses in order to reach the computer's weaknesses (ibid). The offender has to convince the victims to install the malware by enticing them and showing them things that are interesting. Once the victim has installed it, the Trojan is designed to overcome any firewall or security defenses set up for the computer. It can then proceed to its programmed purposes, such as automatically uploading the user's files, viewing passwords, and viewing the victim's screen in real time. Trojan horses differ from worms in that they do not automatically duplicate themselves or spread to another user's computer without the user's intervention (ibid). Some of the most common actions a Trojan horse is made to perform involve disabling the hard disk and overwriting the master boot programs. Examples of such malware include ransomware, which has been attacking healthcare systems and has been on the rise since the beginning of the pandemic [21].

7.5 Vishing

Between February and May 2020, there was an increase in the number of attacks on financial institutions, which were targeted mainly because of the pandemic. This raised questions and led to the implementation of preventive measures that could reduce or curb the ransomware. The attackers have been targeting large financial institutions as well as small businesses and companies adapting to remote business models, and their techniques have been improving since then [20]. In most cases, the attackers send emails to customers indicating that certain payments have not been received and asking them to contact the bank's customer care service about the issue; the emails always contain a VoIP number for the customer to call, and in most cases the fraudster convinces the customer that he or she will suffer great losses if he or she does not comply with the assistance offered. Most of the attackers who call customers in this way pretend to be working at financial institutions. This is the newest form of scamming, known as reverse phishing; it works by using social media
posts and emails to bait the customer or victim into calling the fraudster without realizing it (ibid).

7.6 Ransomware

This form of attack is not new, considering that it began some 30 years ago. In this attack the attacker blocks the victims' data and demands a ransom in return for giving back access to the data or resources; without the ransom, the victims cannot retrieve their information [17]. Currently, ransomware invades computer systems in three ways. The first is crypto-ransomware, in which the attacker makes the user's data completely useless by encrypting it with cryptographic algorithms. The second is locker ransomware, in which the attacker completely blocks users from accessing their systems. Combining these two methods, crypto and locker ransomware, makes the attacker even more harmful. The attackers also use HTML links to trap victims into running malicious software that demands private keys to unlock; the victim has to pay for the key using bitcoins, which is a common way for attackers to get away with the crime because they cannot be traced by the authorities (ibid).
8 Global Harmonization Quest: Legislative Approaches and Regulatory Strategies

It is essential to understand the scope of the problem, as well as its risks and consequences, in order to devise a remedy. Following on from our detailed account of different aspects of COVID-19-related online scams in the previous sections, we will now focus on some possible legal solutions. In this section we shall focus on three forms of legal solutions: 8.1 National and Regional Strategies: The European Approach, 8.2 Prosecuting Identity Theft under Federal Criminal Laws, and 8.3 International Strategies: The Council of Europe Convention on Cybercrime.

8.1 National and Regional Strategies: The European Approach

Since 25 May 2018, European privacy legislation has changed fundamentally: the European Data Protection Regulation (EU) 2016/679 (GDPR) came into force and harmonized data protection law across the EU [29]. Under Section 6 of the GDPR, all European member states are required to designate data protection public authorities responsible for monitoring the application of the GDPR. These data protection authorities enforce the law through their power to conduct investigations (ibid). According to Article 33, the data processor is required to inform the data controller of any breach immediately. As per Article 34, any data breach should be reported within 72 hours of the incident, and in the case of a high-risk breach, the affected users should be informed individually. A data subject notification is not needed if the data controller has implemented suitable organizational and technical safeguards to guarantee that the data
is not accessible to unauthorized persons, for example by using data encryption techniques (Article 34). Article 97(1) of the DS-GVO provides for an evaluation of the GDPR after 25 May 2020 at four-year intervals (free-group.eu, 2020). In view of the rapid technical development in the field of data processing, it appears necessary to shorten this evaluation interval to two years (ibid). Even if the legal framework is designed to be technologically neutral, it must react to technical developments as quickly as possible, otherwise it will fast become obsolete (ibid).

8.2 Prosecution of Identity Theft Under the Federal Criminal Laws: American Approach

Certain authorities and bodies identify frauds and enforce punishments against those who commit them; the authorities also make sure that victims of fraud are properly compensated and that their credit histories are reviewed. On October 30, 1998, the Identity Theft Statute, 18 U.S.C. § 1028(a)(7), was adopted in the USA. The main purpose of this Act was to ensure that identity theft was addressed and dealt with appropriately in matters relating to the illegal possession of unauthorized information. According to Section 1028(a)(7), a person commits an offence if he or she deliberately uses or transfers, without the mandate to do so, a means of identification of another individual to commit, aid, or abet any illegal undertaking that involves violating federal law or any other applicable state or local law. The Act updated the Section 1028(b) penalty provision to include offenses covered by the new Section 1028(a)(7) and to provide more stringent penalties for identity thefts involving substantial property. Section 1028(b)(1)(D) makes clear that a person whose offense involved the transfer of $1,000 or more within one year is subject to a maximum of 15 years' imprisonment. Section 1028(b)(2) then accommodates a maximum sentence of three years in prison. In addition, the Act amended § 1028(b)(3) to provide that offenders whose identity theft was connected to a violent crime, participation in the business of drug trafficking, or a prior conviction are subject to imprisonment of up to 20 years.

"Means of identification," as described in Section 1028(a)(7), includes any number or name that may be used, in combination with other data or alone, to identify a specific individual. It covers explicit items, such as date of birth, Social Security number, officially issued driver's license and other numbers; unique biometric data, such as voiceprint, fingerprints, iris or retina image, or some other unique physical representation; and telecommunication identifying information or access devices.

The Act directed the US Sentencing Commission to review and amend the Sentencing Guidelines to determine the appropriate penalties for each felony under Section 1028. In response to this mandate, the Commission added U.S.S.G. 1(b)(5), which provides the following: (5) If the offense in question involved (a) the use or possession of any device-making equipment; (i) the unlawful use or transfer of any means of identification to obtain or produce any other means of identification; or (ii) the possession of [five] or more means of identification that were unlawfully produced from, or obtained by the use of, other means of identification, increase by 2 levels. If the resulting offense level is less than level 12, increase to level 12.
These guidelines treat identity theft as serious regardless of whether explicit monetary thresholds have been met; in ordinary fraud offenses, the loss would have to exceed $70,000.00 for the offense level to reach level 12. The Sentencing Commission decided on a base offense level of 12 for identity theft because the monetary impact of identity fraud is difficult to assess and, regardless of the loss, the wrongdoer remains culpable. Identity fraud offenses will almost always receive a two-level increase because they often involve more than minimal planning. Where numerous suspects are involved, identity fraud offenses can bring about two- to four-level aggravating role adjustments. As noted previously, identity fraud is frequently used to facilitate the commission of other offenses, even though it is regularly the criminal's central focus. Schemes to commit identity theft may involve a number of other statutes including identification fraud (18 U.S.C. § 1028(a)(1)–(6)), credit card fraud (18 U.S.C. § 1029), computer fraud (18 U.S.C. § 1030), mail fraud (18 U.S.C. § 1341), wire fraud (18 U.S.C. § 1343), financial institution fraud (18 U.S.C. § 1344), mail theft (18 U.S.C. § 1708), and immigration document fraud (18 U.S.C. § 1546). For example, where a stolen identity is used to fraudulently obtain credit online, the computer fraud may be compounded by the theft of identity information. When a suspect gains illegal access to another website or device to extract identification information, computer intrusion may be the primary method.

8.3 Strategies Used Internationally: The Council of Europe Convention on Cybercrime

Reforms and solutions have been implemented in response to the growth of global cybercrime since the 1980s, the growth of the global online economy and information flows, and the increasing exposure of ICTs to computer-related economic crimes. Most of the countries in and around the EU have tried their best to eliminate ongoing computer-related crimes and fraud by enacting new laws and revising their current laws to keep pace with the modern world [11]. In this context, it is worth noting that the Council of Europe's Budapest Convention on Cybercrime is a first step towards global cooperation on criminal and penal issues. It aims to deter activities directed against the confidentiality, integrity, and availability of computer systems and data, as well as the misuse of ICTs, by providing for the criminalization of such acts as outlined in the Convention. Countries within the E.U., as well as other signatories, have sought to create regulatory mechanisms to combat cybercrimes by assisting in the effective investigation, identification, and prosecution of such criminal acts at both the global and domestic levels (ibid).

The Convention is divided into four main sections, each consisting of several articles. The first section of the Convention outlines the specific criminal law measures that all ratifying countries must put in place to fight cybercrimes. The second section sets out the procedural requirements and details that every signatory country should follow. The third section lays out detailed guidelines for international cooperation, which often consists of collaborative investigations of the
cybercrimes described in section one. Finally, the fourth section covers provisions on the territorial application of the Convention, its signature, withdrawals, and amendments, and the federal clause.

To combat COVID-19 online frauds and defend people's information and data on the Internet, an appropriate regulatory framework for combating online fraud is essential for addressing new types of computer-related and financial frauds such as hacking and computer espionage. In reaction to recent hacking incidents, most European countries have sought to develop and implement new rules for protecting Internet users' personal information from retrieval by unauthorized users, forming a "formal sphere of secrecy" that includes making it illegal to view information without permission or the user's consent. As a result, in several European countries, such as Finland, France, and Denmark, rules governing access to communication systems, data processing, and wiretapping have been adopted. Having attended to a number of regulatory arrangements and activities intended to combat coronavirus scam sites and misrepresentations in Europe and to protect personal details, we will now shed some light on a few technical methods and safeguards that can limit the dangers posed by coronavirus online scams and serve as preventive measures against the attacks happening online.
9 COVID-19 Online Scams Prevention

With the increasing sophistication of online impersonators, anticipating and avoiding scams is now more important than ever. The four approaches mentioned below will help prevent Internet users from being defrauded [28]:

9.1 Financial and Personal Information Shouldn't Be Shared Online

Never give away your personal information, especially financial details, to anyone you do not trust. Never part with confidential or monetary data to a stranger or a website you do not recognize. To avoid being targeted by digital fraudsters, always keep your banking and credit card details secret.

9.2 Constantly Use Two-Factor Authentication

Two-factor authentication, also known as 2FA, adds an extra layer of security to the user's sign-in process. Rather than merely requesting the username and password, it asks for facial recognition or a fingerprint, or requests a code sent to the user's email or mobile number; this essentially prevents fraudsters from accessing financial records.
9.3 Avoid Downloading Files and Clicking on Links from Unknown Sites

Hover the cursor over the link to check the full URL and ensure that every last letter of the link corresponds to a recognized authority's site, such as the WHO or the CDC. Avoid opening unknown email attachments, because cyber offenders could have added malware such as ransomware or spyware designed to deceive you. The majority of bank and government employees will not send you a link to verify your records.

9.4 Never Wire Money

Never consent to wire money using services like Western Union or MoneyGram. With money wiring services, you will not get your money back once the transfer has been made. You are encouraged to use protected and effective payment applications like PayPal or Google Wallet.
10 Conclusion

This article has explored the various risks associated with COVID-19 online scams. However, we should remember that there is a strong relationship between the evolution and development of Internet-related applications, the evolution of ICTs, and the intensification of online scams. Technology is a defining characteristic of humankind's quest for improvement and mastery; however, from time to time, technological advances have also been exploited by criminals to devise newer forms and methods of crime. The use of ICTs by criminals has significantly changed the traditional and classical dynamics of criminal behavior. Indeed, ICTs have opened new avenues of opportunity for cybercriminals. A recent case in point is the dangerous rise of COVID-19 online scams.

In this article the author has elucidated the fundamental issues related to the current socioeconomic impact of the COVID-19 outbreak and the associated factors that have facilitated online scams based on the COVID-19 outbreak. The author has also discussed some of the potential solutions to mitigate the risk of such scams. The author has structured this article in two parts. The first addressed the COVID-19 online scams by analyzing the problem of these scams, the factors facilitating them, the scope of the problem, the most frequent types of data used in personal information harvesting, and the diverse mechanisms used by cybercriminals in committing their scams (e.g., phishing and other malicious applications). An analysis of the existing legislative and regulatory framework and its efficiency in combating this form of organized crime has been provided, taking the European Union and the USA as a case study. The second part of the article considered the potential solutions to the COVID-19 online scams. The general intention of this article has been to offer a modest contribution to the field of cybersecurity and cybercrime and thereby assist in the battle against COVID-19 online scams.
Acknowledgments. Not applicable.
Funding. Not applicable.
Availability of Data and Materials. The data and materials used and/or analysed during this current review are available from the author on reasonable request.
References
1. Alzahrani, A.: Coronavirus social engineering attacks: issues and recommendations. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 11(5) (2020)
2. Anscombe, T.: Beware Scams Exploiting Coronavirus Fears (2020). www.welivesecurity.com. Accessed 5 Dec 2020
3. Arsene, L.: Coronavirus Phishing Scams Exploit Misinformation (2020). www.hotforsecurity.bitdefender.com. Accessed 2 Dec 2020
4. Azzara, M.: Coronavirus Phishing Attacks Speed up Globally (2020). www.mimecast.com. Accessed 4 Apr 2020
5. Basquill, J.: Fraud, money laundering and cybercrime: how Covid-19 has changed the threat to banks (2020). www.gtreview.com. Accessed 9 Sept 2020
6. Bhuyan, S.S., et al.: Transforming healthcare cybersecurity from reactive to proactive: current status and future recommendations. J. Med. Syst. 44(5), 1–9 (2020). https://doi.org/10.1007/s10916-019-1507-y
7. Blanco, A.: The impact of COVID-19 on the spread of cybercrime. www.bbva.com. Accessed 16 Sept 2020
8. Bogdandy, A., Villarreal, P.: International Law on Pandemic Response: The First Stocktaking in Light of the Coronavirus Crisis, p. 1. Max Planck Institute for Comparative Public Law and International Law (MPIL) research paper no. 2020–07 (2020)
9. Circulus, L.: Hackers use fake WHO emails to exploit coronavirus fears (2020). www.politico.eu. Accessed 1 Apr 2020
10. Chawki, M.: Nigeria tackles advance fee fraud. JILT (1) (2009)
11. Chawki, M., Wahab, M.: Identity theft in cyberspace: issues and solutions. Lex Electronica 11(1) (2006)
12. Grimes, R.: Why It's So Hard to Prosecute Cybercriminals (2016). www.csoonline.com. Accessed 2 Apr 2020
13. Guirakhoo, A.: How Cybercriminals Are Taking Advantage of COVID-19: Scams, Fraud, and Misinformation. www.digitalshadows.com. Accessed 2 Apr 2020
14. Lallie, H., et al.: Cyber Security in the Age of COVID-19: A Timeline and Analysis of Cyber-Crime and Cyber-Attacks during the Pandemic (2020). www.researchgate.net. Accessed 6 Dec 2020
15. Mellone, P.: Coronavirus and Phishing: All You Need to Know (2020). www.increasily.com. Accessed 2 Apr 2020
16. Naidoo, R.: A multi-level influence model of COVID-19 themed cybercrime. Eur. J. Inf. Syst. 29(3), 306–321 (2020). https://doi.org/10.1080/0960085X.2020.1771222
17. Rehman, H., Yafi, E., Nazir, M., Mustafa, K.: Security assurance against cybercrime ransomware. In: Vasant, P., Zelinka, I., Weber, G.-W. (eds.) ICO 2018. AISC, vol. 866, pp. 21–34. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-00979-3_3
18. Rosenbaum, E.: Phishing scams: Spam spikes as hackers use coronavirus to prey on remote workers, stressed IT systems (2020). www.cnbc.com. Accessed 1 Apr 2020
19. Singh, C., Meenu: Phishing website detection based on machine learning: a survey. In: 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 398–404 (2020)
20. Stevens, G.: Popular Techniques used by cybercriminals amid COVID-19 (2020). www.carbonblack.com. Accessed 12 Dec 2020
21. Weil, T., Murugesan, S.: IT risk and resilience—cybersecurity response to COVID-19. IT Prof. 22(3), 4–10 (2020). https://doi.org/10.1109/MITP.2020.2988330
22. Wirth, A.: Cyberinsights: COVID-19 and what it means for cybersecurity. Biomed. Instrum. Technol. 54(3), 216–219 (2020)
23. www.europol.europa.eu. Accessed 9 Apr 2020
24. www.ifcc.org. Accessed 10 Apr 2020
25. www.securitymagazine.com. Accessed 2 Apr 2020
26. www.trendmicro.com. Accessed 13 Apr 2020
27. www.worldmeters.info. Accessed 10 Apr 2020
28. www.freshbooks.com. Accessed 9 May 2020
29. www.privacy-europe.com. Accessed 2 June 2020
Beware of Unknown Areas to Notify Adversaries: Detecting Dynamic Binary Instrumentation Runtimes with Low-Level Memory Scanning

Federico Palmaro1(B) and Luisa Franchina2

1 Prisma, Rome, Italy
[email protected]
2 Hermes Bay, Rome, Italy
Abstract. Dynamic Binary Instrumentation (DBI) systems are being used more and more widely in different research fields, thanks to the advantages they offer together with their ease of use. The possibility of monitoring and modifying the behavior of generic software during its execution, by writing a few lines of analysis code, is a great advantage, especially if combined with the ability to operate without the target subject being aware of it, that is, in a transparent manner. This last peculiarity is the target of this work: we investigate how DBI systems may try to hide discernible artifacts from the software being analyzed by hiding in the shadows. We present a memory scanning technique that searches for the trail the DBI engine leaves in the program address space.

Keywords: Dynamic Binary Instrumentation · Dynamic analysis · DynamoRIO · Malware · Reverse engineering · Software security · Memory spatial safety
1 Introduction
Altering a program shipped in binary form to observe it or make it behave differently from what is expected during execution is a very important topic for the cybersecurity community. Generally there are two ways to do this type of operation: either by editing the code statically, that is, before the software is executed, in order to make the changes permanent for every execution; or by editing the code dynamically, i.e. during its execution, thus monitoring the actions that the target software would like to perform and then deciding how to modify them to control the behavior of the program. Regarding these types of modifications, there are many analyses and studies in the literature that illustrate the advantages and disadvantages of the different techniques, showing how the first, carried out through binary rewriting, is useful if you want an executable to have a certain behavior at each execution, without the use of additional software, as if it had already been compiled to act
in that particular way. On the contrary, the second is based precisely on the use of additional tools that allow the binary to be manipulated without touching the actual binary code, capturing the actions that it performs in order to provide manipulated information as an output. This second setting, the dynamic modification of the software, has the great advantage of being easier for those who have to make the modification, avoiding a heavy waste of time on complex analysis of the source or other shortcomings when dealing with complex code. For this reason, in this article we will focus on this technique and in particular on the dynamic analysis paradigm called Dynamic Binary Instrumentation (DBI) [5,12,14,20,36,38].

This special technique, embodied in DBI runtime engines, allows one to observe the behavior of an executable while it is running in the system, and gives the possibility to intercept the system calls it makes and to follow its control flow up to the granularity of the single assembly instruction. In this way it is possible to operate with full control over the operations that the target software carries out and in turn to modify them, i.e. add new ones or delete them. The peculiarity that makes these characteristics truly attractive, however, is the possibility of carrying them out in a completely transparent manner with respect to the program in execution, by means of special stratagems used by those who develop the DBI engines, which try to hide their existence by modifying some characteristics (e.g. outputs) of the functions called and the visible memory layout of the target process.

The combination of these techniques leads to the creation of products that are well structured and difficult for a piece of software to detect with basic checks and known system calls. In this article, however, we will see how the traces left by a popular DBI engine can be tracked down with an accurate and specific memory scanning technique, ultimately in order to identify its presence during analysis. In order to work, these tools in fact need to be forcibly integrated into the memory of the target process, creating spaces reserved for them in which to place important data and code to be used during monitoring and manipulation of the execution, assisted by runtime components shipped as Dynamic Linked Libraries (DLLs). Obviously these "intermediary" areas between the target process and the DBI software should be kept hidden from the process, which by trying to do a memory analysis would only see such sections as not accessible or even as free/junk memory. However, these camouflage techniques, as we will see in this paper, are not robust against targeted attacks that are able to move within the memory easily and read its characteristics by analyzing individual bytes, without relying on system or library calls whose output can be faked by DBI engines. To show how it is possible to find these artifacts within the memory, we will carry out specific tests for the detection of the noise introduced by the use of this technique, proposing a possible implementation of the analysis and also studying possible further directions of development through an analysis of the related literature on the subject of this paper.
Contributions. The need to disguise its own presence is a fundamental feature to demand from these tools if one wants to use them in the field of computer security, and more specifically in the analysis of malware, since nowadays malware strains are increasingly careful about the modern analysis techniques that insiders use to try to find and analyze them. For this reason this paper tries to help the analyst understand the limitations in the use of Dynamic Binary Instrumentation technology, and in particular of the DynamoRIO software, which is very popular among the DBI systems proposed in the literature today and available to the general audience. A thorough understanding of how the software you want to use for analysis works is essential if we want to keep up with the latest malware and understand how to deal with it. The memory analysis proposed here will therefore shed light on important elements that would otherwise remain in the shadows and could only be used for malicious purposes, deceiving the analyst, who would in that case find themselves disoriented by seeing their DBI-based tools fail.
2 Preliminaries
2.1 Dynamic Binary Instrumentation
Supporting introspective attempts and manipulating the state of a program while it is running is the key feature of a DBI system, which provides the target software with the illusion that it is running on a native system, that is, without special features or monitoring by external programs. To guarantee such transparency it is necessary for the DBI engine to show the target process the values of a real execution, such as the values of CPU registers and the correct memory addresses and their corresponding hosted values. In order to put all these specifications into practice, the DBI engine needs to integrate itself in a concrete way with the target software, by inserting specific runtime components within the memory address space of the process under analysis. Based on this type of integration we can define a DBI system implementation as a sort of Virtual Machine that, instead of emulating an entire system, emulates a single process, interpreting the entire Instruction Set Architecture (ISA) [18,19] specific to the platform for which that program was compiled, thus supporting all the features that the paradigm described here should offer. The set of these specifications implemented within the DBI system therefore allows the user analysis code (written to monitor and manipulate program data) to interact with the data structures and variables of the target program, having complete maneuverability in its address space. In addition to the ability to inspect the memory of a target process in execution, the DBI engine also offers the ability to monitor the actions performed during execution, from the single assembly instruction to be executed up to the use of specific library functions or system calls [36]. After this brief introduction to the specifications that a DBI engine should offer, let us go into more detail of what this paradigm is able to formalize. The ability to manipulate the architectural state of a process while such process is
running is the result of a complex chain of processes that allow users to take the binary code that is about to be executed and combine it with pieces of code written by them in the form of a DBI tool, a step necessary in order to produce an output that integrates the two parts. Inspecting the CPU registers, the memory status, and the control transfers is therefore the result of this complex chain of events, which we try to simplify here in order to provide a clear and concise explanation. Before the target program executes any action, its instructions are intercepted by the DBI engine, which analyzes the intercepted sequence and compares it with the monitoring tool written for the DBI. If this tool specifies that the captured action is to be manipulated, it combines the original instruction sequence with the user-designed analysis code and feeds it to a just-in-time compiler (JIT for short). The arrival on the scene of the JIT element is essential, as it is the component that both merges the above-mentioned pieces of code and recompiles the output code so that it can execute directly on the underlying machine. This process is made possible thanks to an instruction fetcher component which reads the instructions that are about to be loaded on the CPU, notifying them to the DBI engine. Returning to the JIT component, the unit it works on is the trace, that is, a sequence of instructions which ends either with an unconditional transfer or with a predefined number of conditional branches. Once the JIT does its job and forms a compiled trace, this is inserted into another fundamental component, namely the Code Cache, which always resides in the same address space as the original code. Once the trace has been transferred and stored in the cache, a controller component takes care of transfers between cached traces ready for execution and new traces that have just been compiled. In the end the original program executes as a series of instructions belonging to compiled traces from the Code Cache, yet the generated traces massage data values so that the program perceives itself as executing from its original code section(s) [20]. This whole process is synchronized with analysis code written by the user to correctly manipulate the target software. This is where the primitives offered by the DBI engine come into play, which allow users to write code in a much less rigorous and constrained format than patching the target source code directly. The engine then takes care of the transfer from the original code to the instrumented code (which is compiled as well with the JIT) and back. In detail, the capabilities offered by DBI engines comprise the insertion of hooks for the following aspects:
– instruction from the target program (by type or address)
– call to library function
– system call
– dynamic loading and unloading of code (e.g. libraries)
– exceptional control flow
We would like to remark that these monitoring features are not the sole capabilities offered by DBI. The engine can support the editing of values for some elements at runtime, such as specific CPU registers, the parameters of an invoked monitored function, and even the desired system call number. A feature not to be underestimated is also the possibility of connecting external monitoring
and analysis software to the engine such as a debugger [35], which greatly expands what can be done with this type of software by hiding the noise that standard debuggers would introduce into the system.
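To make the hook capabilities listed above more concrete, the following is a minimal, illustrative sketch of a DynamoRIO client written in C; it is not taken from the article and assumes only the standard client API shipped with DynamoRIO (dr_client_main, dr_register_bb_event, dr_register_exit_event). It registers a callback on every basic block the engine is about to place in its code cache and simply counts them.

```c
#include "dr_api.h"

static uint64 bb_count; /* basic blocks delivered by the engine */

/* Invoked for every basic block before it enters the code cache; a real
 * analysis tool would inspect or rewrite the instruction list in `bb`
 * here, while this sketch only counts the blocks it sees. */
static dr_emit_flags_t
event_bb(void *drcontext, void *tag, instrlist_t *bb,
         bool for_trace, bool translating)
{
    if (!translating)
        bb_count++;
    return DR_EMIT_DEFAULT;
}

static void
event_exit(void)
{
    dr_fprintf(STDERR, "basic blocks seen: %llu\n",
               (unsigned long long)bb_count);
}

/* Entry point called by drrun when the client library is loaded. */
DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_bb_event(event_bb);
    dr_register_exit_event(event_exit);
}
```

An "empty" client such as the one used later with drrun would simply omit these callbacks; the point here is only to show how little code is needed for the engine to interpose on execution.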
2.2 Artifacts from the Presence of the DBI Engine
All these features described in the previous section obviously build on complex concrete implementations that strive to ensure correct execution with any type of target software, ranging from COTS programs to custom niche software. The main problem for security settings is that the introduction of subtle artifacts or execution discrepancies for rare code patterns is unavoidable. Various artifacts are necessary for the correct execution of the software; however, if not well hidden, a malicious program can detect the presence of the DBI engine in execution. These implementation flaws can be found on several fronts, as detailed in a recent work [20]. Here we will describe the main ones that are more relevant for our purposes and can help us better understand the possible attack surfaces.

Leaked Code Pointers. To allow the manipulated software to run, the DBI engine must be able to run the program from memory addresses other than those that would be used in a native execution, placing and executing the code within its own code cache. However, this requires providing fictitious addresses to instructions and patterns that reveal the instruction pointer to the program, so that it believes it is executing from the real intended addresses instead of from the code cache. In this case, the problem common to many DBI implementations is maintaining consistency with the instruction pointer EIP when x87 FPU instructions are used, i.e. those that manage floating-point operations. For this reason, if we use an FPU instruction to perform any operation and then call an instruction like fstenv or fsave, we will find that the output data structure provides an address to the program that is the real one the DBI uses for internal functions (hence an address belonging to the code cache) instead of the expected dummy one from the original code regions of the program [29].

Memory Content and Permissions. The mixture of the DBI engine runtime components with normal process memory is a mechanism that leads inexorably to the insertion of spurious allocated regions within the address space of the target process, which the process may then observe with a memory scan. We will address this aspect in detail in the next sections, using it to reveal the presence of a popular engine. Researchers also observed that some memory protection flags (e.g. the Non-eXecutable policy for writable pages, adopted by many applications and systems to hinder code injection attacks from buffer overflows and similar exploits) may be erroneously discarded by the DBI engine, which could lead to an inconsistency during execution compared to a native run.
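As an illustration of the leaked-code-pointer artifact described above, the following 32-bit C sketch (GCC/MinGW inline assembly) executes an x87 instruction and then reads back the instruction pointer that fnstenv reports; the distance threshold used to flag an anomaly is an arbitrary choice of ours, not a value from the article.

```c
#include <stdio.h>
#include <stdint.h>

/* Runs one x87 instruction and returns the address the FPU recorded for
 * it: in the 28-byte protected-mode environment written by fnstenv, the
 * FPU instruction pointer is stored at offset 12. */
static uint32_t last_fpu_instruction_address(void)
{
    unsigned char env[28];
    __asm__ __volatile__(
        "fldz\n\t"       /* any FPU instruction updates the FPU IP */
        "fnstenv %0\n\t" /* dump the FPU environment to memory     */
        : "=m"(env[0])
        :
        : "memory");
    return *(uint32_t *)(env + 12);
}

int main(void)
{
    uint32_t fpu_ip = last_fpu_instruction_address();
    uint32_t here   = (uint32_t)(uintptr_t)&last_fpu_instruction_address;
    uint32_t delta  = fpu_ip > here ? fpu_ip - here : here - fpu_ip;

    printf("FPU IP: 0x%08x, function at: 0x%08x, delta: 0x%x\n",
           fpu_ip, here, delta);

    /* Natively the recorded address lies inside this small function;
     * under a DBI engine it typically points into the code cache, far
     * from the original .text. 0x1000 is an illustrative bound only. */
    if (delta > 0x1000)
        printf("possible DBI code cache detected\n");
    else
        printf("execution looks native\n");
    return 0;
}
```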
3 Analysis
3.1 Hiding Artifacts
Maintaining consistency with what is expected in a native execution may sound like a daunting task for DBI engine developers. For this reason, in their implementations developers often have to put in place patches to successfully achieve the invisibility required for this kind of operation. We first present an overview of the criteria popular systems try to follow to hide their implementations from prying eyes. We then analyze one of these aspects (which can be seen as attack surfaces from the perspective of threat actors) in detail, and propose an analysis technique aimed at dismantling artifact-cloaking countermeasures and exposing as much information as possible about the underlying DBI system.
3.2 DBI Properties as Attack Surfaces
In order to function properly, a DBI system needs to enforce some transparency properties that have been addressed in the scientific literature and essentially follow two directions. The first, which we can define as the most important feature for this type of software, is that a DBI engine must guarantee the target application a clean and transparent execution, as if it were natively executed in the system and capable of interacting normally with the other applications present. In a seminal work Bruening et al. [12] described three principles to follow when implementing a DBI system with execution transparency in mind:

– leave the code unmodified whenever possible;
– in case any modification is necessary, make sure that it is imperceptible to the application;
– avoid the use of conflicting resources.

We observe that this kind of requirement is certainly very tight, sometimes so much so that the developers of DBI systems have to turn to solutions that do not respect these guidelines in full, since full compliance would cause too high an overhead in the general case (compared to the proper handling of rare corner cases) or in some cases even be unattainable. From a security perspective, the second fundamental property that a dynamic analysis system (hence DBI but also alternative technologies) must follow is that of transparency towards an opponent that implants pieces of code within the target program aimed at detecting the presence of the monitoring systems and analyzing them. For this purpose Garfinkel and Rosenblum [32] have identified three specific requirements to avoid the adversarial introspective attempts described above: isolation, inspection and interposition. The first requires that the code under analysis should not be able to read or modify the code of the analyzer, the second says that the analysis system must have access to all the states of the process under analysis, and the last one bases its foundations on the use of privileged instructions, hooks, and all those types of operations that the analyzer needs in order to interpose between the analyzed target program and the operating system.
Fig. 1. High-level view of memory page and region analysis scheme.
3.3 Time to Break
We now enter the main part of the article, where we will focus on one feature among those described above and try to understand how it is implemented and whether it is vulnerable. Specifically, in the translation from theory to practice of Sect. 3.2, the need to implement a generic solution may not contemplate the particular cases that an attacker could create and exploit in the system. Our focus will be the isolation property, i.e. the one for which the software under analysis must not be able to read or alter the analyzer's data, which would otherwise represent a very important attack surface left uncovered. The article will focus on the DynamoRIO engine [12], one of the most famous and performant DBI systems to date. The technique is general and applies also to other engines that may try to hide regions like DynamoRIO does. A process memory analysis technique will be developed for scanning the sections using a standard operating system API call, followed by an accurate inspection by means of memory analysis without resorting to the API, in order to verify whether there has been some manipulation of the API results by the analyzer or engine. As depicted in Fig. 1, we will reveal the hidden presence of a DLL used by the runtime of DynamoRIO.
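The first step of this plan, the API-based scan, can be sketched in a few lines of C that walk the address space with VirtualQuery; the decision to highlight private, no-access regions anticipates the observation discussed in Sect. 4, and the code is our own illustrative rendering rather than the exact evader program used in the experiments.

```c
#include <stdio.h>
#include <windows.h>

/* Walks the user-mode address space of the current process and prints
 * every committed region with its type and protection; private regions
 * marked as not accessible are flagged for a closer, API-free look. */
static void scan_address_space(void)
{
    MEMORY_BASIC_INFORMATION mbi;
    unsigned char *addr = NULL;

    while (VirtualQuery(addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        if (mbi.State == MEM_COMMIT) {
            int suspicious = (mbi.Type == MEM_PRIVATE) &&
                             (mbi.Protect & PAGE_NOACCESS);
            printf("%p - %p  type=0x%08lx prot=0x%08lx%s\n",
                   mbi.BaseAddress,
                   (void *)((unsigned char *)mbi.BaseAddress + mbi.RegionSize),
                   (unsigned long)mbi.Type, (unsigned long)mbi.Protect,
                   suspicious ? "  <-- private, no-access" : "");
        }
        addr = (unsigned char *)mbi.BaseAddress + mbi.RegionSize;
        if (addr == NULL) /* wrapped past the end of the address space */
            break;
    }
}

int main(void)
{
    scan_address_space();
    return 0;
}
```

Because the DBI engine can intercept VirtualQuery and massage its output, this API-based pass is only the starting point: the article pairs it with a byte-level inspection of the flagged regions that does not rely on any API call.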
4 Proposing a Detection Attack

4.1 Flow
We will now describe the technique that we designed for this article: we inspect the memory of a process from the inside and discover at a glance details that would have otherwise remained in the shadows.
Fig. 2. Introspective attempt with VirtualQuery on the address space to detect allocated memory regions: (a) memory scan for initial regions; (b) region spoofed by the DBI engine. The DBI engine is hiding the presence of one or more regions by marking the highlighted area as MEM_PRIVATE and PAGE_NOACCESS.
We write a simple evader program in C and compile it with the popular MinGW compiler for Windows targets. We execute it on a 32-bit Windows 7 SP1 machine¹ running inside VirtualBox 5.2 on an Intel i9-8950HK processor. We run our evader program both natively and under a recent DynamoRIO release (version 8.0.0, build 1). Specifically, we use bin32/drrun.exe to perform DBI execution with empty analysis, avoiding any risk of pollution from the effects of analysis code. First of all, we have to find the sections and memory regions that allegedly compose the address space of the process by examining the result of the API call VirtualQuery. This Windows API call outputs the details of the memory regions mapped inside the address space, showing the range of addresses where they are located and the permission flags they possess. As can be seen from Fig. 2a, after a complete scan of the address space in which we extracted and printed many details of the regions present, we noticed in Fig. 2b a mapped memory region with the following access permission flags: MEM_PRIVATE and PAGE_NOACCESS. In detail, the first flag indicates that the memory pages within the region are private (i.e., not shared by/with other processes). The second flag indicates that all access to the committed region of pages is disabled: an attempt to read from, write to, or execute from the committed region will result in an access violation. This region captured our attention as it was not present in the native execution, but only under DynamoRIO. Furthermore, this combination of permissions and rights is not so frequent in COTS code. The suspicion that this region hides something becomes more concrete if we use simple scanning tools that discover the memory mapping and its details externally, i.e., through a third process that queries the operating system about the target process. Such queries cannot be captured and faked by DBI engines.
¹ We remark that the technique is independent of the OS version.
Fig. 3. Information retrieved from the system for the process of Fig. 2: (a) memory scan for initial regions; (b) region spoofed by the DBI engine. The process executing under DBI is evader32.exe. We can see how the area highlighted in Fig. 2b is actually composed of different, heterogeneous regions; in more detail, it belongs to a DLL module from the DBI engine called dynamorio.dll.
As we can see from Fig. 3, the suspicious addresses identified by our tool correspond to memory committed with the Image type, which usually refers to memory hosting a file of type Dynamic-Link Library (DLL). This behavior is quite strange, as a memory region hosting this type of file should not have this kind of access permissions. At this point, again using a standard external inspection tool popular among malware analysts, we can list the DLLs loaded by the target process. In that list we observe one entry in particular called dynamorio.dll, a specific file of the runtime core of the DynamoRIO DBI engine. At this point in the analysis, the artifact that the creators of DynamoRIO wanted to keep hidden has been discovered. In order to complete this type of inspection without using the external tool, but directly from the internal code of the evader and without manual intervention, we are going to forcibly inspect the offending memory area in detail, looking also for the specific sections of the file mapped in memory and the functions it exports². As we can see from Fig. 4, by forcing the analysis of the memory region in question and dissecting the header of the DLL module loaded inside, we can obtain the names and desired permissions of all the loaded sections, together with the exported functions, which match those necessary for the functioning of the DynamoRIO DBI engine (those are typically characterized by the dr prefix). At this point we can say that the protection scheme for this type of artifact, based on altered VirtualQuery results, has a problem if an adversary studies in detail the system they want to analyze: gaps remain in the presence of an opponent able to deepen the introspection techniques, avoiding the usual inspection ways based on standard operating system APIs.
² These functions assist the DBI runtime in its operation and can potentially be manipulated by a malicious program to subvert an analysis based on DBI, breaking also the efficacy of the interposition property that we defined in Sect. 3.2.
Fig. 4. Findings from our method. The region starting at address 0x71000000 contains the magic MZ sequence (bytes 5a 4d, swapped for endianness) that characterizes PE Windows executables and libraries. From the PE header we can locate the sections, with their desired displacement, permissions, and size from the image base (i.e., the region's start), then the list of exported methods and storage symbols that the DBI engine will likely access during the execution.
4.2 Algorithm
Following the explanation of the proposed detection flow, in this section we detail the functioning of the memory scanning algorithm, in order to provide the technical details needed for correctly reading, during the acquisition process, the information contained in the process pages. The first step of the memory scanning process is the correct parameterization of the VirtualQuery library call, which is characterized by the prototype reported in the code snippet below:
SIZE_T VirtualQuery(
    LPCVOID                   lpAddress,
    PMEMORY_BASIC_INFORMATION lpBuffer,
    SIZE_T                    dwLength
);
The API populates a MEMORY_BASIC_INFORMATION data structure supplied as a pointer by the user. Starting from address 0x0, multiple invocations of this call gradually reveal the different memory regions of the calling process, with basic information that allows a first skimming. We can get the memory status and type (such as permissions and rights), as well as the low and high boundary addresses that delimit the region itself.
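For reference, this is the layout of that structure as commonly declared in the Windows SDK headers for 32-bit targets (reproduced here for convenience):

typedef struct _MEMORY_BASIC_INFORMATION {
    PVOID  BaseAddress;        // low boundary of the region
    PVOID  AllocationBase;     // base address of the original allocation
    DWORD  AllocationProtect;  // protection requested at allocation time
    SIZE_T RegionSize;         // BaseAddress + RegionSize is the high boundary
    DWORD  State;              // MEM_COMMIT, MEM_RESERVE or MEM_FREE
    DWORD  Protect;            // current protection, e.g. PAGE_NOACCESS
    DWORD  Type;               // MEM_IMAGE, MEM_MAPPED or MEM_PRIVATE
} MEMORY_BASIC_INFORMATION, *PMEMORY_BASIC_INFORMATION;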
MEMORY_BASIC_INFORMATION mem;
SIZE_T numBytes;
uintptr_t address = 0;

while (1) {
    numBytes = VirtualQuery((LPCVOID)address, &mem, sizeof(mem));
    if (!numBytes) {
        printf("Invalid query to VirtualQuery!\n");
        break;
    }

    uintptr_t startAddr = (uintptr_t)mem.BaseAddress;
    SIZE_T size = mem.RegionSize;
    address += size;

    // omitted: parse permissions and rights
}
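The omitted permission parsing could, for instance, flag the suspicious MEM_PRIVATE / PAGE_NOACCESS combination discussed in Sect. 4.1; a possible body for that branch (our sketch, not code from the paper) is:

// report committed private regions that deny every access,
// the combination observed only under DynamoRIO in Sect. 4.1
if (mem.State == MEM_COMMIT &&
    mem.Type == MEM_PRIVATE &&
    mem.Protect == PAGE_NOACCESS) {
    printf("Suspicious region at %x (size %x): MEM_PRIVATE | PAGE_NOACCESS\n",
           (unsigned)startAddr, (unsigned)size);
}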
The code listed above implements a scanning cycle over the whole userland address space of the application. As we mentioned in the previous section, the DBI engine can, however, massage the results returned by VirtualQuery to hide the presence of regions or provide fake information about them. We can nonetheless get around VirtualQuery and obtain information about valid regions of memory using a low-level memory scanning technique. The trick is to register an SEH exception handler to catch the errors that arise when accessing invalid memory, and to try to scan every page in memory. In fact, the organization of memory in pages by the memory subsystem of the machine allows us to scan memory at the granularity of a page, which on Windows systems is commonly 4 KB. We can thus access addresses that are 4096 bytes apart from one another and see whether the enclosing page is accessible (i.e., no exception is triggered) or not. As we are looking for a DLL component, we can further speed up the memory scanning by borrowing a technique described in a recent work [22] on Return-oriented programming [43], a well-known code reuse technique used in memory exploitation attacks [1, 6, 16, 17, 48] and recently explored also for malware [8, 39, 49]. When seeking addresses belonging to DLL code, the authors suggest focusing on the portion of the 32-bit address space between 0x50000000 and 0x78000000. The rationale is that a DLL file, under the ASLR (Address Space Layout Randomization) technology implemented by Windows releases, can be loaded randomly in 10240 different locations in the range specified above, maintaining a 64-KB alignment for addresses. As we can see from the next code snippet, by moving between pages that are 64 KB apart we can try to guess whether a DLL has been loaded there. Thus we need to execute our memory scanning sequence protected by an SEH handler only in the range reported above, and we can reduce the number of accesses to 1/16 by checking subsequent addresses every 65536 bytes instead of every 4096. If the access does not cause an exception, we check the first two bytes that can be read from it and compare them against the magic MZ sequence that characterizes
every Portable Executable (PE) module on Windows, including the DLL used by DynamoRIO. When such bytes are found, we apply standard PE header parsing techniques to identify (i) the list of memory-mapped sections of the DLL module and (ii) the list of exported API functions [26]. By doing so we obtain a rather rich picture of the structure and capabilities of the found region.
int lookForLibrary(uintptr_t addr) {
    unsigned short MZ = 0;
    __seh_try {
        MZ = *((unsigned short *)addr);
        if (MZ == 0x5a4d) {
            printf("Found MZ magic sequence at "
                   "%x for unknown DLL...\n", addr);
            return 1;
        }
    } __seh_except(info, context) {
        // for debugging only
        // if (info->ExceptionCode == EXCEPTION_ACCESS_VIOLATION)
        //     printf("Access violation exception raised.\n");
    } __seh_end

    return 0;
}

void lookForDLL() {
    uintptr_t start = 0x50000000;
    uintptr_t end   = 0x78000000;

    while (start <= end) {
        if (lookForLibrary(start)) {
            // candidate module found: PE header parsing can start from here
        }
        start += 0x10000; // move to the next 64-KB-aligned candidate base
    }
}
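As a complement to the listing above, the following is a minimal sketch of our own (not code from the paper) of how the PE header walk mentioned earlier could enumerate section names and exported symbols once a candidate base address has passed the MZ check; it assumes the module is already mapped in memory, so relative virtual addresses can be resolved directly from the base, and it should itself run under the same SEH protection.

#include <windows.h>
#include <stdio.h>
#include <stdint.h>

/* Walk the PE headers of a module mapped at `base`: print its sections and
   exported names (DynamoRIO's runtime exports are typically dr_-prefixed). */
void inspectModule(uintptr_t base) {
    IMAGE_DOS_HEADER *dos = (IMAGE_DOS_HEADER *)base;
    if (dos->e_magic != IMAGE_DOS_SIGNATURE)            /* 'MZ' */
        return;
    IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)(base + dos->e_lfanew);
    if (nt->Signature != IMAGE_NT_SIGNATURE)            /* 'PE\0\0' */
        return;

    /* Sections: name, relative virtual address, desired characteristics. */
    IMAGE_SECTION_HEADER *sec = IMAGE_FIRST_SECTION(nt);
    for (WORD i = 0; i < nt->FileHeader.NumberOfSections; i++, sec++)
        printf("section %-8.8s rva=%08x characteristics=%08x\n",
               (const char *)sec->Name, (unsigned)sec->VirtualAddress,
               (unsigned)sec->Characteristics);

    /* Exported function names from the export directory, if present. */
    DWORD expRva = nt->OptionalHeader
                     .DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT]
                     .VirtualAddress;
    if (!expRva)
        return;
    IMAGE_EXPORT_DIRECTORY *exp = (IMAGE_EXPORT_DIRECTORY *)(base + expRva);
    DWORD *names = (DWORD *)(base + exp->AddressOfNames);
    for (DWORD i = 0; i < exp->NumberOfNames; i++)
        printf("export %s\n", (const char *)(base + names[i]));
}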
1024) are selected in the array. Only the 256 cells that are not masked are kept for the key generation; they are the "good" cells, with resistances away from the median value, and they are part of the SCP. Again, this allows the formation of a buffer zone between the "0"s and the "1"s, which is effective in reducing CRP error rates. The client device will suspect that an intrusion has occurred if too many cells out of the 768 tested have lower than usual resistance values. The injected electric current of this last test is 50 nA; this should not damage any cell, but it will detect damaged cells.
4.3 Sensing Tampering Activities

Any attempt by an intruder to read the resistance of the VCP's cells would result in a permanent, noticeable decrease in the electrical resistance of these cells. This irreversible effect acts as a sensor used to detect attempts to tamper with the device. Protocols such as the one described in Sects. 3.2 and 4.2 use ternary cryptographic schemes to mask the addresses of the VCP cells from the SCP. A third party acting as a man-in-the-middle cannot send random handshakes to the client device to observe the key generation cycles without taking the risk of damaging some VCP cells. In a similar way, fault injection in the circuitry of the client device can damage the VCP cells. One attack that could remain effective is the full characterization of the entire array when the device is under the physical control of the opponent. Such a characterization would start with a systematic characterization at low injected current to identify the cells that are part of the VCP. The remedy for this type of physical attack, as done in the example protocol proposed above, is to prevent the characterization of the 768 cells having M at "0" before generating the cryptographic keys from the 256 SCP cells. The intrusion detection scheme that we are proposing in this section is an additional tool that needs to be integrated into a wider cybersecurity protocol to make it more difficult to physically explore the PUF unnoticed.
Another example of software detecting tampering activity without key generation is the following:
This protocol can be part of the schemes proposed in Sect. 3.2 and Sect. 4.2 to enhance the likelihood of detecting intrusions. A simplified version, in which the client device randomly reads 5000 cells at 50 nA, is possible but not necessarily desirable, as it can also be used by the opponent to characterize the array. This last method is not as sensitive, because the potential density of shorts is lower when the population tested does not discriminate between VCP and SCP.
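As a purely illustrative sketch of this client-side check (our own, not the authors' protocol; the reram_read_resistance driver call, the thresholds, and the sampling strategy are all hypothetical), the intrusion test described above could be organized along these lines:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver hook: returns the resistance (in ohms) measured while
   injecting current_nA into the cell at the given address. */
extern double reram_read_resistance(size_t cell_address, unsigned current_nA);

/* Flag a suspected intrusion when too many of the sampled cells show the
   permanent resistance drop caused by earlier reads at higher currents. */
bool intrusion_suspected(const size_t *sampled_cells, size_t n,
                         double low_resistance_threshold,
                         size_t max_expected_damaged)
{
    size_t damaged = 0;
    for (size_t i = 0; i < n; i++) {
        double r = reram_read_resistance(sampled_cells[i], 50 /* nA */);
        if (r < low_resistance_threshold)
            damaged++;
    }
    return damaged > max_expected_damaged;
}

In a deployment, the resistance threshold and the expected number of damaged cells would presumably be derived from the enrollment statistics stored by the server.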
5 Conclusion and Future Work

In most cryptosystems, successful hackings often follow unnoticed attacks in which the opponent slowly gains access to important security information. PUFs with the ability to sense tampering activities can enhance security, as the cryptographic keys are only generated on demand. Based on the analysis of the physical properties of the ReRAM technology at low power, we propose the design of a tamper-sensitive PUF-based cryptographic protocol that is summarized in the following way:
• A network of client devices, each containing one ReRAM-based PUF, is secured by a server storing the image of these PUFs in a look-up table.
• During the enrollment cycle, the cells of each ReRAM PUF are characterized at low injected currents. The cells with relatively high resistances carry a ternary state; they are part of the vulnerable cell population (VCP). The cells with relatively low resistance are part of the stronger cell population (SCP), which is characterized at various levels of higher injected currents. The results are stored by the server in a look-up table for future authentication and key generation.
• During the authentication cycle, a ternary protocol allows the handling of both VCP and SCP cells. The server randomly selects 256 cells that are part of the SCP and uses their resistance values at a certain level of electric current to generate the cryptographic key.
• The server secretly shares the information needed by the client devices, in particular the way to effectively mask the VCP, to independently generate the same keys from their ReRAM PUFs.
• To detect intrusion, the client device can check whether some of the cells that are part of the VCP were damaged at suspicious levels.

In this paper, we are not describing in detail important complementary layers of security, such as multi-factor authentication and the generation of public-private key pairs for asymmetrical encryption. From an academic standpoint, we also see the need to study in detail the physics of various ReRAM technologies at low injected current. This is an area that is in general poorly understood compared with the physics at higher power, which is needed to build non-volatile memories. Going forward, we intend to develop a full system implementation to protect cyber-physical systems. Using the work reported in this paper, our research team supported the design of an Application Specific Integrated Circuit (ASIC) that will demonstrate the full cryptographic protocols using the code presented above. The ASIC incorporates a ReRAM array and the driving circuitry to measure the resistance of each cell at various levels of injected current. This will allow a comprehensive characterization of the technology in terms of error rates, sensitivity, and latencies.

Acknowledgments. The authors thank Dr. Donald Telesca from the Information Directorate of the US Air Force Research Laboratory (AFRL) for his scientific contribution to, and support of, this research work. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of AFRL. The authors also thank the research team at Northern Arizona University, in particular Julie Heynssens and Ian Burke, and the PhD graduate students Morgan Riggs and Taylor Wilson. We also thank our industrial partners who provided outstanding ReRAM samples, with the support of Dr. John Callahan and Ankineedu Velaga from BRIDG, and Ashish Pancholi, Dr. Hagop Nazarian, Dr. Jo Sung-Hyun, and Jeremy Guy from Crossbar Inc.
Locating the Perpetrator: Industry Perspectives of Cellebrite Education and Roles of GIS Data in Cybersecurity and Digital Forensics

Denise Dragos(B) and Suzanna Schmeelk

Computer Science, Mathematics and Science, St. John's University, New York, NY, USA
{dragosd,schmeels}@stjohns.edu
Abstract. Geographic Information Systems (GIS) data is incorporated into cybersecurity and digital forensics at many levels, from the development of secure code to the metadata stored by systems and employed during civil and criminal cases. This paper reports on the role of GIS data both in the development of systems and in the development of legal probable cause. Specifically, we report on how GIS data is incorporated into three advanced courses: Wireless Security, Secure Software Development, and Forensic Investigation of Wireless Network and Mobile Devices. GIS data needs to be accurate for many reasons, including probable cause. We report on IRB-approved student surveys about their experience in the three courses. We find that overall students liked the courses and offered insights into future course improvements.

Keywords: Geographic Information System (GIS) · Location analysis · Crime · Forensic Investigation of Wireless Network and Mobile Devices · Wireless security · Secure Software Development
1 Introduction

Location analytics has emerged as an important set of concepts, tools, and techniques with increasingly critical applications in the cybersecurity and digital forensics sectors. Geographic Information Systems (GIS) process location information typically comprising latitudes, which range from −90 to 90, and longitudes, which range from −180 to 180. This research explores GIS in three of the advanced courses in our cybersecurity and digital forensics curriculum.

This paper is organized as follows. First we review related GIS literature and work. Then we describe how GIS is incorporated into three advanced cybersecurity and digital forensic courses: Forensic Investigation of Wireless Network and Mobile Devices, Wireless Security, and Secure Software Development. In each course we discuss components of the curriculum that involve GIS data, as many similar curriculums do not go into detail on GIS data. In Forensic Investigation of Wireless Network and Mobile Devices, we discuss the role of GIS for alibi and/or offender modus operandi, among other topics involving GIS data. In Wireless Security, we discuss GIS for reconnaissance among
other security concerns involving GIS data. And in Secure Software Development, we emphasize numeric concerns such as underflow, overflow, denormalized numbers, casting, and truncation. These concerns can completely change the interpretation of GIS data.
2 Review of Literature and Related Work

To date, very little has been published about incorporating GIS into curriculums from a digital forensics and cybersecurity perspective. Currently, two main bodies of GIS education research exist: curriculum and justification. From a curriculum perspective, for example, Gorton et al. [1] report on incorporating GIS data for archeology purposes. Chimgee et al. [2] report on integrating GIS into business school curriculums. Khmelevsky et al. [3] discuss distance learning with GIS curriculums. Finally, Campbell [4] discussed GIS education for non-computer-oriented college students. Other curriculums that emphasize GIS data can be found in environmental science, life sciences, and economics. Little, if anything, has been published about GIS data in cybersecurity- and digital-forensics-specific curriculums.

The justification perspective, on why studying GIS is important, is a second body of education research. John Barr and Ali Erkan [5] examine the role of GIS in computer science education. Sean W. Mulvenon and Kening Wang [6] examine the integration of GIS and educational achievement data for education policy analysis and decision-making. Finally, Thomas A. Wikle [7] discussed the digital future of GIS. Little, if anything, has been published on the importance of GIS data in security and forensics curriculums.
3 GIS in Forensics Curriculums

GIS and its underlying data show up in nearly every cybersecurity and digital forensics course, either explicitly or implicitly. We examine the use of GIS in three advanced courses: Wireless Security, Secure Software Development, and Forensic Investigation of Wireless Network and Mobile Devices. Since GIS data is essential for the success of many legal cases, this research focuses on the forensics course.

3.1 GIS in Wireless Security

Wireless security deals with the processing of wireless network protocols, which in many cases carry GIS data. We cover GIS data in the 2nd week of the course, as we structure part of the course around Joshua Wright and Johnny Cache's book [8, 9]. For example, during this week we examine collecting data via Kismet. The students find it very interesting when we take a Keyhole Markup Language archive (KMZ) file, which has embedded GIS locations of wireless access points scanned with passive and active network monitoring tools, and import it into Google Earth, as seen in Fig. 1.
Fig. 1. Kismet export in Google Earth
3.2 Secure Software Development

Many systems are built to process GIS data. In Java, GIS data can be stored as BigDecimals or as a GIS type; typically, the values are decimal. In Java, common decimal representations are double, float, and BigDecimal. Numbers in Java are typically vulnerable to number overflow issues. Secure software designs should consider such overflow conditions and throw an Exception rather than mindlessly continuing to compute. Secondly, a loss of precision on latitude and longitude data can mean a complete loss of geographic precision. The Empire State Building, for example, is located at 40.7484° N, 73.9857° W. If we were to set a marker on a mapping utility, we would have a precise definition of the Empire State Building, as seen in Fig. 2 [10].
Fig. 2. 40.7484° N, 73.9857° W
However, if we were to truncate the decimal, our map would show 40° N, 73° W, as seen in Fig. 3. As can be seen, the decimal precision is extremely important for location analysis.
Fig. 3. 40° N, 73° W
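To make the precision point concrete, the following small C example (ours, not from the course material) shows how carelessly casting the coordinates to integers moves the marked point a long way; the figure of roughly 111 km per degree of latitude is only a rule of thumb.

#include <math.h>
#include <stdio.h>

int main(void) {
    double lat = 40.7484, lon = -73.9857;   /* Empire State Building */
    int lat_t = (int)lat, lon_t = (int)lon; /* careless cast drops the decimals */

    /* one degree of latitude is roughly 111 km, so the truncated point
       ends up tens of kilometres away from the real location */
    printf("original : %.4f, %.4f\n", lat, lon);
    printf("truncated: %d, %d (north-south error ~%.0f km)\n",
           lat_t, lon_t, fabs(lat - lat_t) * 111.0);
    return 0;
}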
Lastly, the Carnegie Mellon University Software Engineering Institute recommends the following for BigDecimal types: "NUM10-J. Do not construct BigDecimal objects from floating-point literals" [11]. CMU SEI gives two examples: a non-compliant example is shown in Fig. 4 and a compliant example in Fig. 5.
Fig. 4. CMU-SEI non-compliant BigDecimal
Fig. 5. CMU-SEI compliant BigDecimal
3.3 GIS in Mobile Device Forensics

The examination of GIS data as part of cellular and mobile device forensics is inherently vital. Due to the transitory nature of these devices and the ubiquity of location services being enabled, there is often a treasure trove of available data.
In our course, we study the two most common cell phone operating systems, iOS and Android. Students initially study the operating systems from a tool-agnostic approach. They learn the structure and functions of how both systems interact with the hardware and manage the stored data. We then dive deeper using both BlackBag Forensics' Mobilyze and Cellebrite's Universal Forensic Extraction Device (UFED) and Physical Analyzer tools. We also examine IoT and wearable devices, such as Fitbit and Apple Watches, as well as vehicle forensics. Cellebrite and other tools can scrape GIS data from devices and embed it into maps to help visualize what the data represents, as shown in Fig. 6. Using location data in conjunction with timeline analysis can give an analyst a much greater understanding of a subject's movements.
Fig. 6. Cellebrite location mapping
Participating in the Cellebrite Academic Program allows the course instructor to access official training material and data sets for student instruction. Students are afforded one attempt to pass the Cellebrite Certified Operator (CCO) exam. Upon receiving a passing mark, they are awarded the official certification. Students also register for and take the online BlackBag Certified Mobilyze Operator (CMO) course and certification. Students are given ZTE Prelude Z993 Android phones to work with during the class. They are also encouraged to analyze their own personal phones in order to understand the enormous scope of data that resides on the devices. GIS data analysis is illustrated by the following two examples. A homicide case in Australia was solved by Queensland police when they recovered geolocation data on the suspect's phone that led them directly to the location where the victim's body had been hidden. Investigators were able to decode the phone's database to reveal spoken turn-by-turn directions requested of Google Maps [12]. Geolocation data recovered from photos taken on a cell phone and posted to Instagram was used to prove that Russian troops were operating within Ukraine in 2014 [13].
4 Educational Perspectives: Course Surveys

We surveyed our cybersecurity and digital forensics classes in Spring 2020. Specifically, 140 students across our classes submitted our IRB-approved survey inquiring about the classes and the overall certifications. We discuss the survey results below.

4.1 Learning Satisfaction

We asked our students during the Spring 2020 courses, "Q2 - Please agree on a Likert scale for the following questions: Did you feel that you learned from: (2)." In response, 63.78% (81) strongly agreed, 31.50% (40) somewhat agreed, 3.94% (5) neither agreed nor disagreed, 0.00% (0) somewhat disagreed, and 0.79% (1) strongly disagreed. A histogram of the results can be seen in Fig. 7.
Fig. 7. Survey Q2 – learning satisfaction
4.2 Student Survey: Working in Teams

Our survey asked the students in Question Four, "Please agree on a Likert scale for the following questions: Did you feel that you learned from: (3) Working in teams [5 strongly agree–1 strongly disagree]." In response, 48.82% (62) Like a great deal, 31.50% (40) Like somewhat, 11.02% (14) Neither like nor dislike, 6.30% (8) Dislike somewhat, 2.36% (3) Dislike a great deal. A histogram of results can be seen in Fig. 8.
Fig. 8. Survey Q4 – working in teams
4.3 Student Survey: Virtual Teams

We asked our cybersecurity and forensic students, "Q5 - Please agree on a Likert scale for the following questions: Did you feel that you learned from: (4) Working in virtual teams [5 strongly agree–1 strongly disagree]." The students responded as follows: 24.41% (31) Like a great deal, 38.58% (49) Like somewhat, 22.83% (29) Neither like nor dislike, 9.45% (12) Dislike somewhat, 4.72% (6) Dislike a great deal. A histogram of results can be seen in Fig. 9.
Fig. 9. Survey Q5 – working in virtual teams
4.4 Student Survey: Class Overall

We asked our students, "Q6 - Please agree on a Likert scale for the following questions: Did you feel that you learned from: (5) The class overall [5 strongly agree–1 strongly disagree]." The students responded as follows: 55.12% (70) Like a great deal, 37.80% (48) Like somewhat, 3.94% (5) Neither like nor dislike, 3.15% (4) Dislike somewhat, 0.00% (0) Dislike a great deal. A histogram of results can be seen in Fig. 10.
Fig. 10. Survey Q6 – learning class overall
4.5 Student Survey: Industry Certification

Our last Likert-scale question asked the students, "Q7 - Please agree on a Likert scale for the following questions: Did you feel that you learned from: (6) Certification (if applicable) [5 strongly agree–1 strongly disagree]." Their responses were: 42.28% (52) Like a great deal, 23.58% (29) Like somewhat, 32.52% (40) Neither like nor dislike, 1.63% (2) Dislike somewhat, 0.00% (0) Dislike a great deal. A histogram of the results can be seen in Fig. 11.
Fig. 11. Survey Q7: certification
4.6 Student Survey: Final Project

We also asked our students two open-ended questions. The first was, "Q8 - (7) What were the benefits to working on the final project? (Please keep responses anonymous)." The second was, "Q9 - (8) What were the obstacles to working on the final project? (Please keep responses anonymous)." We had 15 students respond about the Cellebrite certification, as shown in Table 1.

Table 1. Survey Q8 & Q9 – benefits and obstacles
#   Description
1   The benefits included a certificate for Cellebrite that normally costs $1k+ to take and the final exam helped us see how much we learned throughout the semester
2   More classes for Cellebrite learning, if possible
3   Referring to the CCO Material, I learned a lot. It was very interesting learning to investigate mobile devices and actually learning hands on material
4   Since we took a certification instead it was great that we can walk away from this course with something for our resume
5   The certification on a resume, learning a tool to use in a workplace
6   The certification will look good on my resume, which will be extremely important in my working future
7   Received a certification, I liked this idea and think it should remain in the course
8   It helped me gain a certification
9   Got an industry recognized certification
10  Working on the COO certification was tough. The material wasn't bad it was just when it came to the exam
11  Not knowing the contents of the certification exam earlier
12  Provided the class is in person, more hands-on work with the programs themselves would make learning the uses of the tools easier for those that might struggle working with theoretical information that they are supposed to retain for a certification exam
13  Nothing, I liked that the certification was implemented into the course
14  Nothing, I liked that the certifications were implemented into the course work
15  Better allocation of the material so that students can get a better understanding of a topic before being thrown into a new one. I would also suggest that the certification labs be done with the professor so the students can better understand how to maneuver the tool
5 Discussion and Future Work

This paper contributes to education and research developments, and to their implications for location intelligence. Location intelligence is essential to cybersecurity and digital forensics for many reasons, including developing a preponderance of evidence for civil and criminal trials. Future research includes building laboratory exercises where students examine location analytics in other courses, such as Healthcare Information Security with VPN authentication data and Crisis Management with GIS data on Twitter, among other applications.
6 Conclusions

Location intelligence is an essential component of both cybersecurity and digital forensics. Location intelligence informs everything from authentication to probable cause during a legal trial. It is essential that computer science, cybersecurity, digital forensics, and criminal justice students understand location data (e.g., GIS data) so that they can properly apply the correct technical and legal requirements.
7 IRB Approval

All reported data is covered by IRB approval. Our first survey question asked students to agree or disagree to participate in the study: 99.29% (140) indicated yes, and only 0.71% (1) indicated no. The students who indicated no had the survey ended at that point.
References

1. Gorton, D., et al.: WETANA-GIS: GIS for archaeological digital libraries. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2006, p. 379. Association for Computing Machinery, New York (2006)
2. Chimgee, D., Bolor, A., Enerel, A., Erdenechimeg, J., Oyunsuren, S.: Integrating GIS into business school curricula. In: Proceedings of the 2019 8th International Conference on Educational and Information Technology, ICEIT 2019, pp. 217–221. Association for Computing Machinery, New York (2019). https://doi.org/10.1145/3318396.3318436
3. Khmelevsky, Y., Burge, L., Govorov, M., Hains, G.: Distance learning components in CS and GIS courses. In: Proceedings of the 16th Western Canadian Conference on Computing Education, WCCCE 2011, pp. 17–21. Association for Computing Machinery, New York (2011). https://doi.org/10.1145/1989622.1989627
4. Campbell, H.G.: Geographic information systems education for non-computer oriented college students. SIGCSE Bull. 26(3), 11–14 (1994). https://doi.org/10.1145/187387.187393
5. Barr, J., Erkan, A.: Educating the educator through computation: what GIS can do for computer science. In: Proceedings of the 43rd ACM Technical Symposium on Computer Science Education, SIGCSE 2012, pp. 355–360. ACM, New York (2012). https://doi.org/10.1145/2157136.2157242
6. Mulvenon, S.W., Wang, K.: Integration of GIS and educational achievement data for education policy analysis and decision-making. In: Proceedings of the 2006 International Conference on Digital Government Research, dg.o 2006, pp. 354–355. Digital Government Society of North America (2006). https://doi.org/10.1145/1146598.1146699
7. Wikle, T.A.: Geographic information systems: digital technology with a future? ACM SIGCAS Comput. Soc. 21(2–4), 28–32 (1991). https://doi.org/10.1145/122652.122657
8. Wright, J., Cache, J.: Chapter 2: scanning and enumerating 802.11 networks. In: Cache, J.W. (eds.) Wireless Hacking Exposed, p. 544. McGraw-Hill Education (2015)
9. Wright, J., Cache, J.: Index of Chapter /files/02, 3 March 2015. Retrieved from hackingexposedwireless: http://www.hackingexposedwireless.com/files/02/
10. Google Maps API, 13 July 2020. Retrieved from https://www.google.com/maps/place/40%C2%B044'54.2%22N+73%C2%B059'08.5%22W/@40.7484,-73.9878887,17z/data=!3m1!4b1!4m5!3m4!1s0x0:0x0!8m2!3d40.7484!4d-73.9857
11. Carnegie Mellon Software Engineering Institute: NUM10-J. Do not construct BigDecimal objects from floating-point literals, 13 July 2020. Retrieved from wiki.sei.cmu.edu: https://wiki.sei.cmu.edu/confluence/display/java/NUM10-J.+Do+not+construct+BigDecimal+objects+from+floating-point+literals
12. Cellebrite Case Study: How Turn-By-Turn Driving Directions Uncovered a Killer's Lies (2020). Retrieved from Cellebrite.com: https://www.cellebrite.com/en/case-studies/digitalevidence-helps-uncover-nationwide-car-theft-operation/
13. Park, A.: Five Case Studies With Digital Evidence In Corporate Investigations (2020). https://www.controlrisks.com/campaigns/compliance-and-investigations/five-casestudies-of-interest-to-corporate-investigators
Authentication Mechanisms and Classification: A Literature Survey

Ivaylo Chenchev1(B), Adelina Aleksieva-Petrova1, and Milen Petrov2

1 Technical University of Sofia, Sofia, Bulgaria
[email protected]
2 Sofia University, Sofia, Bulgaria
[email protected]
Abstract. Security must be considered from the moment two nodes need to communicate with each other, through the moment of confirmed authentication, to the established secure session; it is an essential component of every authentication. In general, nodes communicate over different communication channels, which can be in private or in public networks. The methods and complexity of authentication vary depending on the parties involved in the communication. In this paper, we survey the various methods for authentication based on particular criteria, such as the ability to remember passwords; different graphical schemes; body specifics (fingerprints, face); combinations of various techniques (certificates, PINs, digital signatures, one-time passwords, hash-chains, blockchains); QR codes (also with the involvement of augmented reality) for transmitting long passwords; and others. We aim to survey the literature in as much detail as possible and to classify the ways of authenticating and the different authenticators.

Keywords: Authentication · Blockchain · Certificates · Passwords · Multi-factor · Biometric · Security system
1 Introduction

Designing a security system that accurately identifies, authenticates, and authorizes trusted individuals is highly complex and filled with nuances, but critical to security [47]. What is the meaning of these three access control steps?

• First step – Identification (who are you?)
• Second step – Authentication (a process to prove your identity)
• Third step – Authorization (what is your access, what are your rights in the system)

These access control steps are essential to apply in each system that handles sensitive data. One of the APTITUDE project's main tasks is to design and implement a prototype of a software system for learning and gaming analytics of data extracted from
learning management systems and educational games, which are utilized for learner-centric adaptation. That is why the authentication mechanisms need to be classified and investigated for data protection at the system level.

The purpose of this paper is to analyze existing studies and to summarize the research efforts in authentication methods, with a broad scope covering both private and public networks. For example, if someone has to access a test environment in a private network (without access to the Internet), authentication via password is most probably enough. But if a client has to access his/her bank account over the Internet, then using only a password is definitely not enough. In this case, multi-factor authentication should be used, with the involvement of a third-party device (a hardware or software token, a mobile phone with a specific application, or any combination).

We find in [29] four principles of authentication. It is worth mentioning two of them, because we believe they should be vital for building authentication systems. If the access is ongoing, then the identity verification should be continuous (this is the security component); the other principle is that authentication should be trivial for the person legitimately authenticating [30] but hard for an adversary to defeat [31], which relates to usability.

The authors in [4] group the authentication methods into four categories – based on the knowledge factor (something we know), based on the inherence factor (something that we are), based on the possession factor (something we have), and based on multiple factors. Missing from their research is the authentication method based on exchanged certificates. They also discuss the need to have different and strong passwords, but they do not discuss how to mitigate (at least partially) the circumstances around password selection – validity period and complexity. Those can be enforced, for example, by using Identity Management Systems (IDM) and password policies. If there is a need for authentication across multiple authorities or organizations, Federated Identity Management should be used.

In [28] two methods are proposed for improving the authentication factors "what you can" and "how you do it", as they are not transferrable to an attacker. The focus of the authors is to suggest alternatives for authentication when accessing high-security environments – like nuclear plants. They propose a simple and generic formal model for such an authentication system in the form of a probabilistic state transition machine. We find a description of the challenge-response mechanism where the challenges take the form of games, and the responses are an encoding of how the game was played by the prover.

The widely used three-dimensional categorization [18] is "who you are" (e.g., biometric), "what you know" (e.g., password), and "what you have" (e.g., token). Those three are an excellent mnemonic scheme and unlikely to fall from use, but the categorization is not without problems. For instance, a password is not strictly known; it is memorized. This means that the password can be forgotten over a certain period of time. Also, biometrics are definitely not "who you are" but represent just one feature of your appearance (for example fingerprint, palm-print, retina, face, etc.). The authors prefer the following labels: knowledge-based, object-based, and ID-based. This labeling makes sense, and we will be using it throughout this paper.
Five years ago, the authors of [68] shared the idea of Choose Your Own Authentication (CYOA) – a novel authentication architecture that enables users to choose a
scheme amongst several authentication alternatives. Nowadays, we see the idea implemented in almost all modern smartphones. As part of the initial setup and configuration, we have to select the method for authentication from among the following – PIN, pattern, password, fingerprint, face recognition. Although the so-called master password cannot be removed (it is used for some specific administrative configurations and installations), the user has to select the commonly used (daily) method for authentication.

The authors in [6] review the blockchain approach to authentication. One of their focuses is an overview of blockchain and how it can be used to solve PKI (Public Key Infrastructure). Some traditional PKI problems are also compared to blockchain-based key-management approaches, including CA (Certificate Authority) and WoT (Web of Trust). Those are the two most common approaches to achieving PKI – centralized, through a Certificate Authority (standardized in the X.509 standard), and decentralized, through WoT (this concept with self-signed certificates was first proposed in 1992 by Phil Zimmermann [7, 8]).

Some authors [1, 2] review the possible authentication methods for different age groups, depending on how children and older users remember passwords and images.

We aim to extend the existing authentication surveys by making the most comprehensive and detailed overview possible, regardless of usage and application. Together with the most commonly used techniques for authentication, we review some possible methods for the authentication of users belonging to different age groups and of people with different disabilities. We also review the fast-moving blockchain technology in the field of authentication. As a result of the whole study, we created three classifications: places where human-to-device and device-to-device authentication is required, general authentication techniques, and authenticator types.

The paper is structured as follows. Next is the preliminary research, which gives an overview of how the data was extracted and how many papers were used for this research. Section 3 gives a detailed overview of the most commonly used authentication mechanisms and techniques. The resulting classifications can be seen in Sect. 4. Section 5 summarizes the results and gives the conclusion.
2 Preliminary Research

We identified 1088 research papers from 3 publishers. The search criteria within the publishers' databases were based on the following keywords: blockchain, authentication, QR-code, OTP, hash-chain, decentralized, P2P networks, PKI, face recognition, password policy, augmented reality. The search itself was done in the articles' metadata (keywords, abstracts). The selected publications were published in the period between 1969 and mid-2020 (the time of this paper's preparation). Once we had the initial set of papers, we did a few other searches within them, this time extending the set of search words with the following: blockchain, authentication, password, P2P networks, OTP, hash-chain, decentralised/decentralized, PIN, PKI, face recognition, fingerprint, identity management, IDM, pattern, augmented reality, MFA (Multi-Factor Authentication) and QR code. Table 1 shows the publishers and the covered period (earliest paper year – latest paper year) for which we managed to find research papers.
Table 1. Scanned research papers over a certain period (years)

Info source                          Earliest year   Latest year
ACM Digital Library                  1969            2020
ScienceDirect (Elsevier)             1997            2020
IEEE Xplore - open member access     2005            2020
The distribution of all scanned articles per publisher and their percent correlation from the total amount are presented in Fig. 1.
Fig. 1. Percent correlation and the number of scanned papers per publishers.
The publish dates of the extracted papers vary between 1969 and 2020. We counted the number of occurrences of our 21 keywords in every article and then summarized them per year. The resulting table was too large to be published in full in this paper, so we decided to aggregate the information for the earliest years: we represent the data from the initial years up to 2010 as one group and then show every subsequent year separately. Table 2 presents the number of found articles per year.

Table 2. Scanned research papers per different years.
Year                           1969–2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  Total
Number of titles (articles)          167    31    23    41    24    30    57    78   180   275   182   1088
We can see that the number of extracted research papers almost doubled from 2017 to 2018. There is also another one-third increase in 2019. This survey was prepared in mid-2020, and if we extrapolate the articles found for 2020 (182), the amount is expected to become ~360 by the end of the year, which would again be a one-third increase compared to the previous year (2019).
33 0 7 5
45
2
21
40
MFA
IDM
Identity Management
5140
94
387
PKI
P2P
24822
82
57
388
151
Pattern
Fingerprint
Grand total
0
421
832
Certificate
0
166
1592
Face
Decentralised
208
1435
PIN
2
859
4950
Authentication
23
851
4710
Security
Decentralized
0
2355
0
10246
Blockchain
2011
Password
1969–2010
Keyword
Year
4969
2
0
0
1
3
3
87
0
1
49
550
208
900
741
2424
0
2012
7186
2
6
0
161
11
10
244
0
4
22
132
392
1161
1334
3707
0
2013
4568
21
48
0
7
24
82
195
1
0
46
965
119
1015
755
1285
5
2014
5244
7
2
4
0
65
244
165
1
0
599
248
186
955
842
1903
23
2015
11577
2
139
105
27
175
299
289
12
39
343
237
356
2203
2219
4569
563
2016
11587
26
12
145
46
198
84
300
22
162
566
651
690
2199
1910
2316
2260
2017
21869
43
28
26
228
186
415
416
52
567
503
660
1267
3230
3478
2500
8270
2018
40841
134
212
98
333
415
487
368
108
1366
729
866
1369
5281
6431
2587
20057
2019
Table 3. Total number of occurrences of keywords in the scanned research papers over the years.
25028
46
87
517
367
303
137
357
55
770
227
938
1112
3365
6209
1322
9216
Mid 2020
162831
328
562
897
1248
1861
1969
2891
251
2934
4337
7005
7342
26118
29480
35214
40394
Grand total
Another conclusion worth mentioning is that the number of papers for 2014 and 2015 is lower than in 2013. This inference can also be seen (in more detail) in Table 3. The notable trend, however, starts in 2016: from then up until now, the number of publications increases. Table 3 shows the total number of occurrences of the keywords in the scanned research papers over the years, sorted in descending order by the grand total column. Figure 2, Fig. 3 and Fig. 4 are created from Table 3 to represent the trends in the number of occurrences of password, authentication, security, face, fingerprint and blockchain in a graphical way (for easier perception):
Fig. 2. The trend in number of words - security, authentication and password.
Fig. 3. The trend in number of words – face, fingerprint.
Fig. 4. The trend in number of words - blockchain.
The results in Table 3 show, first of all, that research interest increases year over year, with a slight slowdown in 2014–2015. We only note this, because there could be many reasons behind it and we cannot generalize. For example, Fig. 3 shows the trends for "face" (recognition) and fingerprint. It can be seen that in 2017 the trend for fingerprint goes down, while in the same year the number of occurrences of the keyword "face" increases more than 2.7 times. One reasonable explanation is the advent of a whole new era - the use of face recognition as authentication on mobile devices (phones and tablets). Some other interesting conclusions can also be made. The keyword "password" is found 4569 times in 2016 (the peak), which is 1.5 to 3 times more than in the other years. After 2016 the trend is towards an increasing number of occurrences over the years, which means that we are not getting any closer to the moment when we will get rid of passwords. A similar trend can be seen in Table 3 for the keyword "authentication", which increases almost every year and becomes significant in 2019, with a 63% increase over 2018. Another similar and also considerable trend is for the keyword "security" - an 84% increase in 2019 compared to 2018. It is essential to mention here that we did not search for papers with the keyword "security". An interesting trend is also seen for the keyword "blockchain", which as a technology is rather new and which we explicitly searched for through the publishers: it starts from 563 occurrences in 2016 and reaches 20057 appearances in 2019, an increase of 142% for 2019 compared to 2018.
3 Review of the Different Authentication Mechanisms and Techniques
In this chapter, a few of the most commonly used authentication mechanisms and techniques are covered. We start with password and PIN, as those words were the most common in the scanned papers. We include image recognition (together with CAPTCHA, QR codes and augmented reality), one-time passwords, two-factor and multi-factor authentication, and certificate-based and blockchain authentication. We also review some papers that bring attention to authentication where the users belong to different user groups - different ages and users with specific disabilities.
3.1 Password
Different methods for improving password authentication can be found in [30–32, 46]. The authors of [45] propose, together with graphical passwords, an encrypted key to be sent to the user's e-mail; the user is asked to enter the key afterwards for final validation.
We also find a lot of research papers in the area of improving password quality and password hardening - [34, 35, 41, 44, 53, 55]. In [33], a study is presented that was carried out with 470 students, staff and faculty at Carnegie Mellon University (CMU) about their attitudes and behavior when using passwords, after the university changed its password policy. The authors believe that the results of their survey can help design better password policies that consider not only the technical aspects of specific policy rules, but also users' behavior in response to those rules. The results of a large-scale, three-month study of the password use and password reuse habits of half a million web users are presented in [39]. The UNIX system was the first to be implemented with a password file that contains the actual passwords of all users; for that reason, access to it has to be very strictly controlled. Reviewing the security of UNIX passwords, [31] proposes methods such as the usage of less predictable passwords; salted passwords; improved password encryption (before storing the passwords in the system); usage of a hardware chip to perform DES encryption; and equalizing the login time for correct and incorrect usernames (if the response times differ, someone with bad intentions could find out whether a particular username is valid or not). Early UNIX implementations (1970s–1980s) had passwords with a length of 8 characters and a 12-bit salt. Later on, to meet the increasing security requirements, the shadow password system was used to limit access to the encrypted passwords, with a salt length of eight characters and 86 characters for the generated hash value. The usage of salts has been researched in [40], which proposes a novel password stretching algorithm that creates a strong password from a weak password and provides protection against pre-computation attacks. The research in [21] reviews whether it will ever be possible to get rid of passwords. Passwords can provide the right level of protection if they are used correctly, and from an operational perspective, passwords and PINs have almost universal applicability. The author notes that passwords appear to be attractive because there is virtually no cost (at least no direct cost - our comment) associated with their deployment, while the usage of hardware tokens and biometrics requires additional hardware to be added to each system, which brings additional deployment costs for organizations. Although password authentication is one of the oldest methods for authentication, recent research on the effect of emotional state (fear and stress) on password decision making is well described in [36]; the authors claim to be the first in this area - an investigation of the effects of incidental fear and stress on password choice. An authentication method combining passwords and a smart card is proposed in [22]. Ten requirements are defined for the evaluation of a new password authentication scheme, and the authors present a password authentication scheme that uses a smart card and achieves all ten requirements. Another research work in this area is [23]. It proposes a shared cryptographic token for authentication. The authors define three design criteria and suggest a protocol and tamper-resistant hardware to achieve secure authentication - free from both replay and offline dictionary attacks, even if weak keys are used.
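As an illustration of the salting and stretching ideas discussed above, a minimal sketch in Python could look as follows (this is our illustrative example, not code from any of the cited works; the salt length and iteration count are arbitrary example values):

```python
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes = None, iterations: int = 100_000):
    """Derive a salted, stretched hash from a password (PBKDF2-HMAC-SHA256)."""
    if salt is None:
        salt = os.urandom(16)  # random per-user salt defeats pre-computation attacks
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest

def verify_password(password: str, salt: bytes, iterations: int, expected: bytes) -> bool:
    """Re-derive the hash and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected)

# Example: store (salt, iterations, digest) instead of the plain password
salt, iters, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, iters, stored))  # True
print(verify_password("wrong guess", salt, iters, stored))                   # False
```

Storing only the salt, the iteration count and the derived digest means that an attacker who obtains the password file still has to repeat the expensive derivation separately for every guess and every user.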
To improve the security of password-based applications by incorporating biometric information into the passwords, [32] proposes a method for password hardening based on keystroke dynamics. About ten years later, [65] explores the possibilities of using keystroke dynamics as an additional authentication method for user identification and authentication. Together with password usage, it becomes more and more important to implement mechanisms for password recovery (forgotten passwords); one such work is [53], in which a hands-on cybersecurity project on recovering UNIX passwords is described. Passwords have been used for authentication since the beginning of the computer era. We see that the requirements for password complexity and length keep increasing. Passwords are still used and will continue to be used in some form - most probably in combination with other authentication mechanisms.
3.2 PIN
The abbreviation PIN means Personal Identification Number. In the era of mobile phones, the PIN is used in two places: the first is as the primary method for securing a mobile phone, and the second is in the user's SIM (Subscriber Identity Module) card - a removable token containing the cryptographic keys needed for authentication in the mobile network. Fifteen years ago, in [20], the authors reviewed the results of a survey held among 297 mobile subscribers. A large portion of the respondents (42%) believed that the PIN provides an adequate level of security, yet those who were using a PIN were not using it properly: 45% of the respondents had never changed their code and 42% had changed it only once. Another part of the survey showed that 83% of the respondents thought that some form of biometric authentication would be a good idea. An alternative to the PIN is the so-called PIP (Personal Identification Pattern) [25]. It represents a rectangular grid, each cell of which is filled with a digit from 0 to 9. The user has to memorize an ordered sequence of four cell positions within the grid - known as the user's PIP. Every time the user has to authenticate, he/she sees which digits fall within the four cells of the PIP pattern - this is how the four-digit PIN is obtained. There are 390625 possible PIPs that a user might choose in a 5 × 5 grid; this number represents the ways of choosing four cells in the grid when repeated use of individual cells is allowed. If repeated use of cells is not permitted, there are still 303600 possible PIPs, the author mentions. The traditional PIN system has just 10000 possible four-digit permutations. Banks are introducing so-called TANs (Transaction Authentication Numbers), seen as the only possible solution to piggyback hacking; in other words, a one-time PIN is used for each online transaction.
3.3 Image Recognition
According to [24], up to 2007 most mainstream image authentication schemes were still unable to detect burst bit errors. The paper proposes the usage of a Hamming code, a Torus automorphism, and a bit rotation technique for tamper-proofing. Although this practical approach is not directly related to user authentication, it can be used when transferring images (used for authentication) over a noisy communication channel.
The authors of [38] state that the most widely advocated graphical password systems are based on recognition rather than recall, and they try to determine whether the use of recognition in text-based authentication systems could improve their usability. The research in [66] presents a novel mechanism for user authentication that is secure and usable regardless of the size of the device on which it is implemented; the approach relies on the human ability to recognize a degraded version of a previously seen image. The research in [43] presents the implementation and initial evaluation of a picture password scheme for mixed reality. To the best of the authors' knowledge, such an attempt is the first of its kind, aiming to investigate where picture passwords are suitable within such a context. The literature still lacks a thorough review of the recent advances in biometric authentication for secure and privacy-preserving identification, as we see in [9]. In that paper, the authors review and classify the existing biometric authentication systems. They also analyze the differences and summarize advantages and disadvantages based on specific criteria; the defined levels of measure are accuracy, efficiency, usability, security, and privacy. The existing biometric authentication systems - face recognition, iris recognition, fingerprint/palmprint recognition, electrocardiographic (ECG) signals, voice recognition, keystroke and touch dynamics - are compared against the defined levels above. It is worth noting that, according to their summary review, different authors classify those methods into different quality levels. One example is the quality level for efficiency: high (the time cost is < 1 s), medium (the time cost is between 1 s and 3 s), low (the time cost is more than 3 s). Another example: the security characteristic of iris recognition is classified as high by some authors and as medium by others. A classification and analysis of fingerprinting methods is proposed in [63] to improve (to augment) Web authentication. Face recognition algorithms are attractive to researchers. Face recognition is used to identify or to verify a person's identity; the algorithms take input data from face biometrics and match it against known faces in a database. The authors of [10] review five dictionary learning algorithms - shared dictionary, class-specific dictionary, commonality and particularity dictionary, auxiliary dictionary, and domain adaptive dictionary. In [11], face recognition algorithms are categorized into two main groups - single modal (visual face recognition, 3D face recognition, infrared face recognition) and multimodal (visual + 3D, visual + IR, visual + IR + 3D). Together with the advancement of face recognition algorithms, researchers work in the field of biometric antispoofing methods. A very detailed survey of antispoofing methods can be found in [12]. Multimodality is also being explored and reviewed as an antispoofing technique, considering the combination of face and voice. Some of the authors discussed in [12] propose unusual methods using imaging technology outside of the visual spectrum - complementary infrared (IR) or near-infrared (NIR) images, which are claimed to provide sufficient information to distinguish between identical twins. The authors of [64] believe they have presented the first study to explore secure and usable authentication for headwear devices using brainwave biometrics.
Another research work [67] in a similar area, VR (Virtual Reality) applications, evaluates a head- and body-movement-based continuous authentication system.
Interesting research on a time-based one-time password (TOTP) system that requires no secrets on the server is presented in [57]. The starting point of the study is a classic one-time password system called S/Key, which is not time-based; it is modernized to make it time-based (T/Key) and, of course, the resulting security challenges are addressed. QR (Quick Response) codes are used in the authentication process as a communication channel from the secure mobile device to the authenticating device. To defend against real-time man-in-the-middle phishing attacks, [75] proposes a geolocation-based QR-code authentication scheme using mobile phones. The authors combine the login history of the user's computer with location-based mutual authentication; the assumption here is that the user already has an account on the web server. The proposed scheme compares the geo-location information of the mobile phone (GPS coordinates) and of the user's computer (geo-location extracted from the IP address of the user's computer). The authors of [76] propose the generation of a so-called "ticket" (from their WebTicket system) - an encrypted code (a 2D barcode or QR code) which includes the URL of a web site, a user ID and a password. This ticket can be either printed or stored on a mobile phone. To log into an account, the user has to show the ticket in front of the computer's webcam. The data is encrypted using a key stored on the user's computer, so the authors transform knowledge-based authentication into token-based authentication. It is worth mentioning that the authors designed this system with some important goals in mind - reliable login, strong passwords (random and different for each account), low cost and others. In [54], a new way to optically hide a QR code inside a tile-based color picture is described. Each AR marker is built from hundreds of small tiles, and the different gaps between the tiles are used to determine the elements of the hidden QR code. The CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) test is used in computing to determine whether or not the user is human - in other words, to improve the security of user authentication methods. A mechanism for adjusting the verification to distinguish between malware and humans is proposed in [26] and is called Invisible CAPTCHA; when the test starts, Invisible CAPTCHA checks whether the smartphone is in the user's hand.
3.4 OTP (One-Time Password)
We find in 1969 a paper [13] with an excellent bibliography - [14] and [15]. Both documents discuss one-time passwords. The authors of [15] talk about a randomly selected list of passwords used in a "one-time-use" manner, where a new word is taken from the list each time authentication is needed. One copy of such a list would be stored in the processor (a server or a node to which the user is about to authenticate/log in), the other maintained in the terminal or carried by the user. A similar idea is found in [14] - to issue the user of a remote terminal a one-time list of five-letter groups; the user would be required to use each group, in turn, once and only once. Later, in 1981, [16] extends the idea of using a list of passwords. The author proposes to store the generated value y = F(x) for a given password x. The user sends a value x to the system, which produces the value y from x; for that reason, the function F(x) must be a secure one-way encryption function. The author goes further by proposing the use of a list
of 1000 passwords to prevent eavesdropping: every password is to be used just once, and after all passwords have been used, another list of passwords must be generated. There are other research works in this area that improve Lamport's OTP authentication algorithm with the usage of hash chains and with a combination of different methods - [5, 17].
3.5 Two-Factor Authentication (2FA) and MFA
In [45], a method for authentication is proposed that utilizes graphical images (so-called graphical passwords) as an alternative to text passwords, together with the exchange of an encrypted key transferred via the user's e-mail. Two-factor and multi-factor authentication started to be widely and extensively used in the last few years with the advent of cloud computing - [49–51, 56]. Google Authenticator, or 2-step verification, is a software-based technique that provides a second layer of security [48]. The application generates two-step verification codes that can be used in addition to the account password. Another example of 2-step authentication is the usage of a hardware token (like RSA SecurID), which generates a new verification code (a six-digit number) at fixed intervals - usually every 30–60 s. A key characteristic of such an implementation is that the server's clock must be synchronized with the authentication token's built-in clock. There are many other software equivalents of Google Authenticator - such as Authy, Duo Security, RSA SecurID Access, Azure Multi-Factor Authentication, LastPass, Ping Identity, AuthPoint Multi-Factor Authentication, etc. Once configured, their usage is straightforward, and there are compiled versions for almost all commonly used mobile phone operating systems; supporting different platforms makes them universal OTP software tokens. A gesture-based pattern lock system is an excellent example of multi-factor authentication using touch dynamics on a smartphone [52]. More generally, the authors discuss a guided approach to behavioral authentication.
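To make the time-based verification codes described above more concrete, the following is a minimal RFC 6238-style TOTP sketch (an illustration only, not the implementation of Google Authenticator or of any other cited product; the shared secret, time step and code length are example values):

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, for_time: float = None, step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password (HOTP over the current time step)."""
    if for_time is None:
        for_time = time.time()
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int(for_time // step)                      # number of elapsed 30 s windows
    msg = struct.pack(">Q", counter)                     # 8-byte big-endian counter
    mac = hmac.new(key, msg, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                              # dynamic truncation (RFC 4226)
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)

# Both sides derive the same 6-digit code as long as their clocks agree,
# which is why the server and token clocks must be synchronized.
shared_secret = "JBSWY3DPEHPK3PXP"   # example base32 secret
print(totp(shared_secret))
```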
3.6 Certificate-Based
The techniques for securely exchanging keys are fundamental for enabling secure communication over public, insecure networks. There are research works in this field proposing different methods to secure the key exchange - from using weak passwords [71, 73] and strong passwords [72] to multi-factor password-authenticated key exchange [74].
3.7 Blockchain
There are research papers around the blockchain technology, which is gaining popularity, being used to support the authentication of IoT (Internet of Things) devices - [58]. The authors propose a method for securing the IoT messaging protocol MQTT (Message Queueing Telemetry Transport) with the help of an OTP-authentication scheme over Ethereum, implementing an independent logic channel for second-factor authentication. Another usage of blockchain technology, in the authentication of multi-server environments, is described in [59]. The authors use a blockchain to construct a privacy-awareness authentication scheme for multi-server environments, which can achieve a distributed registry and efficient revocation. [60] proposes an authentication scheme for blockchain-based EHR (Electronic Health Records). Improvements in authentication with blockchain in the field of the Internet of Vehicles are proposed in [61, 62].
3.8 Suitable Authentication Techniques, Depending on the Age
In [1], we find research on the design of an easy-to-memorize graphical password system, which is explicitly intended for older users (over the age of 60) and achieves a level of password entropy comparable to traditional PINs and text passwords. The research resulted from a survey of older computer users, for whom creating and using complex text passwords is challenging. The resulting graphical passwords consist of 4 sets of images (with dimensions 3 × 5, 4 × 4, 5 × 5, and 6 × 6 small portrait pictures), each having one portrait pre-selected by the user and displayed in a random order every time a login (authentication) attempt starts. Another study referenced in [1] shows that older users perform better at memorizing age-appropriate materials. The personalized password technique proposed in that paper is usable with any mobile device (or tablet) with a touchscreen, to support people with finger disabilities. [2] finds a noticeable difference between theoretical and actual behavior in credential creation and reuse in the adults' group; in other words, there is a need to improve adult practices in terms of authentication. Another research work based on the users' age is [2]: the author conducts semi-structured interviews with 22 children (5–11 years old) and an online survey with 33 adults (parents and teachers). A large part of the children reused their credentials when asked to enter new credentials. Most of the interviewed children created credentials that were self-related or even duplicated from other logins that they have, and yet wrote them down on a piece of paper to increase memorability. [37] is another research work with young users, which studied textual passwords with children aged 7–8.
3.9 Authentication Techniques for People with Different Disabilities
An authentication method called "Passface" requires users to remember a sequence of five faces as their password, rather than alphanumeric characters. In some cases this may be suitable for the vast majority of users, but it would be difficult for persons with visual disabilities to use this type of system. The goal of the research in [42] is to demonstrate that a similar and useful concept can be implemented using sounds instead of faces.
4 Classifications
We propose the following classification of authentication methods based on all extracted research papers. The data extraction process was to search for the needed information in their metadata and, on that basis, to prepare the classifications; but we also went deeper into the individual works for the detailed explanations of the methods. In general, the places where human-to-device and device-to-device authentication is required can be classified as follows:
• Applications for internal/external use
  – Client-Server (Thin/Thick clients)
  – Mobile device (mobile phone, tablet, smartwatch)
  – Web-based
  – Databases
• Websites
• IoT devices
• Servers
• Network devices
• Services
In Fig. 5 are shown the general authentication techniques as extracted and grouped from the papers.
Fig. 5. The general authentication techniques.
Authenticators are well described in [19] and extended in this survey. We added some additional updates as per [20] – ear shape and handwriting recognition. This classification is presented in Fig. 6.
Fig. 6. Classification of the authenticator types.
5 Results, Discussions and Conclusions
It is not possible to rely on just one or two methods for authentication in today's systems; the required level of security justifies the means. Many factors are involved - age groups of the users; user groups with different disabilities; private or public access; anything from a static web hosting server to a complex web-based multitenant mobile payment system; development, test and production environments; usage of private, public or hybrid cloud computing; governmental and regulatory acts and standards; company policies; and especially learning platforms for data analysis such as APTITUDE. The survey in [20], held in 2005, found a significant portion of users who do not enable any protection at all, while some of them encounter related problems. Most of the users demonstrated their willingness to consider biometrics. Based on those findings, it is evident that alternative and stronger authentication approaches are required for mobile devices. The research work [3] reviews authentication options for IoT devices. The authors also spend some effort describing the gaps and opportunities in current IoT authentication techniques: weak transport encryption, password limitations, faulty or complicated IoT systems, financial implications, insecure interfaces, broken authentication and authorization, and security flaws in device software and firmware. An efficient and secure authentication scheme for machine-to-machine (M2M) networks in IoT-enabled cyber-physical systems is proposed in [69]; this method is applicable in point-to-point decentralized networks. The research [70] is relevant to the diverse range of IoT devices such as mobile phones, smartwatches and wearable sensors, and is based on more in-depth gait analysis with the least number of features and data for authentication purposes. Technology development brings new opportunities for extending and improving authentication. The authors of [5] propose a method for authentication using a combination of OTP and hash-chain generated passwords (with SHA2 and SHA3 hash functions), where the generated password is different for every authentication/login attempt and is not stored anywhere. An approach for authentication based on blockchains is given in [6]. We believe that, for improving authentication, the research focus should be placed not just on finding alternatives, but on "smart enough" methods and algorithms that combine multiple heterogeneous authentication mechanisms and adapt their usage to the appropriate situation(s). With cost in mind: approximately half of the technical support calls made to IT help desks concern forgotten passwords [21]. The issues with authentication might have significant financial implications. Organizations have to decide between the ongoing (but in general hidden) helpdesk costs for password-related issues and the upfront technology deployment plus additional training costs that may be incurred by moving to other methods. In the future, the results of such evaluations will change as the technologies develop and provide new possibilities - [27]. It is evident from their findings that no single method offers all possible features in authentication. The papers on security improvements for different ages, their behavior and their ability to remember are becoming more and more relevant with the advent of technology in our everyday life. Password policies define the rules that enforce the validity periods and complexity of passwords. New and robust authentication mechanisms are being developed and used. We, as users and learners, need to continuously change our behavior and even our minds to meet these increasing security demands, in order to secure and protect our data, our business, our digital identities, and our lives.
Acknowledgment. The research reported here was funded by the project "An innovative software platform for big data learning and gaming analytics for a user-centric adaptation of technology-enhanced learning (APTITUDE)" - research projects on the societal challenges - 2018 by Bulgarian National Science Fund with contract №: KP-06OPR03/1 from 13.12.2018.
References 1. Carter, N.: Graphical passwords for older computer users. In: UIST 2015 Adjunct, Charlotte, NC, USA, 08–11 November 2015. ACM. https://doi.org/10.1145/2815585.2815593. 978-14503-3780-9/15/11 2. Ratakonda, D.K.: Children’s authentication: understanding and usage. In: IDC 2019, Boise, ID, USA, 12–15 June 2019. ACM. https://doi.org/10.1145/3311927.3325354. ISBN 978-14503-6690-8/19/06 3. Atwady, Y., Hammoudeh, M.: A survey on authentication techniques for the Internet of Things. In: ICFNDS 2017, Cambridge, United Kingdom, 19–20 July 2017 (2017). https:// doi.org/10.1145/3102304.3102312 4. Shah, S.W., Kanhere, S.S.: Recent trends in user authentication – a survey. IEEE Access (2019). https://doi.org/10.1109/ACCESS.2019.2932400
5. Chenchev, I., Nakov, O., Lazarova, M.: Security and performance considerations of improved password authentication algorithm, based on OTP and hash-chains. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) FTC 2020. AISC, vol. 1290, pp. 921–934. Springer, Cham (2021). https:// doi.org/10.1007/978-3-030-63092-8_63 6. Salman, T., Zolanvari, M., Erbad, A., Jain, R., Samaka, M.: Security services using blockchains: a state of the art survey. IEEE Commun. Surv. Tutor. 21(1) (2019). https:// doi.org/10.1109/COMST.2018.2863956 7. Zimmermann, P.R.: The Official PGP User’s Guide. MIT Press, USA, May 1995. ISBN 978-0-262-74017-3 8. (RFC4880), [RFC4880] https://tools.ietf.org/html/rfc4880 9. Rui, Z., Yan, Z.: A survey on biometric authentication: toward secure and privacy-preserving identification. IEEE Access (2018). https://doi.org/10.1109/ACCESS.2018.2889996 10. Xu, Y., Li, Z., Yang, J., Zhang, D.: A survey of dictionary learning algorithms for face recognition. IEEE Access (2017). https://doi.org/10.1109/ACCESS.2017.2695239 11. Zhou, H., Mian, A., Wei, L., Creighton, D., Hossny, M., Nahavandi, S.: Recent advances on singlemodal and multimodal face recognition: a survey. IEEE Trans. Hum.-Mach. Syst. 44(6) (2014). https://doi.org/10.1109/THMS.2014.2340578 12. Galbally, J., Marcel, S., Fierrez, J.: Biometric antispoofing methods: a survey in face recognition. IEEE Access (2014). https://doi.org/10.1109/ACCESS.2014.2381273 13. Hoffman, L.J.: Computers and privacy: a survey. Comput. Surv. 1(2), 85–103 (1969). Article found in ACM Digital Library 14. Peters, B.: Security considerations in a multi-programmed computer system. In: Proceedings of the AFIPS 1967 Spring Joint Computer Conference, vol. 30, pp. 283–286. Thompson Book Co., Washington, D.C. (1967) 15. Petersen, H.E., Turn, R.: System implications of information privacy. In: Spring Joint Computer Conference, 17–19 April 1967, vol. 30, pp. 291–300. Thompson Book Co., Washington, D.C. (1967). (Also available as Doc. P-3504, Rand Corp., Santa Monica, California, Apr. 1967) 16. Lamport, L.: Password authentication with insecure communication. Commun. ACM 24(11), 770–772 (1981) 17. Park, C.-S.: One-time password based on hash chain without shared secret and re-registration. Compt. Secur. 75, 138–146 (2018) 18. O’Gorman, L.: Comparing passwords, tokens, and biometrics for user authentication. Proc. IEEE 91(12), 2019–2040 (2003) 19. Arias-Cabarcos, P., Krupitzer, C., Becker, C.: A survey on adaptive authentication. ACM Comput. Surv. 52(4) (2019). https://doi.org/10.1145/3336117. Article no. 80, 30 pages 20. Clarke, N.L., Furnell, S.M.: Authentication of users on mobile telephones – a survey of attitudes and practices. Comput. Secur. 24, 519–527 (2005). https://doi.org/10.1016/j.cose. 2005.08.003 21. Furnell, S.: Authenticating ourselves: will we ever escape the password? Netw. Secur. 2005, 8–13 (2005) 22. Liao, I.-E., Lee, C.-C., Hwang, M.-S.: A password authentication scheme over insecure networks. J. Comput. Syst. Sci. 72, 727–740 (2006). https://doi.org/10.1016/j.jcss.2005. 10.001 23. Yen, S.-M., Liao, K.-H.: Shared authentication token secure against replay and weak key attacks. Inf. Process. Lett. 62, 77–80 (1997) 24. Chan, C.-S., Chang, C.-C.: An efficient image authentication method based on Hamming code. Pattern Recogn. 40, 681–690 (2007). https://doi.org/10.1016/j.patcog.2006.05.018 25. Gold, S.: Password alternatives. Network Security. Elsevier, September 2010
26. Guerar, M., Merlo, A., Migliardi, M., Palmieri, F.: Invisible CAPTCHA: a usable mechanism to distinguish between malware and humans on the mobile IoT. Comput. Secur. 78, 255–266 (2018). https://doi.org/10.1016/j.cose.2018.06.007 27. Halunen, K., Haikio, J., Vallivaara, V.: Evaluation of user authentication methods in the gadget-free world. Pervasive Mob. Comput. 40, 220–241 (2017). https://doi.org/10.1016/j. pmcj.2017.06.017 28. Dossogne, J., Lafitte, F.: On authentication factors: “what you can” and “how you do it”. In: SIN 2013, Aksaray, Turkey, 26–28 November 2013. ACM (2013). https://doi.org/10.1145/ 2523514.2523528 29. Peiset, S., Talbot, E., Kroeger, T.: Principles of authentication. In: NSPW 2013, Banff, Canada, 9–12 September 2013. ACM (2013). https://doi.org/10.1145/2535813.2535819 30. Singh, K.: On improvements to password security. ACM (1985) 31. Morris, R., Thompson, K.: Password security: a case history. Commun. ACM 22(11), 594–597 (1979) 32. Monrose, F., Reiter, M.K., Wetzel, S.: Password hardening based on keystroke dynamics. In: CCS 1999, 11/99, Singapore. ACM (1999) 33. Shay, R., et al.: Encountering stronger password requirements: user attitudes and behaviors. In: Symposium on Usable Privacy and Security (SOUPS) 2010, Redmond, WA USA, 14–16 July 2010. ACM (2010) 34. Halderman, J.A., Waters, B., Felten, E.W.: A convenient method for securely managing passwords. In: International World Wide Web Conference Committee (IW3C2) 2005, Chiba, Japan, 10–14 May 2005. ACM (2005) 35. Garrison, C.P.: Encouraging good passwords. In: InfoSecCD Conference 2006, Kennesaw, GA, USA, 22–23 September 2006. ACM (2006) 36. Fordyce, T., Green, S., Gros, Th.: Investigation of the effect of fear and stress on password choice. In: 7th ACM Workshop on Socio-Technical Aspects in Security and Trust, Orlando, Florida, USA, (STAST 2017), December 2017 (2017). https://doi.org/10.475/123_4 37. Read, J.C., Cassidy, B.: Designing textual password systems for children. In: IDC 2012, Bremen, Germany, 12–15 June 2012 (2012) 38. Wright, N., Patrick, A.S., Biddle, R.: Do you see your password? Applying recognition to textual passwords. In: Symposium on Usable Privacy and Security (SOUPS) 2012, Washington, DC, USA, 11–13 July 2012 (2012) 39. Florencio, D., Herley, C.: A large-scale study of web password habits. In: International World Wide Web Conference Committee (IW3C2) 2007, Banff, Alberta, Canada, 8–12 May 2007. ACM (2007) 40. Lee, C., Lee, H.: A password stretching method using user specific salts. In: WWW 2007, Banff, Alberta, Canada, 8–12 May 2007. ACM (2007) 41. Korkmaz, I., Dalkilic, M.E.: The weak and the strong password preferences: a case study on Turkish users. In: SIN 2010, Taganrog, Rostov-on-Don, Russian Federation, 7–11 September 2010. ACM (2010) 42. Brown, M., Doswell, F.R.: Using passtones instead of passwords. In: ACMSE 2010, Oxford, MS, USA, 15–17 April 2010. ACM (2010) 43. Hadjidemetriou, G., et al.: Picture passwords in mixed reality: implementation and evaluation. In: CHI 2019 Extended Abstracts, Glasgow, Scotland UK, 4–9 May 2019. ACM (2019). https://doi.org/10.1145/3290607.3313076 44. Houshmand, S., Aggarwal, S.: Building better passwords using probabilistic techniques. In: ACSAC 2012, Orlando, Florida, USA, 3–7 December 2012. ACM (2012) 45. Manjula Shenoy, K., Supriya, A.: Authentication using alignment of the graphical password. In: ICAICR 2019, Shimla, H.P., India, 15–16 June 2019. ACM (2019). https://doi.org/10. 1145/3339311.3339332
46. Chang, Y.-F., Chang, C.-C.: A secure and efficient strong-password authentication protocol. ACM SIGOPS Oper. Syst. Rev. (2004). https://doi.org/10.1145/1035834.1035844 47. Schneier, B.: Sensible authentication. ACM Queue 10, 74–78 (2004) 48. Alhothaily, A., et al.: A secure and practical authentication scheme using personal devices. IEEE Access 5 (2017). https://doi.org/10.1109/ACCESS.2017.2717862 49. Derhab, A., et al.: Two-factor mutual authentication offloading for mobile cloud computing. IEEE Access 8, 28956–28969 (2020) 50. Siddiqui, Z., Tayan, O., Khan, M.K.: Security analysis of smartphone and cloud computing authentication frameworks and protocols. IEEE Access 6, 34527–34542 (2018) 51. Mohsin, J.K., Han, L., Hammoudeh, M.: Two factor vs multi-factor, an authentication battle in mobile cloud computing environments. In: ACM ICFNDS 2017, Cambridge, United Kingdom, 19–20 July 2017 (2017). https://doi.org/10.1145/3102304.3102343 52. Ku, Y., Park, L.H., Shin, S., Kwon, T.: POSTER: a guided approach to behavioral authentication. In: CCS 2018, Toronto, ON, Canada, 15–19 October 2018. ACM (2018). https://doi. org/10.1145/3243734.3278488 53. Gong, C., Behar, B.: Understanding password security through password cracking. JCSC 33(5), 81–87 (2018) 54. Nguyen, M., Tran, H., Le, H., Yan, W.Q.: A tile based color picture with hidden QR code for augmented reality and beyond. In: VRST 2017, Gothenburg, Sweden, 8–10 November 2017. ACM (2017). https://doi.org/10.1145/3139131.3139164 55. Shay, R., et al.: Can long passwords be secure and usable? In: CHI 2014, Toronto, ON, Canada, 26 April–01 May 2014. ACM (2014). https://doi.org/10.1145/2556288.2557377 56. Abuarqoub, A.: A lightweight two-factor authentication scheme for mobile cloud computing. In: ICFNDS 2019, Paris, France, 1–2 July 2019. ACM (2019). https://doi.org/10.1145/334 1325.3342020 57. Kogan, D., Manohar, N., Boneh, D.: T/Key: second-factor authentication from secure hash chains. In: CCS 2017, Dallas, TX, USA, 30 October–3 November 2017. ACM (2017). https:// doi.org/10.1145/3133956.3133989 58. Buccafurri, F., Romolo, C.: A blockchain-based OTP-authentication scheme for constrained IoT devices using MQTT. In: ISCSIC 2019, Amsterdam, Netherlands, 25–27 September 2019. ACM (2019). https://doi.org/10.1145/3386164.3389095 59. Xiong, L., Li, F., Zeng, S., Peng, T., Liu, Z.: A blockchain-based privacy-awareness authentication scheme with efficient revocation for multi-server architectures. IEEE Access 7 (2019). https://doi.org/10.1109/ACCESS.2019.2939368 60. Tang, F., Ma, S., Xiang, Y., Lin, C.: An efficient authentication scheme for blockchain-based electronic health records. IEEE Access 7 (2019). https://doi.org/10.1109/ACCESS.2019.290 4300 61. Wang, X., et al.: An improved authentication scheme for internet of vehicles based on blockchain technology. IEEE Access 7 (2019). https://doi.org/10.1109/ACCESS.2019.290 9004 62. Tan, H., Chung, I.: Secure authentication and key management with blockchain in VANETs. IEEE Access 8 (2020). https://doi.org/10.1109/ACCESS.2019.2962387 63. Alaca, F., van Oorschot, P.C.: Device fingerprinting for augmenting web authentication: classification and analysis of methods. In: ACSAC 2016, Los Angelis, CA, USA, 05–09 December 2016. ACM (2016). https://doi.org/10.1145/2991079.2991091 64. Lin, F., et al.: Brain password: a secure and truly cancelable brain biometrics for smart headwear. In: MobiSys 2018, Munich, Germany, 10–15 June 2018. ACM (2018). https://doi. org/10.1145/3210240.3210344 65. 
Chuda, D., Durfina, M.: Multifactor authentication based on keystroke dynamics. In: International Conference on Computer Systems and Technologies – CompSysTech 2009. ACM (2009)
66. Hayashi, E., Christin, N.: Use your illusion: secure authentication usable anywhere. In: Symposium on Usable Privacy and Security (SOUPS) 2008, Pittsburgh, PA, USA, 23–25 July 2008. ACM (2008) 67. Mustafa, T., et al.: Unsure how to authenticate on your VR headset? Come on, use your head! In: Authentication, Software, Vulnerabilities, Security Analytics, IQSPA 2018, Tempe, AZ, USA, 21 March 2018. ACM (2018). https://doi.org/10.1145/3180445.3180450 68. Forget, A., Chiasson, S., Biddle, R.: Choose your own authentication. In: NSPW 2015, Twente, Netherlands, 08–11 September 2015. ACM (2015). https://doi.org/10.1145/2841113. 2841114 69. Renuka, K., Kumari, S., Zhao, D., Li, L.: Authentication scheme for M2M networks in IoT enabled cyber-physical systems. IEEE Access 7 (2019). https://doi.org/10.1109/ACCESS. 2019.2908499 70. Batool, S., Hassan, A., Saqib, N., Khattak, M.: Authentication of remote IoT users based on deeper gait analysis of sensor data. IEEE Access 8 (2020). https://doi.org/10.1109/ACCESS. 2020.2998412 71. Katz, J., Ostrovsky, R., Yung, M.: Efficient and secure authentication key exchange using weak passwords. J. ACM 57(1) (2009). https://doi.acm.org/10.1145/1613676.1613679. Article no. 3 72. Jablon, D.: Strong password-only authenticated key exchange. ACM SIGCOMM Comput. Commun. Rev. (1996) 73. Halevi, S., Krawczyk, H.: Public-key cryptography and password protocols. ACM Trans. Inf. Syst. Secur. 2(3), 230–268 (1999) 74. Stebila, D., Udupi, P., Chang, S.: Multi-factor password-authenticated key exchange. In: Proceedings of the 8th Australasian Information Security Conference (AISC 2010), Brisbane, Australia. CRPIT Volume 105 – Information Security 2010. ACM (2010) 75. Kim, S.-H., Choi, D., Jin, S.-H., Lee, S.-H.: Geo-location based QR-code authentication scheme to defeat active real-time phishing attack. In: DIM 2013, Berlin, Germany, 08 November 2013. ACM Workshop on Digital Identity Management (2013). https://doi.org/10.1145/ 2517881.2517889 76. Hayashi, E., et al.: Web ticket: account management using printable tokens. In: CHI 2012, SIGCHI Conference on Human Factors in Computing Systems, May 2012, pp. 997–1006. ACM (2012). https://doi.org/10.1145/2207676.2208545
ARCSECURE: Centralized Hub for Securing a Network of IoT Devices Kavinga Yapa Abeywardena, A. M. I. S. Abeykoon, A. M. S. P. B. Atapattu(B), H. N. Jayawardhane, and C. N. Samarasekara Information System Engineering, Sri Lanka Institute of Information Technology, Malabe, Sri Lanka [email protected]
Abstract. IoT has been a game changer in the advancement of technology. In the current context, the major issue that users face is the threat to the information stored on these devices. Modern-day attackers are aware of the vulnerabilities existing in the current IoT environment. Therefore, preventing information from falling into the hands of unauthorized parties is a top priority. With the need to secure information came the need to protect the devices on which the data is stored. Small Office/Home Office (SOHO) environments working with IoT devices are particularly in need of such a mechanism to protect the data and information that they hold in order to sustain their operations. Hence, in order to come up with a well-rounded security mechanism covering every possible aspect, this research proposes a plug-and-play device, "ARCSECURE". Keywords: Internet of Things · Information security · Machine learning · DoS · DDoS · Botnet · Authentication · Authorization · Detection · Mitigation · Malware
1 Introduction
Protecting information and providing security for information is a major concern of any user. In a business environment, the loss or unauthorized modification of information/data could put all of the business's functions on hold and even create great financial and reputational losses. In a Small Office/Home Office (SOHO) environment, the organization will hold various information related to the organization, its customers, suppliers, etc. If one of these devices is compromised, the whole network of devices will be at risk of being attacked; therefore, each and every one of the devices within the network should be protected to guarantee that the network is safeguarded. Malware such as Trojan horses and spyware could easily go undetected, and the user could lose information without their knowledge. Many attacks go undetected due to the lack of expertise among users to detect suspicious behavior, and in some cases, even if the user is able to identify anomalies in the functionality of the system, it can be hard to come up with a mitigation option without the proper knowledge and expertise, because simply shutting down the machine will not help in this case. Therefore, it is essential that better mechanisms are put in place to immediately detect and address such attacks. Even if such attacks are detected and stopped from entering the system at an early stage, there can be instances where the authentication stage is compromised, for example by brute forcing a way into the system. Hence, all possibilities should be considered when coming up with a well-rounded solution for the business. Another aspect that does not gain as much attention concerns the initial stage, when a user installs the relevant software for the devices that they have purchased. Here a user might be directed to a cracked version of the software which the user believes to be legitimate, yet it might be a file that includes malware such as ransomware. If the user downloads this executable and runs it, the whole system will be compromised, and it might result in the user losing all his/her important data, which in a business environment could be crucial. Therefore, taking all the abovementioned aspects into account is a necessity when coming up with a solution to protect business functions in a SOHO environment working with a network of IoT devices. Hence, the device proposed by this research, "ARCSECURE", addresses all these issues in order to provide a comprehensive solution that is well suited to a SOHO environment.
2 Literature Review
Stepping into the future of technology, IoT - as suggested by Haller et al., "A world where physical objects are seamlessly integrated into the information network, and where the physical objects can become active participants in business process" - carries many security problems [1]. First and foremost among the issues are vulnerabilities produced by poor program design, which in return create backdoor installation and malware insertion opportunities for attackers [1]. Overlooking these security issues has the ability to compromise the availability [1] as well as the integrity and confidentiality of IoT. As discussed by the Open Web Application Security Project (OWASP), insecure software/firmware is among the top 10 vulnerabilities identified for the IoT architecture [2]. There can be instances where an attacker disguises an old version of software which contains a security vulnerability as the latest version offered by the vendor [3]. This would revert the system back to a faulty version, giving the attacker access [3]. Faulty software can involuntarily leave the hardware vulnerable [4]. When it comes to authentication and authorization, keystroke dynamics have been researched for a long time. In 2002, Bergadano et al. [5] researched keystroke dynamics for user authentication using the volunteers' self-collected dataset, with data collected using the same text for all individuals, resulting in only 0.01 percent of impostors passing authentication. In 2003, Yu and Cho [6] performed preliminary experimental research on feature subset selection for keystroke dynamics identity verification and found that GA-SVM yielded good accuracy and learning speed. Revett et al. [7] researched user authentication in 2007 and began researching authentication with dynamic keystrokes. The authors suggest that biometrics are robust, specifically fingerprints, but that they can be easily spoofed. In their research paper, Shivaramu K. and Prasobh P.S. discuss security vulnerabilities of wireless ad-hoc networks, which are vulnerable due to cooperative algorithms, the lack of monitoring and management, and the lack of a clear line of defense [7]. An important conclusion drawn from the experimental results is that, of the various methods used, Fuzzy c-means clustering is very efficient in detecting DDoS attacks [8]. Going into another prominent attack type, it is stated that recent botnet attacks show a high vulnerability of IoT systems and devices. On this basis, the literature has been reviewed. [9] explains common IoT vulnerabilities such as insecure web, cloud or mobile interfaces, insufficient authorization and authentication methods, insecure network services, and software and firmware that contain vulnerabilities.
3 Methodology
3.1 Detection of IoT Botnet Attacks
This project involves building a host-based IDS for SOHO networks. The host-based IDS is divided into two subcategories: the Packet Capture module and the Botnet Prevention System (BPS).
Packet Capture Module
The Packet Capture module contains three sub-modules:
• Packet Capturer: used to read the data of incoming packets. Libraries: pyshark, afpacket and tshark (Wireshark API). When a new packet comes into the network, the packet capturer gathers the raw data and sends it to the packet parser.
• Packet Parser: used to retrieve the metadata out of the raw data that came from the packet capturer. Libraries: tshark.
• Feature Extractor (FE): this module is used to extract n features from the incoming packets to create the instance x (xi ∈ R). The instance x holds the attributes of the packet; the n features comprise 115 traffic statistics. For network anomaly detection, it is important to extract features from every packet coming through the packet parser, so we implemented a solution for high-speed feature extraction. This method has O(1) update complexity because of the incremental statistics. These incremental statistics are maintained over a damped window (only recent behavior is captured). The method also uses 2D statistics, which help to relate the rx and tx network traffic. The memory requirement and runtime complexity of the method are O(n), and because of the damping the weight of older values decreases over time. The following features are extracted (a small illustrative sketch of such a damped statistic is given after the list):
1. SrcMAC-IP: MAC and IP address of the packet source
2. SrcIP: IP address of the source
3. Channel: channel used between sender and receiver
4. Socket: socket of the packet sender and receiver (TCP/UDP)
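The damped incremental statistics mentioned above can be illustrated with a small sketch (a simplified illustration of the general technique, not the authors' implementation; the half-life value is an example). Keeping a decayed weight, a decayed sum and a decayed squared sum is enough to return the count, mean and variance of a stream in O(1) per packet:

```python
import math

class DampedIncStat:
    """Incremental (count, mean, variance) over a damped window.

    Older observations are exponentially down-weighted with half-life
    `half_life` seconds, so only recent behaviour is captured.
    """

    def __init__(self, half_life: float):
        self.decay = math.log(2) / half_life
        self.w = 0.0       # decayed weight (acts as the count)
        self.s1 = 0.0      # decayed sum of values
        self.s2 = 0.0      # decayed sum of squared values
        self.t_last = None

    def update(self, value: float, t_now: float):
        if self.t_last is not None:
            factor = math.exp(-self.decay * (t_now - self.t_last))
            self.w, self.s1, self.s2 = self.w * factor, self.s1 * factor, self.s2 * factor
        self.t_last = t_now
        self.w += 1.0
        self.s1 += value
        self.s2 += value * value

    def stats(self):
        mean = self.s1 / self.w
        var = max(self.s2 / self.w - mean * mean, 0.0)
        return self.w, mean, var

# e.g. one statistic per (SrcMAC-IP, time frame), updated with each packet size
stat = DampedIncStat(half_life=10.0)
for t, size in [(0.0, 60), (0.5, 1500), (1.0, 60)]:
    stat.update(size, t)
print(stat.stats())
```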
Table 1. Features extracted
  Packet   Features extracted                   # Features
  Size     SrcMAC-IP, Src-IP, Channel, Socket   8
  Size     Channel, Socket                      8
  Count    SrcMAC-IP, Src-IP, Channel, Socket   4
  Jitter   Channel                              3
The FE extracts 23 features from a single time frame, as shown in Table 1. Five time frames are used - 1 min, 10 s, 1.5 s, 500 ms and 100 ms - totaling 115 features.
Botnet Prevention System (BPS)
The BPS acts as the core system for mitigating botnet attacks. It is based on machine learning together with the Feature Mapper (FM) and Anomaly Detection (AD) modules.
Machine Learning
For botnet detection, the machine learning work is divided into two parts:
• Data Analytics: data analysis helps to create a better design from the raw data and to quantify and track objectives.
• Machine Learning Algorithms: to create the machine learning module, the ML algorithms shown in Table 2 below were evaluated along with their accuracy; logistic regression was selected as the most accurate of all.
Table 2. Test results of the ML algorithms
  Algorithm                     Accuracy
  Decision tree                 98%
  Logistic regression           99.98%
  Perceptron                    98.73%
  Naïve bayes                   97%
  K-Nearest neighbours (KNN)    99.97%
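The comparison in Table 2 could, for example, be reproduced with scikit-learn along the following lines (a sketch only; the feature matrix X and labels y are placeholders, and the accuracies in Table 2 come from the authors' experiments, not from this snippet):

```python
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def compare_models(X, y):
    """Train each candidate classifier and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    models = {
        "Decision tree": DecisionTreeClassifier(),
        "Logistic regression": LogisticRegression(max_iter=1000),
        "Perceptron": Perceptron(),
        "Naive Bayes": GaussianNB(),
        "KNN": KNeighborsClassifier(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f"{name}: {model.score(X_te, y_te):.4f}")  # accuracy on held-out data
```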
Feature Mapper (FM)
This module is responsible for mapping the n features collected through the FE to a vector x, and then to smaller sub-instances k, one k for every encoder in the ensemble layer of the anomaly detector. Let v denote the set of sub-instances, v = {v1, v2, v3, ..., vk}. The FM has two modes:
• Train-Mode: using the input vector x, the model learns a feature map.
• Execution-Mode: the learned mapping is used to create the collection v of small instances from the vector x.
Anomaly Detection (AD)
• Train-Mode: it takes v to train the respective autoencoder in the ensemble layer. The RMSE is calculated during forward-propagation and is used in training the output layer.
• Execution-Mode: the v vectors derived from the execution phase of the Feature Mapper are executed at the respective autoencoders of the ensemble.
The reconstruction error is measured as
\mathrm{RMSE}(\vec{x}, \vec{y}) = \sqrt{\frac{\sum_{i=1}^{n} (x_i - y_i)^2}{n}}
where \vec{x} is the input instance (drawn from the same data distribution the autoencoder is trained to recreate) and \vec{y} is the reconstructed output; the RMSE is the reconstruction error of the instance.
The AD consists of two layers of autoencoders:
• Ensemble Layer: measures the independent abnormality of each sub-instance in v.
• Output Layer: outputs the final anomaly score by considering the abnormalities of the instance and the noise in the network.
Figure 1 shows the overall picture of the feature mapping process.
Fig. 1. Mapping process
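To make the anomaly-scoring step concrete, the following toy sketch computes the RMSE-based score described above (illustrative only; the real system uses an ensemble of trained autoencoders, which are represented here by placeholder functions):

```python
import numpy as np

def rmse(x: np.ndarray, y: np.ndarray) -> float:
    """Reconstruction error between an instance x and its reconstruction y."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

def anomaly_score(sub_instances, autoencoders, output_layer):
    """Ensemble layer: one RMSE per sub-instance; output layer: score of that error vector."""
    ensemble_errors = np.array(
        [rmse(v, ae(v)) for v, ae in zip(sub_instances, autoencoders)]
    )
    return rmse(ensemble_errors, output_layer(ensemble_errors))

# Tiny stand-ins for trained models: an "autoencoder" here is just a function
# mapping a vector to its reconstruction (with a small artificial error).
toy_ae = lambda v: v * 0.9
subs = [np.array([1.0, 2.0]), np.array([0.5, 0.5, 1.0])]
print(anomaly_score(subs, [toy_ae, toy_ae], toy_ae))
```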
3.2 Malware Detection and Mitigation
Detection
Feature Extraction
Machine learning will only use integer or float data types as features when it comes to detection. Data such as "entropy" is vital for detecting malware: entropy refers to the randomness or messiness of code, and if the entropy is high (high messiness/randomness), the possibility of the code carrying malware is high. Hence, the minimum, mean and maximum entropy values are extracted to determine whether a file is malicious or not. All Portable Executable (PE) features were extracted with the "pefile" reader module from the Python library.
Feature Selection
Feature selection for this model was done with the tree-based feature selection of the SciKit library. Hence, the original data, which contained 54 columns/features, was narrowed down to the 14 most important features to reduce the dimensionality of the dataset. The final features extracted were as follows. The original dataset was then split in two: 80% of the original dataset as training data and the remaining 20% as testing data.
Selection of Classification Algorithms and Training
Thenceforth, taking the training dataset into consideration, the model was trained with 5 different algorithms with the goal of choosing the best possible algorithm for the purpose. The following were used, and the results obtained are shown in Table 3. As Table 3 depicts, it is evident that the Random Forest algorithm gave the best result in detecting malicious content. Therefore, Random Forest was taken as the winning algorithm and henceforth used to train the data for this module.
Testing
In the testing phase, the entropy of the input file is calculated first. The calculation is shown below, with p(x) = Pr{X = x}, x ∈ X.
Table 3. Accuracy results
  Algorithm           Accuracy
  Decision tree       98%
  Random forest       99%
  Gradient boosting   98%
  AdaBoosting         97%
  GNB                 70%
Let X be a discrete random variable with alphabet \mathcal{X} and probability p(x). The entropy H(X) is then
H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)
(the base of the logarithm used in this module is 2).
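A minimal sketch of this entropy calculation over the byte distribution of a file could look as follows (an illustration only, not the project's code):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy (base 2) of the byte distribution, in bits per byte (0..8)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# High entropy (close to 8) often indicates packed or encrypted content,
# which is why min/mean/max entropy is used as a malware feature.
print(byte_entropy(b"AAAAAAAA"))          # 0.0 - no randomness
print(byte_entropy(bytes(range(256))))    # 8.0 - maximally random distribution
```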
Then the version information and other information, such as the size of code, base of code, etc., are taken into consideration to check whether the given file's values tally with the mean, minimum and maximum entropy values (and the other considered values) of legitimate files. By giving a few entries (a mix of both malicious and legitimate files) from the testing dataset as inputs, it was confirmed that files were predicted as "legitimate" or "malicious" with high precision.
Mitigation
The next step of the module is implementing the ML model in the web environment where the user interaction with the system takes place. When files are submitted to the web application's malware detection module, a malicious executable is identified as malicious and the ARCSECURE site displays a message saying that the file has been identified as malicious, and vice versa. In these cases, the user is prompted to either delete the file from the system or proceed with the installation. Thenceforth, ARCSECURE maintains a log of the files that have been recognized as "malicious" and "legitimate" separately, in order to prevent the user from installing malicious content in the future.
3.3 DoS Detection and Mitigation Module
Detecting DDoS threats so that IoT devices can be used and can communicate in a proper manner is the main purpose of this module.
Attack Detection Module
All incoming traffic goes through this module to complete the scanning process of this function; the module contains a signature and policy engine. DDoS attacks can be identified by inspecting changes in the system traffic. In a study performed by
Chonka et al. [10], a model that exploits the self-similarity property of network traffic is created to discover DDoS flooding traffic. A centralized node controls the parameters of the detectors. The design objective of this intrusion detection framework is to improve the overall performance of DDoS attack recognition by shortening the detection delay while increasing the detection accuracy and the speed of system communication. As the block diagram in Fig. 3 shows, the data containing ordinary traffic and DDoS attacks is processed into a few features, and the data is then passed to signature-based detector blocks to detect attacks.

Data Training Module
In this module the machine is trained using data gathered for training purposes; the second step is to test the captured data. A well-trained ML algorithm will clearly separate abnormal patterns from normal data packets and will easily identify any anomalies inside the IoT environment.

Mitigation Module
This is the final module in the framework. All incoming network traffic destined for IoT devices also passes through it: if the proposed system detects any anomalies, the packets are dropped; if they are classified as safe, they are forwarded into the IoT environment.

Machine Learning for DDoS Detection
Networks constantly face the challenge of distinguishing legitimate from malicious network packets. The Random Forest algorithm operates by constructing a multitude of decision trees at training time; logistic regression is a statistical model with more complex extensions; a standard feedforward neural network has the major drawback that information moves in only one direction, with no cycles or loops; and the Support Vector Machine (SVM) uses classification algorithms for two-group classification problems. SVMs are successful at classifying data points into their corresponding classes.

Support Vector Machine
Support Vector Machines (SVMs) are learning machines that map the training vectors into a high-dimensional feature space, labelling each vector by its class. SVM classifiers fit a hyperplane to perform a linear classification of the patterns using a kernel function. SVMs do not require a reduction in the number of features in order to avoid overfitting, an obvious advantage in applications. Another essential advantage of SVMs is the low expected probability of generalization errors.

Detecting DoS Attacks Using SVMs
To build SVMs for DoS detection, the input vectors are extracted from the raw network packet dump in the test dataset; the outcome depends on the SVM training parameters, which are as follows:
i. Speed of source IP (SSIP)
ii. Standard deviation of incoming flow packets (SDFP)
iii. Standard deviation of incoming flow bytes (SDFB)
iv. Speed of flow entries (SFE)
v. Ratio of pair-flow entries (RFIP)
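As a rough sketch (not the authors' exact implementation), training an SVM classifier on these five flow features with scikit-learn might look as follows; the feature matrix and labels are assumed to have been extracted from the packet dump beforehand, and the file names are placeholders.

```python
# Sketch: SVM classifier over the five flow features listed above (SSIP, SDFP,
# SDFB, SFE, RFIP). Feature extraction from the raw packet dump is assumed to
# have been done already; the .npy file names are placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = np.load("flow_features.npy")   # shape (n_windows, 5): SSIP, SDFP, SDFB, SFE, RFIP
y = np.load("flow_labels.npy")     # 1 = DoS traffic window, 0 = normal traffic

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # kernel-based hyperplane fit
model.fit(X_tr, y_tr)
print("detection accuracy:", accuracy_score(y_te, model.predict(X_te)))
```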
A flow entry is characterized as an interactive flow if there is a bidirectional path between the source and the destination. The standard deviation of flow packets and bytes decreases because the high number of packets increases the number of flows, while there is only a very slight variation in packet size and byte size, which results in a lower deviation. The ratio of pair-flow entries decreases in the DoS attack situation, since the response from the target machine is missing in the address space. As shown in the figure, the speed of source IP addresses increases during a DoS attack, since IP spoofing is used to simulate the attacks: a new IP enters the system at every moment, so the module has to create a new flow entry for every single IP address entering the pool, which results in an increased number of flows per unit time. The SFE curve also increases during a DoS attack, since more flows are created to handle the large number of incoming IPs. SVMs effectively achieve high detection accuracy (more than 99%) for each attack [11], depending on the data instances the SVM is trained on. Figure 2 shows the result charts for the training parameters of the SVM classifier.
Fig. 2. Results of SVM training parameters
The above parameters and the following figure show the results of the SVM classifier.
Fig. 3. Results from SVM
Figure 3 shows the features that correlate highly with the occurrence of an attack; based on the extracted features, the SVM is successful at grouping data points into their corresponding classes.

3.4 Secure User Authentication
Keystroke Dynamics
User authentication is one of the most vulnerable areas of compromise in any system [12, 13]. Therefore, a robust mechanism should be in place to grant access only to legitimate users. The proposed solution to strengthen the user authentication process involves keystroke dynamics. The use of behavioral biometrics such as keystroke dynamics for authentication is not unheard of, but despite their increasing use in various other fields, their usage in IoT has yet to take off [14, 15].

Dataset
The dataset chosen for the supervised model is a commonly used standard keystroke dynamics dataset that is readily available on the web. It includes the timings of various key holds recorded in milliseconds.

Feature Extraction
The dataset contains more than 100 typing timing attributes, out of which the following features were taken into consideration for this model (a short sketch of computing them follows Fig. 4):

• Hold time – the time interval between the press of a key and the release of the same key.
• Down-Down time – the time interval between the press of a key and the press of the next key.
• Up-Down time – the time interval between the release of a pressed key and the press of the next key.
Fig. 4. Keystroke data collection features
Figure 4 shows a graphical representation of the features chosen for the keystroke dynamics model.
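A minimal sketch of computing these three timings from raw press/release timestamps is shown below; the timestamps and the simple data structure are illustrative assumptions, not the dataset's actual format.

```python
# Sketch: computing hold, down-down and up-down times (in milliseconds) from
# raw key press/release timestamps. The sample timings are placeholders.
from dataclasses import dataclass

@dataclass
class KeyEvent:
    press_ms: float    # time the key was pressed
    release_ms: float  # time the key was released

def keystroke_features(events):
    """Return per-key hold times plus down-down and up-down intervals."""
    hold = [e.release_ms - e.press_ms for e in events]
    down_down = [b.press_ms - a.press_ms for a, b in zip(events, events[1:])]
    up_down = [b.press_ms - a.release_ms for a, b in zip(events, events[1:])]
    return hold, down_down, up_down

# Example with hypothetical timings for a three-key password fragment:
sample = [KeyEvent(0, 95), KeyEvent(180, 260), KeyEvent(330, 410)]
print(keystroke_features(sample))
```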
ARCSECURE: Centralized Hub for Securing a Network of IoT Devices
1081
The Model
The keystroke dynamics component uses machine learning to train a supervised model over a specified number of training iterations. The dataset was split into two halves, a training set and a testing set. The training set was used to train the algorithm; the testing set was then used to test the algorithm and evaluate its accuracy.

Testing
A simple dummy web application was developed for testing purposes using basic front-end technologies such as HTML, CSS, and JavaScript. The machine learning module was then exported and integrated with the web application's user input fields. When the user enters their password, the model compares the typing pattern to the trained patterns. If the user's typing characteristics match or fall within the average range recorded during the user's training phase, the user is deemed legitimate and thereby authenticated to access the system. If not, the user is deemed an imposter and is not allowed to progress any further.
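A minimal sketch of such a range-based check is shown below; it is a simplification, not the authors' exact model, and the tolerance of two standard deviations is an assumption.

```python
# Sketch of range-based keystroke verification: a login attempt is accepted if
# each timing feature falls within a band around the user's trained mean.
# The 2-standard-deviation tolerance is an assumption for illustration.
import numpy as np

def train_profile(samples: np.ndarray):
    """samples: (n_attempts, n_features) timing vectors from enrolment."""
    return samples.mean(axis=0), samples.std(axis=0)

def is_legitimate(attempt: np.ndarray, mean: np.ndarray, std: np.ndarray,
                  tolerance: float = 2.0) -> bool:
    return bool(np.all(np.abs(attempt - mean) <= tolerance * std))

# Hypothetical usage with synthetic timing data:
enrolment = np.random.normal(loc=[95, 180, 85], scale=5, size=(40, 3))
mean, std = train_profile(enrolment)
print(is_legitimate(np.array([97, 176, 88]), mean, std))   # likely True
print(is_legitimate(np.array([40, 300, 20]), mean, std))   # likely False
```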
4 Results and Discussions
There are existing systems that provide security solutions for IoT-based networks; Bitdefender Box and Azure's IoT Box, for example, are existing security hubs that emphasize various aspects of IoT security. What makes this system unique is its end goal: a solution that adds further enhancements and several new features, such as botnet prevention, to the features already provided by the various other solutions.
5 Conclusion
The Internet of Things presents many exciting opportunities and ideas for innovation, but many people may not realize the degree of privacy and security risk associated with this technology. Any device that shares a wireless connection faces the risk of a security breach of one kind or another. It is evident from the functions and features of 'ARCSECURE' illustrated above that it successfully achieves its goals and outperforms current applications in the niche market of local IoT security products. Organizations often have to pay a large amount of money to hire an expert to configure such devices, and it takes additional time before the device becomes usable. The proposed device 'ARCSECURE' is able to adapt to the network and perform the needed configuration by itself. The product can offer strong security even for devices that cannot run anti-virus software, blocking malware, stolen passwords, identity theft, hacker attacks and more while delivering a high level of performance. Even without thorough IT knowledge, users can obtain similar productivity from the proposed device, achieved with unsupervised and supervised machine learning techniques. As for future work, the final outcome of the research will be a fully functional integrated system with the ability to detect and mitigate attacks such as botnets and DDoS, detect and mitigate malware, and authenticate and authorize
legitimate personnel. This research is also expected to aid in building related systems in future work.
References
1. Zhang, Z., et al.: IoT Security: Ongoing Challenges and Research Opportunities (2014)
2. Ahmad, M., Salah, K.: IoT security: review, blockchain solutions, and open challenges. Futur. Gener. Comput. Syst. 82, 395–411 (2018)
3. Lin, H., Bergmann, N.W.: IoT Privacy and Security Challenges for Smart Home Environments (2016)
4. Wurm, J., Hoang, K., Arias, O., Sadeghi, A.R., Jin, Y.: Security analysis on consumer and industrial IoT devices. In: Proceedings of Asia South Pacific Design Automation Conference ASP-DAC, 25–28 January, pp. 519–524 (2016)
5. Jonsdottir, G., Wood, D., Doshi, R.: IoT network monitor. In: 2017 IEEE MIT Undergraduate Research Technology Conference (URTC), Cambridge, MA, pp. 1–5 (2017). Accessed 22 Feb 2020
6. Yu, E., Cho, S.: GA-SVM wrapper approach for feature subset selection in keystroke dynamics identity verification. In: Proceedings of the International Joint Conference on Neural Networks 2003, vol. 3, pp. 2253–2257. IEEE (2003)
7. Revett, K., Gorunescu, F., Gorunescu, M., Ene, M., Magalhaes, S., Santos, H.: A machine learning approach to keystroke dynamics-based user authentication. Int. J. Electron. Secur. Digit. Forensics 1(1), 55–70 (2007)
8. Zahid, S., Shahzad, M., Khayam, S.A., Farooq, M.: Keystroke-based user identification on smart phones. In: Kirda, E., Jha, S., Balzarotti, D. (eds.) Recent Advances in Intrusion Detection, pp. 224–243. Springer, Berlin (2009). https://doi.org/10.1007/978-3-642-04342-0_12
9. Moose, F.S.: moosefs.com, MooseFS, March 2019. https://moosefs.com/. Accessed 20 Feb 2020
10. Chonka, A., Singh, J., Zhou, W.: Chaos theory based detection against network mimicking DDoS attacks. IEEE Commun. Lett. 13(9), 717–719 (2009)
11. Mukkamala, S., Sung, A.H.: Detecting denial of service attacks using support vector machines. In: The IEEE International Conference on Fuzzy Systems. Institute for Complex Additive Systems Analysis, New Mexico Tech, Socorro (2003)
12. El-Hajj, M., Fadlallah, A., Chamoun, M., Serhrouchni, A.: A survey of internet of things (IoT) authentication schemes. Sensors (Switzerland) 19(5), 1–43 (2019). https://doi.org/10.3390/s19051141
13. Muliono, Y., Ham, H., Darmawan, D.: Keystroke dynamic classification using machine learning for password authorization. Procedia Comput. Sci. 135, 564–569 (2018). https://doi.org/10.1016/j.procs.2018.08.209
14. Geneiatakis, D., Kounelis, I., Neisse, R., Nai-Fovino, I., Steri, G., Baldini, G.: Security and privacy issues for an IoT based smart home (2017). https://doi.org/10.23919/MIPRO.2017.7973622
15. Andrew, D., Michael, O.A.: A Study of the Advances in IoT Security, pp. 1–5 (2018). https://doi.org/10.1145/3284557.3284560
Entropy Based Feature Pooling in Speech Command Classification Christoforos Nalmpantis1(B) , Lazaros Vrysis1 , Danai Vlachava2 , Lefteris Papageorgiou3 , and Dimitris Vrakas1 1
Aristotle University of Thessaloniki, Thessaloniki, Greece {christofn,dvrakas}@csd.auth.gr, [email protected] 2 International Hellenic University, Thessaloniki, Greece 3 Entranet Ltd, Thessaloniki, Greece [email protected]
Abstract. In this research a novel deep learning architecture is proposed for the problem of speech commands recognition. The problem is examined in the context of internet-of-things where most devices have limited resources in terms of computation and memory. The uniqueness of the architecture is that it uses a new feature pooling mechanism, named entropy pooling. In contrast to other pooling operations, which use arbitrary criteria for feature selection, it is based on the principle of maximum entropy. The designated deep neural network shows comparable performance with other state-of-the-art models, while it has less than half the size of them.

Keywords: Entropy pooling · Speech classification · Convolutional neural networks · Deep learning

1 Introduction
Internet-of-Things emerged from the amalgamation of the physical and digital world via the Internet. Billions of devices that are used in our daily life are connected to the internet. Our environment is surrounded by mobile phones, smart appliances, sensors, Radio Frequency Identification (RFID) tags and other pervasive computing machines, which communicate with each other and, most importantly, with humans. From the human perspective the most natural way to communicate is by speaking. Speech recognition has been one of the most difficult tasks in artificial intelligence, and machine-to-human user interfaces have so far been restricted to other options such as touch screens. Yet, two technological advancements paved the way for more friendly user interfaces based on sound. The first is the rise of multimedia devices like smart phones. In particular, the development of digital assistants and their incorporation not only in mobile phones but also in smart home or smart car kits has established the need for audio-based interactions with humans. The second advancement is the Deep Learning revolution in many applications of artificial intelligence.
Deep neural networks have shown tremendous success in many domains including, but not limited to, computer vision, natural language processing, speech recognition, energy informatics and health informatics. Such models are already applied to real-world applications such as medical imaging [15], autonomous vehicles [4], activity recognition [8], energy disaggregation [11] and others. Speech recognition is no exception, and there is increasing interest in audio-based applications that can run on embedded or mobile devices [13]. Some examples of sound recognition tasks are automatic speech recognition (ASR), speech-to-text (STT), speech emotion classification, voice command recognition and urban audio recognition. For several years, researchers tried to manually extract features from sound that are relevant to the task. Thus, the traditional pipeline of such systems includes a preprocessing step, feature extraction and a learning model [14,20]. The first two steps mainly include unsupervised signal processing techniques, extracting information in the frequency domain, exploiting frame-based structural information and others [1]. Recently, deep neural networks have demonstrated unprecedented performance in several audio recognition tasks, outperforming traditional approaches [5,16,17,20].

This research focuses on the voice command classification task. Although ASR has reached human performance, such models are gigantic and would not fit on a device with limited resources. Moreover, ASR is not very robust in a real-world environment where noise is present in many unexpected ways. Thus, a more direct, computationally efficient and noise-resilient system is required. A solution to the voice command classification task seems promising both in achieving an acceptable performance and in meeting the aforementioned requirements. In this manuscript a novel 2D convolutional neural network is developed, utilizing a recent pooling operation named entropy pool [10] and applied on the Speech Commands dataset [18]. The paper is organized as follows. Firstly, previous work on this task is presented. Next, there is a detailed description of the proposed system and all aspects of the experimental arrangement. Afterwards the experimental results are demonstrated and analysed. Finally, conclusions and future research directions are presented.
2 Related Work
Recently, deep learning approaches have demonstrated performance superior to classic machine learning systems in various audio classification tasks. Until now there has been a race toward achieving state-of-the-art accuracy on specific tasks. This led to the development of huge neural networks with millions or billions of parameters, which are prohibitive for resource-constrained and real-time systems. In this context, researchers now put their effort into improving the efficiency of deep neural networks. Recent interest in deploying speech recognition models on the edge has led to new work on ASR model compression [9] and other sound recognition tasks.
Coucke et al. [3] developed a model utilizing dilated convolution layers, making it possible to train deeper neural networks that fit in embedded devices. It is worth noting that the dataset they created, named "Hey Snips", is public, with utterances recorded by over 2.2K speakers. Kusupati et al. [7] proposed a novel recurrent neural network (RNN) architecture named FastGRNN, which includes low-rank, sparse and quantized matrices. This architecture results in accurate models that can be up to 35x smaller than state-of-the-art RNNs. FastGRNN was tested on a variety of datasets and tasks including speech, images and text. The scope of this research was to build models that can be deployed to IoT devices efficiently. Zeng et al. [19] proposed a neural network architecture called DenseNet-BiLSTM for the task of keyword spotting (KWS) using the Google Speech Commands dataset. Their main contribution was the combination of a new version of DenseNet, named DenseNet-Speech, with a BiLSTM. The former component of the architecture captures local features while maintaining speech time series information; the latter learns time series features. Solovyev et al. [13] used different representations of sound such as wave frames, spectrograms, mel-spectrograms and MFCCs and designed several neural network architectures based on convolutional layers. Two of their best performing networks had architectures very similar to VGG [12] and ResNet [6]. The models were evaluated on the Google Speech Commands dataset, showing very strong results with accuracy over 90%.

In this research a novel neural network architecture has been developed. The proposed model is based on convolutional layers and pooling operations, has six convolutional layers and is more efficient than other deep architectures like VGG and ResNet, which usually have more than 12 layers. The models of Solovyev et al. are used as a strong baseline. The experiments show that the proposed model performs on par with the deep models while using much less computation power.
3 Materials and Methods

3.1 Dataset
The Speech Commands dataset [18] has become a standard data source for training and evaluating speech command classification models targeted at devices with constrained resources. The primary real-world application of such models is scenarios where a few target words have to be recognised in an unpredictable and noisy environment. The challenge in this type of problem is to achieve very few false positives while restricting energy consumption as much as possible. One common scenario is the recognition of keywords that trigger an interaction, such as "Hey Google", "Hey Siri", etc. Such devices are usually placed in houses or offices where human conversation is frequent and there are many other noises that the devices should ignore. The training data consists of 60K audio clips of around 1 s each. In total there are 32 different labels, of which only 10 are the target ones. The rest of
the labels are considered as silence or unknown. The target labels are left, right, up, down, yes, no, go, stop, on, off. Figure 1 shows a pie chart of the proportion of each target command in the training set. The audio files are 16-bit PCM-encoded WAVE files with a sampling rate of 16 kHz.
Fig. 1. Proportions of target commands in speech commands training dataset.
3.2 Preprocessing and Audio Features
Audio comes in the form of a time series, but it is very common to convert it to a representation in the frequency domain. The most popular sound representations are the spectrogram, the log-mel spectrogram and MFCCs. The spectrogram is used as the main representation in this research. It is computed using the Short Time Fourier Transform (STFT). The STFT converts small segments of a time series to the frequency domain while preserving temporal information. It has three non-default inputs: the signal to be transformed, the frame length and the stride; the stride determines how much consecutive windows overlap. The output is a matrix of complex numbers from which an energy spectrogram is obtained: the spectrogram is extracted by taking the magnitude of the complex numbers and then the logarithm of these values. The angle (phase) of the complex numbers is also included in the final feature set, which improves the accuracy of the final model.
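A minimal sketch of this preprocessing with SciPy is given below; the frame length and stride values are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: log-magnitude spectrogram plus phase from the STFT, as described above.
# The frame length and stride below are illustrative, not the paper's settings.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

rate, signal = wavfile.read("speech_command.wav")   # 16-bit PCM, e.g. 16 kHz
frame_len = 512                                      # samples per frame (assumed)
stride = 256                                         # hop size (assumed)

_, _, Z = stft(signal.astype(np.float32), fs=rate,
               nperseg=frame_len, noverlap=frame_len - stride)

log_magnitude = np.log(np.abs(Z) + 1e-10)  # energy spectrogram on a log scale
phase = np.angle(Z)                        # complex angle, used as an extra feature
features = np.stack([log_magnitude, phase], axis=-1)
print(features.shape)                      # (freq_bins, time_frames, 2)
```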
3.3 Entropy Pooling
Feature pooling is an established layer that helps subsample feature maps with high cardinality. The two most popular pooling operations in deep learning are max and average pooling. However, choosing the right pooling operation is mainly done through a series of experiments, based on the final performance of the model.
To the best of the authors' knowledge, there are two main efforts in the literature that shed light on the properties of these two mechanisms. Firstly, Boureau et al. [2] presented a theoretical analysis and described the statistical properties of max and average feature pooling for a two-class categorization problem. The analysis evaluated the two methods in terms of which properties affected the model's performance in separating two different classes. The most important outcome was that, among other unknown factors, the sparsity and the cardinality of the features affect the model's performance. More recently, Nalmpantis et al. [10] investigated feature pooling operations from the information theory point of view. The authors showed theoretically and empirically that max pooling is not always compatible with the maximum entropy principle: in practice a model's performance can vary a lot with different weight initializations. On the contrary, average pooling gives more consistent results because it always produces a more uniform feature distribution. In this context, a novel pooling operation, named entropy pooling, was presented with guarantees to select features with high entropy. Entropy pooling calculates the probabilities p of the features with cardinality N. The probability values are then spatially separated using a kernel and a stride size, and for each group the rarest feature is selected. Given a group of size r, the mathematical formulation is

f_entr(X_r) = X_r[g(P_r)],   (1)

g(P_r) = arg min_{1≤i≤r} p_i,   (2)

where X_r is the input feature map and P_r the constructed map of probabilities.
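A rough NumPy sketch of this operation is given below; it is an interpretation of Eqs. (1)-(2) for illustration, not the authors' reference implementation. Feature values are binned to estimate their probabilities, and within each pooling window the lowest-probability (rarest) value is kept.

```python
# Sketch of entropy pooling following Eqs. (1)-(2): estimate the probability of
# each feature value over the map, then keep the rarest value in every window.
# This is an interpretation for illustration, not the authors' implementation.
import numpy as np

def entropy_pool2d(x: np.ndarray, kernel: int = 2, stride: int = 2,
                   n_bins: int = 64) -> np.ndarray:
    # Estimate p(value) with a histogram over the whole feature map.
    bins = np.linspace(x.min(), x.max() + 1e-8, n_bins + 1)
    idx = np.digitize(x, bins) - 1
    counts = np.bincount(idx.ravel(), minlength=n_bins)
    prob = counts / counts.sum()
    p_map = prob[idx]                       # P_r: probability of each feature

    h, w = x.shape
    out_h, out_w = (h - kernel) // stride + 1, (w - kernel) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            xi, xj = i * stride, j * stride
            window = x[xi:xi + kernel, xj:xj + kernel]
            p_win = p_map[xi:xi + kernel, xj:xj + kernel]
            out[i, j] = window.ravel()[np.argmin(p_win)]   # rarest feature wins
    return out

print(entropy_pool2d(np.random.rand(8, 8)).shape)  # (4, 4)
```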
3.4 Neural Network Topology
For the problem of speech command recognition, recent research has borrowed popular architectures from computer vision such as VGG16, VGG19, ResNet50, InceptionV3, Xception, InceptionResNetV2 and others [13]. These networks all rely on convolutional layers. In this research convolutional layers are also included, but with the aim of meeting the following requirements. Firstly, the model has to be smaller than the original models that were introduced for cloud-based applications: in the context of speech command recognition, the developed model has to be deployed on devices with limited storage. Moreover, a reduced size also means less computational complexity, which is essential for embedded devices or devices with constrained resources. Furthermore, these devices often depend on a battery, magnifying the need for more energy-efficient models. Finally, the model's inputs will often be very short and, most of the time, irrelevant sounds, which means that false positives have to be eliminated. It is not a surprise that a larger architecture usually achieves better results in terms of accuracy at the expense of computation. After trying many different configurations, a neural network architecture with a satisfying trade off
between best performance and efficiency has been developed. It consists of six 2D convolutional layers with activation function ReLU. Each of the first four convolutional layers is followed by a batch normalization layer and an entropy pooling operation. Figure 2 shows the details of the architecture.
Fig. 2. The proposed neural network with 2D convolutions and entropy pooling operations.
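As a purely illustrative sketch of such a topology, the following PyTorch definition places batch normalization and entropy pooling after each of the first four convolutional layers. The channel widths, the two-plane spectrogram input (log-magnitude and phase), the classifier head and the 12-class output (10 commands plus silence and unknown) are assumptions, since the exact values appear only in Fig. 2; the pooling layer is a compact re-implementation of the operation sketched in Sect. 3.3, not the authors' code.

```python
# Illustrative sketch only: layer widths, input planes and the head are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EntropyPool2d(nn.Module):
    """Keep the lowest-probability (rarest) value in each pooling window.
    Probabilities are estimated per feature map with a histogram; this is an
    interpretation of entropy pooling, not the reference implementation."""
    def __init__(self, kernel: int = 2, stride: int = 2, n_bins: int = 32):
        super().__init__()
        self.kernel, self.stride, self.n_bins = kernel, stride, n_bins

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, H, W)
        b, c, h, w = x.shape
        flat = x.reshape(b * c, -1)
        lo = flat.min(dim=1, keepdim=True).values
        hi = flat.max(dim=1, keepdim=True).values
        idx = ((flat - lo) / (hi - lo + 1e-8) * (self.n_bins - 1)).long()
        counts = torch.zeros(b * c, self.n_bins, device=x.device)
        counts.scatter_add_(1, idx, torch.ones_like(flat))
        p = (counts / flat.shape[1]).gather(1, idx).reshape(b, c, h, w)
        xw = F.unfold(x, self.kernel, stride=self.stride)     # window values
        pw = F.unfold(p, self.kernel, stride=self.stride)     # window probabilities
        k2 = self.kernel * self.kernel
        xw, pw = xw.reshape(b, c, k2, -1), pw.reshape(b, c, k2, -1)
        out = xw.gather(2, pw.argmin(dim=2, keepdim=True)).squeeze(2)
        oh = (h - self.kernel) // self.stride + 1
        ow = (w - self.kernel) // self.stride + 1
        return out.reshape(b, c, oh, ow)

class SpeechCommandNet(nn.Module):
    """Six 3x3 conv layers with ReLU; the first four are each followed by
    batch norm and entropy pooling (all sizes here are assumptions)."""
    def __init__(self, n_classes: int = 12):
        super().__init__()
        chans = [2, 16, 32, 64, 64, 128, 128]   # input: log-magnitude + phase
        layers = []
        for i in range(6):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU()]
            if i < 4:                            # BN + entropy pool after first four
                layers += [nn.BatchNorm2d(chans[i + 1]), EntropyPool2d(2, 2)]
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(chans[-1], n_classes))

    def forward(self, x):
        return self.head(self.features(x))

model = SpeechCommandNet()
print(model(torch.randn(1, 2, 128, 64)).shape)   # torch.Size([1, 12])
```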
4 Experimental Results and Discussion
The evaluation of the proposed neural network was done using accuracy and cross-entropy error. The latter is the cross entropy between the model's output and the target. Accuracy is more intuitive and is the ratio of correctly classified instances to the total number of samples. The model achieves a test error of 0.398 and an accuracy of 90.4%. Other state-of-the-art models perform slightly better, with accuracy around 94%, but these neural networks are several times larger [13]. To give an example, a variant of the popular deep neural network VGG [12], named VGG16, includes 16 weight layers, in contrast to the current solution which involves only six convolutional layers, and its performance is only around 2% better. For a more in-depth analysis of the model's performance, recall, precision and F1 score are also employed. Recall is defined as the number of correctly predicted positive observations divided by the sum of true positives and false negatives; it gives the percentage of the commands that should be recognized and actually are. Precision
is defined as the number of true positives divided by the total number of true and false positives. It gives insight into how many of the predicted commands are actually true. This metric is quite important because it is directly affected by false positives, which are very critical in the real world, as explained previously. Finally, the F1 score is the harmonic mean of precision and recall. Table 1 presents a detailed classification report for each of the commands. According to the table, the command "yes" is the easiest to detect, whereas the worst performance is shown for the command "go". The latter, as shown in the confusion matrix in Fig. 3, is mixed up with silence.

Table 1. Analytical classification report.

           Yes   No    Up    Down  Left  Right  On    Off   Stop  Go    Silence  Macro  Micro
Precision  0.92  0.82  0.93  0.90  0.93  0.96   0.86  0.83  0.92  0.76  0.92     0.89   0.90
Recall     0.88  0.81  0.78  0.78  0.80  0.76   0.83  0.84  0.83  0.78  0.96     0.82   0.90
F1-score   0.90  0.81  0.84  0.84  0.86  0.85   0.85  0.84  0.87  0.77  0.94     0.85   0.90
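As a brief illustration of how such a report can be produced (with hypothetical label lists and predictions, not the authors' evaluation script), scikit-learn's classification_report computes the same per-class precision, recall and F1 values:

```python
# Sketch: per-class precision/recall/F1 and macro/micro averages with
# scikit-learn. y_true and y_pred are placeholders for the real test outputs.
from sklearn.metrics import classification_report, confusion_matrix

labels = ["yes", "no", "up", "down", "left", "right",
          "on", "off", "stop", "go", "silence"]
y_true = ["yes", "go", "silence", "go", "stop"]       # hypothetical ground truth
y_pred = ["yes", "silence", "silence", "go", "stop"]  # hypothetical predictions

print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=labels))
```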
Fig. 3. Confusion matrix with results of the proposed neural network recognizing 10 different speech commands.
5 Conclusion
The problem of speech command recognition is a critical one for the success of personal assistants and other internet-of-things devices. In this research a
novel neural network has been developed, utilizing 2D convolutional layers and a pooling operation which is based on entropy instead of arbitrarily selecting audio features. The proposed model meets real-world requirements and can be used in a product with confidence. It achieves a decent performance when compared to larger models, while at the same time it is energy efficient and computationally lightweight. For future work it is recommended to examine the performance of entropy pooling in larger neural networks. Researchers are advised to conduct further experiments on more datasets and different sound recognition tasks.

Acknowledgement. This research has been co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH–CREATE–INNOVATE (project code: T1EDK-00343(95699) - Energy Controlling Voice Enabled Intelligent Smart Home Ecosystem).
References
1. Bountourakis, V., Vrysis, L., Konstantoudakis, K., Vryzas, N.: An enhanced temporal feature integration method for environmental sound recognition. In: Acoustics, vol. 1, pp. 410–422. Multidisciplinary Digital Publishing Institute (2019)
2. Boureau, Y.L., Ponce, J., LeCun, Y.: A theoretical analysis of feature pooling in visual recognition. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 111–118 (2010)
3. Coucke, A., Chlieh, M., Gisselbrecht, T., Leroy, D., Poumeyrol, M., Lavril, T.: Efficient keyword spotting using dilated convolutions and gating. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6351–6355 (2019)
4. Fayyad, J., Jaradat, M.A., Gruyer, D., Najjaran, H.: Deep learning sensor fusion for autonomous vehicle perception and localization: a review. Sensors 20(15), 4220 (2020)
5. Han, W., et al.: ContextNet: improving convolutional neural networks for automatic speech recognition with global context. arXiv preprint arXiv:2005.03191 (2020)
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
7. Kusupati, A., Singh, M., Bhatia, K., Kumar, A., Jain, P., Varma, M.: FastGRNN: a fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. In: Advances in Neural Information Processing Systems, pp. 9017–9028 (2018)
8. Lentzas, A., Vrakas, D.: Non-intrusive human activity recognition and abnormal behavior detection on elderly people: a review. Artif. Intell. Rev. 53, 1975–2021 (2020). https://doi.org/10.1007/s10462-019-09724-5
9. McGraw, I., et al.: Personalized speech recognition on mobile devices. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5955–5959. IEEE (2016)
10. Nalmpantis, C., Lentzas, A., Vrakas, D.: A theoretical analysis of pooling operation using information theory. In: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 1729–1733. IEEE (2019)
11. Nalmpantis, C., Vrakas, D.: On time series representations for multi-label NILM. Neural Comput. Appl. 32, 17275–17290 (2020). https://doi.org/10.1007/s00521-020-04916-5
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
13. Solovyev, R.A., et al.: Deep learning approaches for understanding simple speech commands. In: 2020 IEEE 40th International Conference on Electronics and Nanotechnology (ELNANO), pp. 688–693. IEEE (2020)
14. Tsipas, N., Vrysis, L., Dimoulas, C., Papanikolaou, G.: MIREX 2015: methods for speech/music detection and classification. In: Music Information Retrieval Evaluation eXchange (MIREX) (2015)
15. Viswanathan, J., Saranya, N., Inbamani, A.: Deep learning applications in medical imaging: introduction to deep learning-based intelligent systems for medical applications. In: Deep Learning Applications in Medical Imaging, pp. 156–177. IGI Global (2021)
16. Vrysis, L., Thoidis, I., Dimoulas, C., Papanikolaou, G.: Experimenting with 1D CNN architectures for generic audio classification. In: Audio Engineering Society Convention 148. Audio Engineering Society (2020)
17. Vrysis, L., Tsipas, N., Thoidis, I., Dimoulas, C.: 1D/2D deep CNNs vs. temporal feature integration for general audio classification. J. Audio Eng. Soc. 68(1/2), 66–77 (2020)
18. Warden, P.: Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 (2018)
19. Zeng, M., Xiao, N.: Effective combination of DenseNet and BiLSTM for keyword spotting. IEEE Access 7, 10767–10775 (2019)
20. Zhang, Z., Geiger, J., Pohjalainen, J., Mousa, A.E.D., Jin, W., Schuller, B.: Deep learning for environmentally robust speech recognition: an overview of recent developments. ACM Trans. Intell. Syst. Technol. 9(5), Article 49, 28 pp. (2018). https://doi.org/10.1145/3178115
Author Index
A Abeykoon, A. M. I. S., 1071 Abeywardena, Kavinga Yapa, 1071 Akhtari, Ali Akbar, 259 Al Azkiyai, Ahmad Whafa Azka, 851 Al Masry, Zeina, 87 Alabdali, Aliaa, 36 Aleksieva-Petrova, Adelina, 1051 Alharbi, Nesreen, 36 Alhusayni, Samirah A., 649 Alieksieiev, Volodymyr, 595 Alkhusaili, Majed, 173 Almalki, Faris A., 649 Alsuwat, Shuruq K., 649 Altalhi, Shahd H., 649 Alwashmi, Reem, 36 Alzahrani, Hawwaa S., 649 Andonov, Stefan, 885 Annarelli, Alessandro, 818 Anthopoulos, Marios, 575 Apiola, Mikko-Ville, 369 Asino, Tutaleni I., 446 Atapattu, A. M. S. P. B., 1071 Attas, Dalia, 58
Blackmore, Chris, 58 Bonakdari, Hossein, 259 Boranbayev, Askar, 437 Boutahar, Jaouad, 761 Bradish, Philip, 108 Brink, Willie, 100
B Bah, Bubacarr, 100 Baidyussenov, Ruslan, 437 Barmawi, Ari Moesriami, 851 Bartsch, Jan, 722 Bellini, Emanuele, 681 Benaggoune, Khaled, 87 Bernik, Andrija, 353, 527 Blackburn, Daniel, 58
D D’Souza, Daryl, 369 Deakyne, Alex J., 390 Delsing, Jerker, 664 Dervinis, Gintaras, 784 Devalland, Christine, 87 Di Tullio, Daniele, 923 Dimitrova, Vesna, 885 Dragos, Denise, 1041
C Caballero-Hernandez, Hector, 833 Cambou, Bertrand, 904, 1020 Camenen, Pierre, 124 Chan, Eang Teng, 624 Chaudhari, Sarang, 108 Chawki, Mohamed, 986 Chen, Daqing, 124 Chen, Jim Q., 309 Chen, Ying-Chen, 1020 Chenchev, Ivaylo, 1051 Christensen, Heidi, 58 Christou, Nikolaos, 270 Clear, Michael, 108 Clemente, Serena, 818 Cleveland, Signe Marie, 161 Coffey, Christopher, 904
Du, Wenliang (Kevin), 868 Dushanova, Juliana, 233
Kozlov, Alexander, 213 Kwembe, Tor A., 446
E Ebtehaj, Isa, 259 El Ghazi El Houssaïni, Souhaïl, 761 Eugenio, Evercita C., 949
L Laakso, Mikko-Jussi, 369 Lamkin, Darron, 446 Larson, Kent, 641 Larsson, Peter, 369 Lazarevich, Ivan, 213 Li, Bo, 124 Liu, Fang, 949 Lopes, Vasco, 298 López-López, Aurelio, 1 Lovaas, Petter, 933 Luo, Yaqing, 286 Lyalyushkin, Nikolay, 213
F Fazendeiro, Paulo, 298 Fiallo, Darius, 743 Franchina, Luisa, 1003 G Ghafarian, Ahmad, 743 Gharabaghi, Bahram, 259 Gholami, Azadeh, 259 Ghosh, Robin, 446 Gibson, Ryan M., 327 Gjorgjievska Perusheska, Milena, 885 Gojka, Efthimios-Enias, 722 Gonçalves, Gil, 664 Goosen, Leila, 507 Gorbachev, Yury, 213 Gowanlock, Michael, 904 Grammenou, Sotiria, 575 Gyawali, Manoj, 923 H Haddara, Moutaz, 161 Hajderanj, Laureta, 124 Halstead-Nussloch, Richard, 549 Harkness, Kirsty, 58 Heiden, Bernhard, 595 I Iaizzo, Paul A., 390 J Jayawardhane, H. N., 1071 Jensen, Christian D., 967 K Kaimakamis, Christos, 575 Kalinin, Konstantin, 70 ˇ Kamenar Cokor, Dolores, 527 Kami´nski, Paweł, 136 Kanaya, Ichi, 560 Kannengießer, Niclas, 722 Klein, Ran, 17 Klemsa, Jakub, 702 Kopeliovich, Mikhail, 70 Kouam Kamdem, Igor Godefroy, 48 Koulas, Emmanouil, 575 Kousaris, Konstantinos, 575
M Maiti, Monica, 399 Mallet, Sarah, 124 Marzouk, Osama A., 462 Mashat, Arwa, 36 Maskani, Ilham, 761 Mazhitov, Mikhail, 437 McHale, S. A., 192 Mensah, Samuel Ofosu, 100 Meraghni, Safa, 87 Mirheidari, Bahman, 58 Mironenko, Yuriy, 70 Moreno, Juan, 804 Morison, Gordon, 327 Morsy, Aya Mohamed, 145 Mostafa, Mostafa Abdel Azim, 145 Muñoz-Jimenez, Vianney, 833 N Nalmpantis, Christoforos, 1083 Nawyn, Jason, 641 Nkenlifack, Marcellin Julius Antonio, 48 Nonato, Luis Gustavo, 804 Nonino, Fabio, 818 Novogrudska, Rina, 340 Nyström, Tobias, 491 O Obuchowicz, Rafał, 136 P Palmaro, Federico, 1003 Palombi, Giulia, 818 Panavou, Fotini-Rafailia, 575 Papageorgiou, Lefteris, 1083 Papp, Glenn, 933 Pearson, Esther, 208 Pei´c, Dalibor, 353
Pereira, E., 192 Peristeras, Vassilios, 575 Petrov, Milen, 1051 Petrushan, Mikhail, 70 Philabaum, Christopher, 904 Pinto, Rui, 664 Piórkowski, Adam, 136 Piskioulis, Orestis, 575 Popova, Maryna, 340 Popovska-Mitrovikj, Aleksandra, 885 Priyaadharshini, M., 399 Q Quintero, Sebastian, 804 Qureshi, Kalim, 173 R Ramos, Marco A., 833 Raudonis, Vidas, 784 Regencia, Josiah Eleazar T., 606 Ren, Hao, 124 Reuber, Markus, 58 Riascos, Alvaro, 804 Rico, Andres, 641 Rodrigues, Manuel, 542 Roldán-Palacios, Marisol, 1 Rossi, Matteo, 681 Rutherfoord, Rebecca, 549 S Salman, Ammar S., 868 Salman, Odai S., 17 Samarasekara, C. N., 1071 Sanchez, Cristian, 804 Sanchez, Susana, 475 Schmeelk, Suzanna, 1041 Selitskaya, Natalya, 270 Selitskiy, Stanislav, 270 Shah, Syed Iftikhar Hussain, 575
Shamporov, Vasily, 213 Smolikowska, Alicja, 136 Smuts, Carson, 641 Sturm, Benjamin, 722 Sultan, Shizra, 967 Sunyaev, Ali, 722 T Tang, Mui Joo, 624 Taskov, Tihomir, 233 Tawaki, Meina, 560 Terrissa, Labib Sadek, 87 Tewari, Hitesh, 108 Tonino-Heiden, Bianca, 595 Tovar, Eduardo, 664 V Valenzuela, Thomas, 390 van Heerden, Dalize, 507 Veerasamy, Ashok Kumar, 369 Venneri, Annalena, 58 Vijeikis, Romas, 784 Vinayaga Sundaram, B., 399 Vlachava, Danai, 1083 Vrakas, Dimitris, 1083 Vrysis, Lazaros, 1083 W Wahyudi, Bambang Ari, 851 Walker, Traci, 58 Y Yamamoto, Keiko, 560 Yu, William Emmanuel S., 606 Z Zerhouni, Noureddine, 87 Zerpa, Levis, 419 Zhao, Erlong, 124