Intelligence in Big Data Technologies—Beyond the Hype: Proceedings of ICBDCC 2019 [1st ed.] 9789811552847, 9789811552854

This book is a compendium of the proceedings of the International Conference on Big-Data and Cloud Computing. The papers


English Pages XIII, 636 [625] Year 2021



Table of contents:
Front Matter ....Pages i-xiii
From Dew Over Cloud Towards the Rainbow (Zorislav Šojat)....Pages 1-15
L1 Norm SVD-Based Ranking Scheme: A Novel Method in Big Data Mining (Rahul Aedula, Yashasvi Madhukumar, Snehanshu Saha, Archana Mathur, Kakoli Bora, Surbhi Agrawal)....Pages 17-29
Human Annotation and Emotion Recognition for Counseling System with Cloud Environment Using Deep Learning (K. Arun Kumar, Mani Koushik, Thangavel Senthil Kumar)....Pages 31-42
Enhancing Intricate Details of Ultrasound PCOD Scan Images Using Tailored Anisotropic Diffusion Filter (TADF) (Suganya Ramamoorthy, Thangavel Senthil Kumar, S. Md. Mansoorroomi, B. Premnath)....Pages 43-52
LSTM and GRU Deep Learning Architectures for Smoke Prediction System in Indoor Environment (S. Vejay Karthy, Thangavel Senthil Kumar, Latha Parameswaran)....Pages 53-64
A Mobile-Based Framework for Detecting Objects Using SSD-MobileNet in Indoor Environment (K. K. R. Sanjay Kumar, Goutham Subramani, Senthil Kumar Thangavel, Latha Parameswaran)....Pages 65-76
Privacy-Preserving Big Data Publication: (K, L) Anonymity (J. Andrew, J. Karthikeyan)....Pages 77-88
Comparative Analysis of the Efficacy of the EEG-Based Machine Learning Method for the Screening and Diagnosing of Alcohol Use Disorder (AUD) (Susma Grace Varghese, Oshin R. Jacob, P. Subha Hency Jose, R. Jegan)....Pages 89-96
Smart Solution for Waste Management: A Coherent Framework Based on IoT and Big Data Analytics (E. Grace Mary Kanaga, Lidiya Rachel Jacob)....Pages 97-106
Early Detection of Diabetes from Daily Routine Activities: Predictive Modeling Based on Machine Learning Techniques (R. Abilash, B. S. Charulatha)....Pages 107-114
Classification of Gender from Face Images and Voice (S. Poornima, N. Sripriya, S. Preethi, Saanjana Harish)....Pages 115-124
An Outlier Detection Approach on Credit Card Fraud Detection Using Machine Learning: A Comparative Analysis on Supervised and Unsupervised Learning (P. Caroline Cynthia, S. Thomas George)....Pages 125-135
Unmasking File-Based Cryptojacking (T. P. Khiruparaj, V. Abishek Madhu, Ponsy R. K. Sathia Bhama)....Pages 137-146
Selection of a Virtual Machine Within a Scheduler (Dispatcher) Using Enhanced Join Idle Queue (EJIQ) in Cloud Data Center (G. Thejesvi, T. Anuradha)....Pages 147-153
An Analysis of Remotely Triggered Malware Exploits in Content Management System-Based Web Applications (C. Kavithamani, R. S. Sankara Subramanian, Srinevasan Krishnamurthy, Jayakrishnan Chathu, Gayatri Iyer)....Pages 155-168
GSM-Based Design and Implementation of Women Safety Device Using Internet of Things (N. Prakash, E. Udayakumar, N. Kumareshan, R. Gowrishankar)....Pages 169-176
A Novel Approach on Various Routing Protocols for WSN (E. Udayakumar, Arram Sriram, Bandlamudi Ravi Raju, K. Srihari, S. Chandragandhi)....Pages 177-187
Fraud Detection for Credit Card Transactions Using Random Forest Algorithm (T. Jemima Jebaseeli, R. Venkatesan, K. Ramalakshmi)....Pages 189-197
Deep Learning Application in IoT Health Care: A Survey (Jinsa Mary Philip, S. Durga, Daniel Esther)....Pages 199-208
Context Aware Text Classification and Recommendation Model for Toxic Comments Using Logistic Regression (S. Udhayakumar, J. Silviya Nancy, D. UmaNandhini, P. Ashwin, R. Ganesh)....Pages 209-217
Self-supervised Representation Learning Framework for Remote Crop Monitoring Using Sparse Autoencoder (J. Anitha, S. Akila Agnes, S. Immanuel Alex Pandian)....Pages 219-227
Determination of Elements in Human Urine for Transient Biometrics (N. Ambiga, A. Nagarajan)....Pages 229-243
Optimal Placement of Green Power Generation in the Radial Distribution System Using Harmony Search Algorithm (S. Ganesh, G. Ram Prakash, J. A. Michline Rupa)....Pages 245-252
Data Security and Privacy Protection in Cloud Computing: A Review (J. E. Anusha Linda Kostka, S. Vinila Jinny)....Pages 253-257
Re-Ranking ODI Batsman Using JJ Metric (Jerin Jayaraj, Maria Sajan, Linda Joy, Narayanan V. Eswar, G. Pankaj Kumar)....Pages 259-266
Intelligent Cloud Load Balancing Using Elephant Herd Optimization (Pradeep Abijith, K. S. Aswin Kumar, Raphael Sunny Allen, Dileep Vijayakumar, Antony Paul, G. Pankaj Kumar)....Pages 267-273
A Quest for Best: A Detailed Comparison Between Drakvuf-VMI-Based and Cuckoo Sandbox-Based Technique for Dynamic Malware Analysis (A. Alfred Raja Melvin, G. Jaspher W. Kathrine)....Pages 275-290
CloudStore: A Framework for Developing Progressive Streaming Games (G. Pankaj Kumar, C. Anagha Zachariah, D. R. Umesh, M. N. Arun Kumar)....Pages 291-296
Classification of DGA Botnet Detection Techniques Based on DNS Traffic and Parallel Detection Technique for DGA Botnet (Seena Elizebeth Mathew, A. Pauline)....Pages 297-304
Effective Utilization of Face Verification in Fog Computing on Cloud Architecture (S. Princy Suganthi Bai, D. Ponmary Pushpa Latha, R. David Vinodh Kumar Paul)....Pages 305-312
An Efficient Mechanism for Revocation of Malicious Nodes in Vehicular Ad Hoc Networks (R. Jeevitha, N. Sudha Bhuvaneswari)....Pages 313-323
Robust Service Selection Through Intelligent Clustering in an Uncertain Environment (K. Nivitha, A. Solaiappan, P. Pabitha)....Pages 325-332
DNA: Dynamically Negotiable Approach—A P2P-based Overlay for Live Multimedia Streaming (Preetha Evangeline, Anandhakumar Palanisamy, Pethuru Raj Chelliah)....Pages 333-339
A Study on Feature Extraction and Classification for Tongue Disease Diagnosis (Saritha Balu, Vijay Jeyakumar)....Pages 341-351
MRI Brain Image Classification System Using Super Pixel Color Contrast and Support Vector Neural Network (A. Jayachandran, A. Jegatheesan, T. Sreekesh Namboodiri)....Pages 353-360
Performance Comparison of Machine Learning Models for Classification of Traffic Injury Severity from Imbalanced Accident Dataset (P. Joyce Beryl Princess, Salaja Silas, Elijah Blessing Rajsingh)....Pages 361-369
FCM-Based Segmentation and Neural Network Classification of Tumor in Brain MRI Images (S. Sandhya, B. Chidambararajan, M. Senthil Kumar)....Pages 371-378
SDN-Based Traffic Management for Personalized Ambient Assisted Living Healthcare System (Deva Priya Isravel, Salaja Silas, Elijah Blessing Rajsingh)....Pages 379-388
Design and Implementation of Parking System Using Feature Extraction and Pattern Recognition Technique (H. Varun Chand, J. Karthikeyan)....Pages 389-400
Classification of Diabetes Milletus Using Naive Bayes Algorithm (S. Josephine Theresa, D. J. Evangeline)....Pages 401-412
A Mood-Based Recommender System for Indian Music Using K-Prototype Clustering (K. A. Rashmi, B. Kalpana)....Pages 413-418
A 3D Convolutional Neural Network for Bacterial Image Classification (T. S. R. Mhathesh, J. Andrew, K. Martin Sagayam, Lawrence Henesey)....Pages 419-431
Intelligent Big Data Domain for R-fMRI Big Data Preprocessing—An Optimized Approach (K. Elaiyaraja, M. Senthil Kumar, B. Chidambararajan)....Pages 433-441
Textual Feature Ensemble-Based Sarcasm Detection in Twitter Data (Karthik Sundararajan, J. Vijay Saravana, Anandhakumar Palanisamy)....Pages 443-450
Designing Parallel Operation for High-Performance Cloud Computing Using Partition Algorithm (Krishnan Rajkumar, A. Sangeetha, V. Ebenezer, G. Ramesh, N. Karthik)....Pages 451-462
ATSA: Ageing-Based Task Scheduling Algorithm for Mobile Edge Computing (S. M. Muthukumari, E. George Dharma Prakash Raj)....Pages 463-471
Application of Big Data in Field of Medicine (J. Sabarish, S. Sonali, P. T. R. Vidhyaa)....Pages 473-484
Implementation of Extended Play-Fair Algorithm for Client-Side Encryption of Cloud Data (J. David Livingston, E. Kirubakaran)....Pages 485-493
System Modeling and Simulation of an Outdoor Illumination System Using a Multi-layer Feed-Forward Neural Network (Titus Issac, Salaja Silas, Elijah Blessing Rajsingh)....Pages 495-507
Comparative Performance Analysis of Various Classifiers on a Breast Cancer Clinical Dataset (E. Jenifer Sweetlin, D. Narain Ponraj)....Pages 509-516
Handling Data Imbalance Using a Heterogeneous Bagging-Based Stacked Ensemble (HBSE) for Credit Card Fraud Detection (V. Sobanadevi, G. Ravi)....Pages 517-525
Advances in Photoplethysmogram and Electrocardiogram Signal Analysis for Wearable Applications (G. R. Ashisha, X. Anitha Mary)....Pages 527-534
An Authenticated E-Voting System Using Biometrics and Blockchain (A. Priyadharshini, M. Prasad, R. Joshua Samuel Raj, S. Geetha)....Pages 535-542
Early Recognition of Herb Sickness Using SVM (S. Geetha, P. Nanda, R. Joshua Samuel Raj, T. Prince)....Pages 543-550
Automatic Detection of Sensitive Attribute in Privacy-Preserved Hadoop Environment Using Data Mining Techniques (R. Anitha Murthy, Dhina Suresh)....Pages 551-558
Performance Analysis of Grid Connected Modified Z-Source High Step up Inverter for Solar Photovoltaic System (Y. Pavithra, K. Lakshmi)....Pages 559-569
Application of Integrated IoT Framework to Water Pipeline Transportation System in Smart Cities (E. B. Priyanka, S. Thangavel, V. Madhuvishal, S. Tharun, K. V. Raagul, C. S. Shiv Krishnan)....Pages 571-579
Improved Image Deblurring Using GANs (Prathamesh Mungarwadi, Shubham Rane, Ritu Raut, Tanuja Pattanshetti)....Pages 581-588
A Power-Efficient Security Device Leveraging Deep Learning (DL)-Inspired Facial Recognition (R. S. Saundharya Thejaswini, S. Rajaraajeswari, Pethuru Raj)....Pages 589-597
Applications, Analytics, and Algorithms—3 A’s of Stream Data: A Complete Survey (L. Amudha, R. Pushpalakshmi)....Pages 599-606
Optimization of Extreme Learning Machine Using the Intelligence of Monarch Butterflies for Osteoporosis Diagnosis (D. Devikanniga, R. Joshua Samuel Raj)....Pages 607-615
Secure and Efficient Sensitive Info-Hiding for Data Sharing via DACES Method in Cloud (R. Joshua Samuel Raj, J. Jeya Praise, M. Viju Prakash, A. Sam Silva)....Pages 617-636

Advances in Intelligent Systems and Computing 1167

J. Dinesh Peter · Steven L. Fernandes · Amir H. Alavi, Editors

Intelligence in Big Data Technologies—Beyond the Hype
Proceedings of ICBDCC 2019

Advances in Intelligent Systems and Computing Volume 1167

Series Editor
Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Advisory Editors
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary
Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA
Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan
Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico
Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil
Ngoc Thanh Nguyen, Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland
Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong

The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. ** Indexing: The books of this series are submitted to ISI Proceedings, EI-Compendex, DBLP, SCOPUS, Google Scholar and Springerlink **

More information about this series at http://www.springer.com/series/11156

J. Dinesh Peter · Steven L. Fernandes · Amir H. Alavi



Editors

Intelligence in Big Data Technologies—Beyond the Hype
Proceedings of ICBDCC 2019


Editors

J. Dinesh Peter, Department of Computer Science and Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu, India

Steven L. Fernandes, Department of Computer Science, University of Central Florida, Orlando, FL, USA

Amir H. Alavi, Department of Civil and Environmental Engineering, University of Pittsburgh, Pittsburgh, PA, USA; Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan

ISSN 2194-5357 / ISSN 2194-5365 (electronic)
Advances in Intelligent Systems and Computing
ISBN 978-981-15-5284-7 / ISBN 978-981-15-5285-4 (eBook)
https://doi.org/10.1007/978-981-15-5285-4

© Springer Nature Singapore Pte Ltd. 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

This work comprises the proceedings of the International Conference on Big Data and Cloud Computing (ICBDCC'19). This conference was organized with the primary theme of promoting ideas that provide technological solutions to the big data and cloud computing applications. ICBDCC provided a unique forum for the practitioners, developers and users to exchange ideas and present their observations, models, results and experiences with the researchers who are involved in real-time projects that provide solutions for research problems of recent advancements in big data and cloud computing technologies.

In the last decade, a number of sophisticated and new computing technologies have been developed. With the introduction of new computing paradigms such as cloud computing, big data and other innovations, ICBDCC provided a high-quality dissemination forum for new ideas, technology focus, research results and discussions on the evolution of computing for the benefit of both scientific and industrial developments. ICBDCC is supported by a panel of reputed advisory committee members both from India and from all across the world.

This proceedings includes topics in the fields of big data, data analytics in cloud, cloud security, cloud computing, and big data and cloud computing applications. The research papers featured in this proceedings provide novel ideas that contribute to the growth of the society through computing technologies. The contents of this proceedings will prove to be an invaluable asset to the researchers in the areas of big data and cloud computing.

We appreciate the extensive time and effort put in by all the members of the Organizing Committee for ensuring a high standard for the papers published in this volume. We would like to express our thanks to the panel of experts who helped us to review the papers and assisted us in selecting the candidate for the Best Paper Award. We would like to thank the eminent keynote speakers who have shared their ideas with the audience and all the researchers and academicians who have contributed their research work, models and ideas to ICBDCC'19.

J. Dinesh Peter, Coimbatore, India
Amir H. Alavi, Pittsburgh, PA, USA / Taichung, Taiwan
Steven L. Fernandes, Orlando, FL, USA


About the Editors

J. Dinesh Peter is currently working as an Associate Professor, Department of Computer Sciences Technology at Karunya University, Coimbatore. Prior to this, he was a full time research scholar at National Institute of Technology, Calicut, India, from where he received his Ph.D. in Computer Science and Engineering. His research focus includes Big-data, image processing and computer vision. He has highly cited publications in journals of national and international repute. He is a member of IEEE, CSI & IEI and has served as session chairs and delivered plenary speeches for various international conferences and workshops.

Steven L. Fernandes is currently a post-doctoral researcher in the Department of Computer Science, University of Central Florida. He has previously been affiliated with the University of Alabama Birmingham, USA and the Sahyadri College of Engineering and Management, India. Dr Fernandes has authored 1 book and 40 research papers in refereed journals.

Amir H. Alavi is an Assistant Professor in the Department of Civil and Environmental Engineering, and holds a courtesy appointment in the Department of Bioengineering at the University of Pittsburgh. Dr. Alavi's research interests include structural health monitoring, smart civil infrastructure systems, deployment of advanced sensors, energy harvesting, and engineering informatics. He is the director of the Pitt's Intelligent Structural Monitoring and Response Testing (iSMaRT) Lab which focuses on advancing the knowledge and technology required to create self-sustained and multifunctional sensing and monitoring systems. His research activities involve implementation of these systems enhanced by engineering informatics in the fields of civil infrastructure, construction, aerospace, and biomedical engineering. Dr. Alavi has authored five books and over 170 publications in archival journals, book chapters, and conference proceedings. He is among the Google Scholar 200 Most Cited Authors in Civil Engineering, as well as Web of Science ESI's World Top 1% Scientific Minds.


From Dew Over Cloud Towards the Rainbow Ecosystem of the Future: Nature—Human—Machine

Zorislav Šojat
Centre for Informatics and Computer Science, Ruđer Bošković Institute, Zagreb, Croatia
e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2021. J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_1

Abstract This article is dedicated to the Philosophy of Computing, where Computing is taken in its wide sense of "constructed/programmed machine action". As in modern times computing equipment ubiquitously penetrates all aspects of human and natural environments, suddenly Computer Science has to deal with, and consequently be responsible for, an almost inexhaustible area of natural and human life—individual, social, political and economic subsistence, and cultivation (adaptation/change) of natural environments. Therefore, there are three main fields it has to deal with: the Machine, the Human and the Nature. So there are also three major aspects from which the development of the Future Ecosystem, presently extremely influenced by Computer Science, has to be seen and done: the Technological, the Humanistic and the Naturalistic. As archetypal symbols, colours are used throughout our history. A symbolic "division" of a future Nature–Human–Machine Ecosystem, or of the major symbolic areas of Computer Science, is given through the vision of a Rainbow: Infrared—Energy, Red—Hardware, Orange—Creativity, Yellow—Appropriateness, Green—Nature, Blue—Communication, Indigo—Cooperation, Violet—Interference, Ultraviolet—Visions.

Keywords Philosophy of Computing · Computer Science · Rainbow-Ecosystem · Dew-Computing · Fog-Computing · Cloud-Computing · Energy · Hardware · Creativity · Appropriateness · Nature · Communication · Cooperation · Interference · Visions

1 Yesterday—Today—Tomorrow

It is the essence of human nature to be inquisitive and creative. We are beings which always try to understand and change our world we live in, trying to beautify it and adapt it, making it as close as possible to our dreams of what we would like it to be. And we always imagine a world which we think would be better for us.

And often we think that the piece of the world we adapt to our dreams of being better is a priori better for all of us. So we end up with many problems we never foresaw, like smokey atmosphere, undrinkable water, destruction of thousands of living species, or complete dependence on technical infrastructure.

Interestingly enough, it can be seen as a kind of "laziness" which often forces us to create the most marvellous things and to spend years and years creating them. Manual labour gets sometimes really tough. And calculating and exploring nature can be completely exhausting. Those are the two prime movers of the development of "thinking" and "controlling" machines. And if a machine can thoughtfully control, that's even better!

During the last century, we evolved those basic technologies to unforeseen heights. And just the word 'unforeseen' is very important. The development of Computer Science, though it being extremely young, has shown two major features: Unbelievable forgetfulness. Do any of the modern Computer Science students know how to make a relay-based computer? An analogue computer? To design a processor from scratch? And, as seen, unforeseen, unforeseeable and uncoordinated, almost stochastic expansion in all possible directions of human endeavour, into social, financial and political life, into the human and natural physical environment and even into the possibility of everyday survival.

But, sincerely, do we ever really think of Future? Do we, by applying all our intellectual potential into the development of computer technology and the Computer Science, do we often enough think about the fact that the world we make now will be the world our children and grandchildren will have to live in? Do we have a realistic, human and nature inhabitable vision of where we go? Yes, we see the path. Well, we see many paths. But where do they lead? Will the natural, social and political environment these paths lead towards be for the betterment of basic human values, or will it degrade the qualities of being human? Do we, computer scientists and experts see and feel the life in our future? Do we sincerely take responsibility for what we do and where we go?

Huge are, really, the successes of Computer Science. Most of the whole collected knowledge of humankind is at the fingertips of anybody interested. The possibility to communicate by writing, talking and even seeing each other, wherever we are in the world we know, makes a huge positive shift in human civilisation. The scientific discoveries enabled by Computer Science are unbelievable. But just for a moment imagine a huge coronal mass ejection on the Sun, directed directly towards our planet Earth, and the consequences of emerging geomagnetic storms on our infrastructure, both near-Earth, like satellites, and Earth-bound, like power distribution and EMF disturbances of micro-components. And consequently the consequences on everyday life, from heating, water-supply and lighting to the production, transport and finances. Do we have any backup plan?

It is known in Cybernetics that it is impossible to understand a system being inside it. So let us take the wings of mind and see the field of Computer Science from above. What we see immediately is that there are three main paradigms in present-day computing. We wish to have "thinking" machines, huge, powerful 'intelligent' human assistants in whatever we want. That is the area of the Cloud-Computing paradigm, and it is situated in the core of the Internet.

“controllers”, be it for heating, cooling, cooking or health, in cars, traffic of all kinds, production or sports… These machines have to be fully under the control of humans, but it is true that huge possibilities open up with the possibility to coordinate them, i.e. by giving them “suggestions” (but never “commands”) from the Clouds. This is the area below the Edge of Internet—the area of the Dew-Computing paradigm. And finally, in between, we have an area of Fog-Computing, where the Clouds and the Dew meet in constant interaction with the human user. Due to an undefinable growing amount of different functionalities and architectural heterogeneity, future of Computing will have to be based on Information and not Data processing. Data by itself, without a context (or meta-data), is completely meaningless. For example 50,000,000, 125,000, 4500. If you have no idea of what those numbers mean, i.e. no meta-data, no context, you cannot know that the first number is actually the mass, in tonnes, of electronic equipment waste in the year 2018. The second number is the amount of jumbo-jets which would have the same mass, and the third the amount of Eiffel towers needed to have that mass [12]. Regarding the huge expansion of general computing usage in all fields of human endeavour, to enable seamless integration, enhancing human learning and exploration facilities and general natural well-being, it will be necessary to develop a consistent expandable Ontology, which would be able to preserve Information throughout the whole computing hierarchy as a necessary component of each Information transfer. A possible preliminary work on such an ontology, using shades of four primary colours as thematic groups can be found in [2]. Such an approach may have many advantages, due to consistent colour coding which allows association and attribution to a specific thematic group, as well as cross-referencing with other thematic (colour) groups in knowledge/notion subdivisions. But then, from that high viewpoint, we see another aspect of Computer Science— the global lack of a global coherent vision, the lack of Philosophy of Computing, including Ethics and Ecology (Science of the Whole). Our present-day world, both human and natural, is completely intertwined and intertwisted, and, almost unwillingly, Computer Science crept into every pore of it. Therefore, we will have to take into account three almost opposing aspects of the whole of the Future Ecosystem, consisting of Nature, Humans and Machines: the technological, the humanistic and the naturalistic. We certainly may not forget that Nature is the prime moving force of our existence, and that it will certainly survive whatever disastrous end we may find on the end of some path of development we follow. Second come Humans, and we may not allow ourselves to lose some of our basic natural humanity. We are physically and mentally agile, we have a strong emotional sense of ethics, of what’s good and what’s bad, we need freedom, we have feelings and we each have a soul. We are eager to understand and have many emotional, intellectual and mystic enlightenment. The Machines do not!


2 Dew—Fog—Cloud

Regarding the technological aspect, in recent times, we can see in Computer Science a paradigmatic hierarchical system consisting basically of three layers [5]. To start from the lowest level, the level of equipment and devices which are directly responsible for our physical environment, controlling anything from soil humidity to public lighting, from home temperature to traffic lights—in recent years we recognise the paradigm of Dew-Computing [7, 11]. Generally, Dew-Computing is responsible for the layer of devices which are below the Edge of Internet, and which are directly responsible for specific physical aspects of our environment. Therefore, in Dew, the devices may inter-coordinate or be coordinated, but they may not be "ordered" outside the parameters set by the human user or natural process (a small sketch of this principle is given at the end of this section). Consequently, Dew Droplets, basic constituent elements of Dew, must follow two major principles, not existing in the rest of the hierarchy—self-sufficiency and cooperation [3]. On this level information processing is essential.

As the Dew evaporates to slowly become a Cloud, or the Cloud touches the Earth to spread the freshness of Dew over Nature, in between them there lies Fog. Fog-Computing is the hierarchical layer between the heterogeneity of Dew Devices, the complexity of Cloud Processing and the wilfulness of Human users. The main areas Fog-Computing has to deal with are ergonomics, human–computer interaction, service delivery support, multi-service edge computing, adaptive application support, communication mobility and rerouteing, etc. In view of the central role of Fog, as the main connection to humans, an aforementioned consistent and expandable Ontology would hugely advance the possibilities of human–computer interaction, as it would be easily integrated into specific human language forms, enabling seamless communication in both directions. Though the devices at the Dew level must be directly accessible for manual control, it is easily envisionable that most of the monitoring, control, as well as coordination of those devices may come into the realm of Fog.

The third layer in the emerging computing hierarchy is the layer of Clouds. Clouds are far away, somewhere, nobody knows where. Because it is irrelevant. Cloud-Computing is responsible for networking, monitoring, modelling and control (suggestive and/or directive), mass storage, advanced usage, etc. There is High Performance Computing (HPC), High Throughput Computing (HTC), the evolving area of High Productivity Computing (HProC) [9], the realm of Artificial Intelligence… In view of that, Clouds are the main glue of the Global System. However, they are still primarily data centred. This can quite easily be perceived in the scientific environment, where usage of results obtained by others may be overwhelmingly complicated due to inconsistent global data formats, and often complete lack of ontologically systematised meta-data. This can also be painfully perceived in the enormous amount of special, specific and "proprietary" data formats, each defined programmatically, even syntactically, but none defined ontologically and semantically.
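The Dew principle of "suggestions, but never commands" can be illustrated with the following sketch. The class, its parameters and the clamping rule are hypothetical illustrations of the stated principle, not an API from the Dew-Computing literature cited above: the droplet keeps operating on locally set parameters even when disconnected, and a Cloud suggestion is accepted only within the limits fixed by the human user.

```python
class ThermostatDroplet:
    """A self-sufficient Dew device: it works without the network and
    accepts Cloud suggestions only inside user-set limits."""

    def __init__(self, user_setpoint, user_min, user_max):
        self.user_min = user_min        # hard limits set locally by the human
        self.user_max = user_max
        self.setpoint = user_setpoint   # current working parameter

    def suggest(self, cloud_setpoint):
        """A Cloud-side coordinator may *suggest* a new setpoint, e.g. to
        shave a regional power peak; it cannot command one."""
        self.setpoint = min(max(cloud_setpoint, self.user_min), self.user_max)

    def step(self, measured_temperature):
        """Local control loop: runs identically with or without the Internet."""
        return "heat on" if measured_temperature < self.setpoint else "heat off"


droplet = ThermostatDroplet(user_setpoint=21.0, user_min=18.0, user_max=23.0)
droplet.suggest(15.0)       # an over-aggressive suggestion is clamped to 18.0
print(droplet.setpoint)     # 18.0
print(droplet.step(17.2))   # heat on
```

Cooperation between droplets could follow the same pattern: peers exchange suggestions, and each device decides locally within its own limits.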


3 The Rainbow

The other two over-important aspects of activities which Computer Science, through its Philosophy, its Human Collaboration and its Technological Development, has to embrace are the humanistic and the naturalistic aspects of its endeavours. It is true that Computer Science is still regarded as a Technical Science. However, from previous discussion, it becomes obvious that the development of Philosophy of Computing, and basing future "naturo-humano-techno-logical" developments on full multidisciplinarity over the whole range of natural and humanistic sciences, is essential in forming the envisioned Global Information Processing Environment [10] as a sustainable and human and nature appropriate self-organising (!) Global Ecosystem. An initial view of a Computing Architecture of such Rainbow-Computing future environment is more thoroughly given in [8].

The use of colours, as in healing so, as symbols, in philosophy, is ages old. Let us, therefore, take some symbolic meanings of colours to shed some light on the over-complex area of Computer Science as seen from afar. Note that each of these symbolic colours has a basic meaning and, naturally, a wide spectrum of elements to be covered. As in sunlight itself, all the symbolic rainbow colours are thoroughly intertwined. This means that each and every investigation into or development of each of those areas has to take into account all the others as well.

3.1 Infrared—Energy

Efficiency, Garbage, Quality

There is much talk about computing efficiency, not only of hardware, but also on algorithmic level, and a lot of effort is put into lowering energy consumption of computing equipment. And ever and again new "green"-er computers emerge. However, let us consider the following fact: The previously mentioned estimated mass of electrical/electronic equipment junked, thrown into garbage in 2018, the 50 million tons, i.e. 50 billion kilograms, when divided by the estimated Earth human population mid-2018 (7.6 billion), means that each and every human on this planet, including the youngest and the oldest, the richest and the poorest and each of the remotest rainforest dwellers, actually threw away 6.6 kg (!) of complex devices, of which only around 20% is formally recycled [12]. So can it be in any way true that new "green"-er equipment recaptures in its "low(er) consumption" the enormous amounts of energy necessary for its development, production, installation, deployment, as well as the deinstallation and disposal of previous equipment?

It is obvious that our energy balance is much off-balance. The same can be found on all levels of our behaviour as a civilisation—replacing things constantly with "newer" ones, as generally, it is our disease to perceive "new" as a priori "better", without even (re-)thinking.

The only reasonable answer to this ever-growing problem is long-term quality. Our garbage production economy even invented "product lifetime management", to programmatically limit the lifetime of a product. We often see this, also, in software, based on time-limiting licenses, and newer versions always, for some unimaginable reason, necessitating the newest versions of both operating systems and hardware. Therefore, a global effort has to be put into developing long-term viable computers, extremely efficient operating systems and software which does what people really necessitate, standards which do not change abruptly and non-compatibly, and products of all kinds which can stand the trials of time. This effort has to involve all levels of our societal, political and global environments.
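As a rough check, the per-capita figure quoted above follows directly from the two 2018 estimates cited from [12]:

\[
\frac{50 \times 10^{9}\ \text{kg}}{7.6 \times 10^{9}\ \text{people}} \approx 6.6\ \text{kg per person}
\]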

3.2 Red—Hardware

Architecture, Memory, Operations

Almost all of our present-day computing and communication hardware is based on a legacy from the early 1940s binary system and serial processing. Though there were in the meantime several primarily vector and parallel architectures developed and used, and presently we see much progress in adopting parallel Graphics Processors for general computing, the binary orientation did stay. It even came to the point that we often use the notion of "digital" to mean "binary"! Anything countable in any number system is digital! A worthy area of future investigation is in the area of non-binary (i.e. ternary, quaternary, or even decimal) logic circuits and hardware communication protocols. This may lead towards a huge speed-up of computing and communication.

Quantum Computers became recently a buzzword. Huge advancements are popularly expected from their development. But, alas, a Quantum Computer cannot, by its basic principles, be Turing complete! This means that only a specific class of problems can be solved by them, and no generic ones. A worthy remedy to this huge problem is "Quassical Computing", a hybridisation of Quantum and Classical computing, as elaborated in [1].

The development of Analogue and Hybrid (Analogue/Digital) Computers stopped in the mid-1970s, as in that time analogue technology was not developed enough to enable high precision and easy reprogrammability. Fortunately recently we introduced Field Programmable Analogue Arrays (FPAA-s), an analogue equivalent of the digital Field Programmable Gate Arrays (FPGA-s), which solved the abovementioned programming (and much of the precision) problem. However, many other architectural avenues are unexplored or underexplored. An extremely interesting and important approach using Lukasiewicz logic as its basis is the development of the Extended Analogue Computer described in [6].

Due to modern (dynamic) memory design, and due to the fact that internal processor instruction execution times became much shorter than memory access times, the approach and layout of a programme directly influence the speed of execution. Up to mid-1990s the memory speed was one-to-one to the instruction processing speed. A faster processor needed a faster memory. Today, however, this is not any more the case, as memory latency cannot be much improved. Our most commonly used memory today is not Random Access Memory (RAM), but Sequential Burst Memory. Therefore, any algorithm necessitating random access to some large memory-resident array (linked lists, complex structures, etc.), as opposed to algorithms doing linear serial processing, necessitates several times longer processing. A small experiment showed that this difference is approximately 6 (six!) times on the newest processor in a scientific cluster, as opposed to equal times on a 1996 scientific computer. Much work will have to be done in the area of random access speed-up and new memory layout architectures (a small illustration of this access-pattern effect is sketched at the end of this section).

Regarding the basic operations at the hardware level, huge advancements could be made by making Algorithmic Hardware (Co-)Processors (AHP-s). Through years of investigation, we defined a huge amount of high-level algorithmic operations which are commonly used in the form of so-called "libraries". Hardwarisation of such algorithms would extremely raise the processing speed and also much simplify the programming. Furthermore, as more and more processing is done using multidimensional spaces, hardware support for multidimensional numbers (primarily complex and quaternions) would be very welcome.

The development of new/alternative/enhanced computer architectures is becoming a necessity in our time, as the "classical" architectural approach is already at the upper edge of Moore's Law. The great challenge, in all three paradigmatic areas, in Dew-Computing, Fog-Computing and Cloud-Computing, will be to coordinate and allow seamless cooperation of a yet unknown large range of extremely heterogeneous architectures and consequently also programming principles.
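The access-pattern effect described above can be felt even from a high-level language. The sketch below is only indicative: the absolute times and the ratio depend on the processor, the memory subsystem and here also on the Python interpreter, so it will not reproduce the roughly six-fold figure reported above, but the same sum taken in a shuffled visiting order is measurably slower than a linear pass.

```python
import random
import time

N = 5_000_000
data = list(range(N))

# Linear pass: consecutive indices, the pattern burst/prefetched memory likes.
t0 = time.perf_counter()
total_seq = sum(data[i] for i in range(N))
t_seq = time.perf_counter() - t0

# Same sum, but the visiting order is shuffled, so successive accesses
# land in unrelated regions of the array.
order = list(range(N))
random.shuffle(order)
t0 = time.perf_counter()
total_rnd = sum(data[i] for i in order)
t_rnd = time.perf_counter() - t0

assert total_seq == total_rnd          # identical result, different cost
print(f"sequential: {t_seq:.2f} s, shuffled: {t_rnd:.2f} s, "
      f"ratio: {t_rnd / t_seq:.1f}x")
```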

3.3 Orange—Creativity

Stimulation, Ideas, Education

The Art of Creativity is a gift of our existence. However, to actually be able to create there are three necessary prerequisites of Creativity: Knowledge of Thinking, Knowledge of Principles and Knowledge of Making. And, naturally, the stimulation of creativity and knowledge acquisition on all levels and in all areas is essential for further continuance of our civilisation.

The complexities of modern-day world are exuberant. Therefore, special attention has to be given to proper Knowledge of Thinking. Traditionally different areas of human endeavour developed their own particular ways of generic thinking. The way a political scientist approaches some problem is quite different from the approach a computer scientist will take. However, in the future, we will have to organise Education that promotes the full spectrum of specialised "thinkings", as problems posed in front of us cannot be solved without the view of the whole.

One can, unfortunately, see that the Knowledge of Principles is much disregarded in modern education on all levels. It is not enough to teach theory. Without real practical experience by constructing from basic principles, how do we expect future scientists to be able to have and develop novel creative ideas?

And finally the Knowledge of Making. Presently we are quite versed in the knowledge of making in the "virtual", computer world. For example, through Computer Assisted Design and Manufacturing (CAD/CAM) a vast majority of present-day products is "made" in a virtual environment, and then, very often, directly given to a robotic factory to be produced. However, for the development of real creativity actually a "hands-on" experience (and knowledge) of all production levels is necessary. Where have we forgotten the Art of Repair?

Ideas are what starts the whole process. With human endless creativity, fantastic results can ensue. Creativity is, as ideas are, boundless. From the worthiest to the weirdest, from the most beneficial to the most dangerous. Therefore, not all ideas, as much as they may seem to enhance something, are globally beneficial, and many of them often in our past came out to be actually quite detrimental to our global ecosystem. Hence, the results of Creativity have to be, before being applied anywhere, thoroughly explored through the (Yellow) Appropriateness Filter and as many as possible consequences taken into account.

3.4 Yellow—Appropriateness Filter Consequences An age-old wisdom says in a proverb: “The way to hell is paved by good intentions”. And then it also says: “Hell is full of good meanings, Heaven is full of good works”. So what makes the distinction between ‘good intentions’ and ‘good effects’? Appropriateness = Future Consequences. It should be said that the basic Appropriateness Filter for any (Orange) idea, creation, development or usage must be strictly based on (Indigo) Ethics. What will future consequences be? Will it benefit a (relatively) small community, but harm (in any way) another community, society, humanity or the natural environment? Without exploring future global and local consequences of each and every development we do in any field by using Knowledge, Wisdom and Conscience and taking Responsibility, we may end up in a world no-one of us would even like to imagine, the less live in it. But our descendants would have to. To be able to foresee at least some of those future consequences it is necessary to have a very wide perspective on the interrelationships and interferences which may be changed or aroused by the introduction of new development. The science of Ecology, through Cybernetic principles applied to the coordination and modelisation of knowledge and experience from all three major object-fields (nature, human, machine), must be used for any kind of exploration of possible consequences on the local and global human, social, political and environmental life. For that to be possible, we need people who have a wide overview, and deep knowledge of interdependencies of all fields of science. But, how will we be able to have such necessary scientific population educated to understand wide principles if we continue insisting on more and more narrow specialistic knowledge education?


Nevertheless, it is up to each and every one of us involved in any kind of creativity to apply all of our knowledge and best conscience to evaluate, as well as possible, the appropriateness of any idea before introducing it, and, if we are in any way uncertain, to cooperate with others to foresee the possible consequences of our work.

3.5 Green—Nature: Environment, Health, Well-Being, Backup Systems, Global Ecosystem

This is the area which is, of all, most closely connected with, and responsible for, our physical environment, be it heating/cooling, cooking or health monitoring and support, and it is directly responsible for our and Nature’s physical well-being. This includes systems which can improve energy efficiency, like traffic and lighting control; “smart” homes, cities, forests, islands; health and disabled/elderly care [4]; early natural (or other) disaster warnings, (forest-)fire detection and prevention, soil cultivation; and, extremely important, backup systems necessary for basic survival (if the modern infrastructure fails in any of its aspects). Technologically speaking, this is the prime area of Dew-Computing. In this regard, Dew-Computing has to take into account not only the “computer” side but also human and natural prerequisites. On the level of the global ecosystem, much investigation and scientific work has to be done in many sciences to start understanding the principles, variables and limits of the uncountable number of elements responsible for its balance, and their interrelationships, as well as the principles governing the consequences of the self-organisation of a sub-system and of the global ecosystem driven by any change. By careful and well-thought-out steps, huge benefits can be achieved by applying an Ecology-conscious Dew system in the field of agriculture/forestry on all levels: care of the humidity and mineral content of the soil, the amount of insolation, etc., as well as chemical and meteorological warnings/protection, e.g. hail prevention. This then includes environmental sensors and appropriate effectors which “rectify” the deficiencies, as well as “alarms” in cases where a “rectification” of a problem has to be thought out and performed in some other area or level of the whole ecosystem. Such a system of computer assistance to nature in agriculture would (will) bring a wealth of good: high-quality natural food production, enabling a higher level of general health of the world population and bringing a higher quality of life. As this Green area of Computer Science is intertwined with the natural ecosystem, special attention has to be given to ethical and ecological principles and standards. Let us not forget that machine-devices on this level directly influence the physical existence of both humans and the whole global natural ecosystem.


3.6 Blue—Communication: Information, Knowledge, Human–Computer Interaction, Languages

How often do we talk about Information Processing, Information-Communication Technology, Informatisation…, but what do we see in modern times? The development is primarily oriented towards data. As explained earlier, Data is not Information, as it lacks the context of an ontological system; consequently, a huge amount of data cannot be generically transformed into a huge amount of information. Therefore, it is extremely hard to gain Knowledge out of data. Sometimes, if the data is completely contextless, it is even impossible. Consequently, it is essential to start introducing Information Communication instead of Data Transfers. Regarding human–computer interaction and communication, as well as machine–machine communication, we will have to develop a generic ontology, semantics or at least generic compatibility on the level of “what is communicated” and “what is to be done”. Only by enabling seamless integration of our human linguistic understanding (all human languages share a common ontological and expressional system) with what machines understand will we be able to develop a Global Information Services Environment which would not become an uncontrollable, intolerable Tower of Babel, as it is now. It will be necessary, as much as our continuous efforts will enable, to aspire towards a communicational “golden age”, as it is beautifully expressed in the Bible: “And the whole Earth was of one language, and of one speech” (Genesis 11:1). Naturally, this does not mean in any way harming the magnificent diversity of human languages, but, in the sense of Computer Science, helping to develop such a system of notions which would encompass most of the common ground of machine and human understanding, based on the needs and possibilities of every element involved (machine, human and natural). The once flourishing area of Computer Linguistics (not Computational Linguistics!) seems to have been almost completely abandoned. Presently, human–computer interaction and communication is primarily based on some preprogrammed possibilities, or on the tedious and exhausting “programming” in “programming languages” which are, due to historical reasons, actually a slightly more human-approachable machine code, and not adaptable structured formalisations derived from human language principles. Therefore, both Computer Linguistics and Ergonomics are basic building blocks in which serious development has to be done to properly encompass the area of human–computer interaction. Although the problems of language and ergonomics spread throughout the whole spectrum of human activities, this area would, technologically speaking, be the prime area of Fog-Computing. On this level, the communication necessary for the effectuation of individual functionalities has to be recognised by filtering information, forwarding only messages which are part of a specific semantic field to the appropriate equipment, services, memories or humans. This is, naturally, possible only in the case that the information context of the transferred messages is ontologically/semantically/linguistically known.


3.7 Indigo—Cooperation: Ethics, Information Use, Redundancy, Knowledge Gathering and Preservation

Many areas in Computer Science open up deep and important ethical questions. Ethics deals with concepts of human right- and wrong-doing as generic terms applicable to any form of living or non-living entities. Ethical questions are raised all around Computer Science: from “cyborgisation” to everyday survival, from privacy to “Artificial” Intelligence, from self-driving cars to population behaviour control, from robots to wrongful information. Naturally, Ethics, as a philosophical discipline, is also a constantly evolving scientific field, and Computer Science Ethics, a discipline of the Philosophy of Computing, has to maintain a constant effort towards furthering the understanding and resolution of ethical issues in the application of computing to any aspect of human and natural life. In this context the philosophical stipulation of a Scientific Conscience, based on carefully elaborated ethical principles, is extremely important, both in the area of information interchange and use and in the area of preventing possible future problematic consequences of unwished, unwanted and undesirable uses of computer technology. Technologically speaking, this is the prime area of Cloud-Computing, integrated in the Dew-Fog-Cloud Ecosystem. Therefore, ethical principles have to be implemented throughout the whole computing hierarchy, and a lot of effort shall be put into the cooperation of scientists on all levels. It is obvious that for the proper future development of Computer Science it is necessary to encompass in its spectrum also the Social and Humanistic Sciences and the Arts, as well as legal, political and all other intellectual efforts. It is important to note that, due to ad hoc developments, the usage of the world’s communication and computation conglomerate is extremely inefficient, and that more often than not the search for information and the use of various services calls for huge efforts from the human user, being the result of a series of trials and errors. Consequently, knowledge gathering, intellectual or computerised, is extremely hard to perform. A consistent (Blue) Ontology, with the introduction of a generic semantic and syntactic linguistic ecosystem, would enable a gradual solution to the problem of the present-day overload of the data-space (by turning it into an information space), and would moreover enable the introduction of behavioural rules (which can be integrated directly into the linguistic ecosystem). Another very important aspect which we will have to take into account in future development is redundancy. Presently, the global data/information space, the Internet, is overflowing with redundancy (e.g. multiple copies of the same data, not even aware of the fact that they are the same), necessitating huge amounts of machinery and energy to sustain. However, in the long run, we will have to develop methodologies for global redundancy “pruning”, and define long-term minimal redundancy levels, principles, methodologies and technologies for the preservation of historically important long-term information/knowledge. Although leaving a trace in future history may not be the first thing on our minds, it is extremely important. As the Latin proverb states: “Historia est magistra vitae”. All our efforts, successes and failures have important meaning for the future of humankind, the same way we learned from previous civilisations. This future-history knowledge-preservation aspect is very much absent in our days. Imagine that our civilisation, like so many historical ones did, ceases to exist. Let us take a time machine and jump just two thousand years into the future. What is left of all our efforts, machinery and developments? How could anybody ever know we had plastics? How could any future archaeologist even suspect we had mobile phones, extremely powerful micro-components and microsystems, and that we had a vibrant artistic life in our virtual spaces? The present drive towards digitalisation is very bad regarding preservation: many problems have already arisen, and keep arising, due to the very feeble physical nature of the way we store data. And, furthermore, now, as we have jumped 2000 years into the future, imagine you find a well-preserved disk or memory-stick, or a diskette, or a CD/DVD in a time capsule from the year 2019. There will be nothing readable any more! And even if there were: What is the meaning, the principles, the electrical levels, the timing, the cypher, the format of it? As a future archaeologist you would, with utmost care, put that historical artifact as an exhibit in a museum showcase, never risking destroying it by trying to read it.

3.8 Violet—Interference: Security, Limits of Expansion, Human–Computer Interference, Interference Processing

The realm of security is very important in Computer Science. But, unfortunately, the meaning of this ‘security’ is usually limited to access, communication, data/information and persona protection and similar. After the disaster hurricane Dorian caused on the Bahamas (2/3 September 2019), I overheard a clip of an eyewitness speaking of the consequences (paraphrasing): “Everything is gone! Everything! The houses, the homes! The banks!…”. After such a terrible experience, mentioning the Banks in second place is sociologically extremely interesting: no bank—no money—no food—what next? The marvellous modern “digital” society is gone. There are no backup systems independent of the technical infrastructure. This is the wide sense of the term Security. Another possible disaster, with presumably even worse consequences for the whole of our civilisation, is described earlier in this article. Computer Science will have to tackle this enormous problem, together with experts from all fields, international standardisation bodies and policy makers. And, finally, it is the most basic responsibility of anybody introducing a new system to provide appropriate backup solutions, specifically since, in our civilisation, survival itself may be seriously endangered without the present-day machine infrastructure. There are also limits of expansion: every system has a limit of growth, development and expansion. Alas, some buzzwords in the modern world promote limitlessness. What would an Internet of Everything (IoE), so promotionally propagated all around, be and include? Would that “IoE” include a chip in me, so that my brain and my thoughts become part of the Internet? Where is the limit? Do we want to become a race of Cyborgs, half flesh, half machine? Do we want to make a Virtual World, and leave it to the machines to take basic care of our bodies, just to sustain them in the real world? Will the plum-tree in my garden have to ask a remote weather control system to please send some rain at its geocoordinates? And will the willow next door approve? An area presently completely unexplored is the interference between computers and humans on the physical and quantum level, that is, including wrong buttons, and including those days when we are overstressed and the computer just does not want to work properly. (I doubt anybody can claim never to have experienced it.) The same applies to any kind of disruptive interference between the environment and the computers, interfering, as a direct consequence, with human activities and life. This area of future exploration and the development of guidance are quite important, as such occurrences significantly raise human users’ stress levels and negatively influence both individual and collective health. Finally, technological evolution through novel research into optical and quantum processing, with a possible future in optical interference processing and multi-state or analogue quantum interference computing, may be in front of us. By the formation of interference patterns as a means of storing and processing information, a high level of interrelatedness can be achieved in high-frequency spectra.

3.9 Ultraviolet—Visions: Wisdom, Prudence, Conscience, Responsibility, Holism

Visions are what makes us, the whole of human civilisation, move forward. Visions of a better future are everywhere: a better individual future, a better (smaller or larger) collective future, or a better future for us humans in a stable and flourishing natural environment. Well, consolidating all human visions is impossible. However, we can recognise a certain futuristic vision in all areas of civilisation. It is obvious that the so-called “digitalisation”, actually computerisation, is ubiquitously penetrating everything. The common vision behind this development is a world in which humans are almost exempt from physical labour of any kind (like having to go to the bank to take out paper money, going to the local bar to talk with friends, or moving steel bars into and out of a steel-press), and exempt from much of the “menial” intellectual work (what is 12 * 12, how does one drive…), and in which most of the communication, control and monitoring of all aspects of our immediate and wider environment is done through some “virtual” means. But what would people do, if they did not have to do anything? In the background of this vision a centuries-old philosophical view is still active: “By not having to perform menial work, humans will have much more time to develop their intellectual and artistic abilities.” But, alas, a few centuries of machine development have shown us that, though there are huge sociological (better or worse) changes (and challenges) due to the constant (over-)introduction of ever new machine-based technological solutions, the main flaw in the above view is that by removing much of the physical labour we introduced a huge amount of “sitting in the office” labour, and by removing more and more “menial” intellectual work we start being less and less intellectually apt, becoming extremely dependent on technology. Not to mention that in any event of infrastructure failure we become completely helpless in almost all aspects of our everyday life. What do we want? What do we want for our children and grandchildren? What kind of future do we want? need? like? worry about? fear? dread? We need to use wisdom, prudence and conscience to project our visions of the future, and we always have to take responsibility for our visions. Only through a holistic view of the global nature–human–machine ecosystem can our Visions and aspirations pass through the (Yellow) Appropriateness Filter: only by applying the highest knowledge of dynamic-systems behaviour to as many aspects of human civilisation as possible, and using all the knowledge that different scientific fields, natural, technical, artistic and humanistic, have collected, can we aim to understand the possible consequences which some particular vision or action in its applicable environment may provoke on the wider scale, up to the global ecosystem.

4 Philosophy of Computing

Unexpectedly, Computer Science must, given the speed of “machine” penetration into all aspects of human and natural life, enormously widen its area of knowledge, including primarily philosophical, ethical, humanistic and naturalistic aspects. In this regard, the development of a Philosophy of Computing is a way to encompass this enormously complex system by providing a high and broad viewpoint. So we could “define” that Philosophy of Computing is a scientific way of thinking about the world, the machines, the humans and society, nature, and their interrelationships, primarily through the lens of Cybernetics and Computer Science, and that its main aim is to adapt the Computer Science Visions to the highest ethical, psychological, sociological and ecological standards our civilisation has and can achieve.

Acknowledgements The author wishes to express his gratitude for scientific input and support to Gordana Gredičak Šojat, Karolj Skala, the Centre for Informatics and Computer Science of the Ruđer Bošković Institute, Zagreb, Croatia, and the Organising Committee of ICBDCC19 at Karunya University, Coimbatore, India. It is also necessary to express the highest gratitude and honour to the countless scientists and explorers who gathered and evolved our present-day knowledge.

References

1. E.H. Allen, C.S. Calude, Quassical computing. Quant. Phys. (2018). https://arxiv.org/pdf/1805.03306.pdf. Accessed 26 Nov 2019, 14:50 CET


2. B. Bebek, Z. Šojat, Harmony: spiritual, mental, strategical, operational. Preliminary Approach (2018). https://www.researchgate.net/publication/328466718_Harmony_Spiritual_Mental_Strategical_Operational. Accessed 14 Nov 2019, 12:51 CET
3. M. Gusev, Y. Wang, Formal description of Dew computing, in Proceedings of the 3rd International Workshop on Dew Computing (DewCom 2018) (IEEE DewCom STC, Toronto, Canada, 2018), pp. 8–13
4. Y. Gordienko et al., Augmented coaching ecosystem for non-obtrusive adaptive personalized elderly care on the basis of Cloud-Fog-Dew computing paradigm (MIPRO 2017)
5. P. Kukreja, D. Sharma, A detail review on Cloud, Fog and Dew computing. Int. J. Sci. Eng. Technol. Res. (IJSETR) 5(5) (2016)
6. J.W. Mills et al., Extended analog computers: a unifying paradigm for VLSI, plastic and colloidal computing systems. Semantic Scholar. https://pdfs.semanticscholar.org/eaad/7b8f93265286106c3ce24ff17dd794911674.pdf. Accessed 26 Nov 2019, 15:04 CET
7. P.P. Ray, An Introduction to Dew Computing: Definition, Concept and Implications, vol. 6 (IEEE Access, 2017), pp. 723–737. https://doi.org/10.1109/access.2017.2775042
8. K. Skala, Z. Šojat, The Rainbow Global Service Ecosystem (DewCom 2018, IEEE DewCom STC, Toronto, Canada, 2018), pp. 25–30
9. Z. Shoyat, An approach towards high productivity computing, in First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014) (ARCOS, University Carlos III Madrid, Spain, 2014), pp. 27–35. ISBN 978-84-617-2251-8
10. Z. Šojat, K. Skala, Views on the role and importance of Dew computing in the service and control technology (MIPRO 2016), pp. 175–179. https://doi.org/10.1109/mipro.2016.7522131
11. Z. Šojat, K. Skala, The dawn of Dew: Dew Computing for advanced living environment (MIPRO 2017), pp. 375–380. https://doi.org/10.23919/mipro.2017.7973447
12. World Economic Forum, A new circular vision for electronics: time for a global reboot. http://www3.weforum.org/docs/WEF_A_New_Circular_Vision_for_Electronics.pdf. Accessed 13 Nov 2019, 22:39 CET

L1 Norm SVD-Based Ranking Scheme: A Novel Method in Big Data Mining Rahul Aedula, Yashasvi Madhukumar, Snehanshu Saha, Archana Mathur, Kakoli Bora, and Surbhi Agrawal

Abstract Scientometrics deals with analyzing and quantifying works in science, technology, and innovation. It is a study that focuses on quality rather than quantity. The journals are evaluated against several different metrics such as the impact of the journals, scientific citation, SJR, SNIP indicators as well as the indicators used in policy and management context. The practice of using journal metrics for evaluation involves handling a large volume of data to derive useful patterns and conclusions. These metrics play an important role in the measurement and evaluation of research performance. Due to the fact that most metrics are being manipulated and abused, it becomes essential to judge and evaluate a journal by using a single metric or a reduced set of significant metrics. We propose l1-norm singular value decomposition (l1-SVD) to efficiently solve this problem. We evaluate our method to study the emergence of a new journal, Astronomy and Computing, by comparing it with 46,000 journals chosen from the fields of computing, informatics, astronomy, and astrophysics.

Keywords l1-norm · Sparsity norm · Singular value decomposition · Journal ranking · Astronomy and computing · Big Data

R. Aedula (B) · Y. Madhukumar · S. Saha · K. Bora · S. Agrawal PES Institute of Technology, Bangalore South Campus, Bengaluru, India e-mail: [email protected] Y. Madhukumar e-mail: [email protected] S. Saha e-mail: [email protected] K. Bora e-mail: [email protected] S. Agrawal e-mail: [email protected] A. Mathur Indian Statistical Institute, 8th Mile, Mysore Road, Bengaluru, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_2


1 Introduction

Scientometrics evaluates the impact of the results of scientific research by placing focus on the work’s quantitative and measurable aspects. Statistical and mathematical models are employed in this study and evaluation of journals and conference proceedings to assess their quality. The explosion of journals and conference proceedings in the science and technology domain, coupled with the insistence of different rating agencies and academic institutions on using journal metrics for the evaluation of scholarly contribution, presents a Big Data accumulation and analysis problem. This high volume of data requires an efficient metric system for fair rating of the journals. However, certain well-known and widely used metrics such as the impact factor and the H factor have been misused lately through practices like non-contextual self-citation, forced citation, copious citation, etc. [1]. Thus, the way this volume of data is modeled needs improvement because it influences the evaluation and processing of this data to draw useful conclusions. One effective way to deal with this problem is to characterize a journal by a single metric or a reduced set of metrics that hold more significance. The volumes of data scraped from various sources are organized as a rectangular m × n matrix, where the m rows represent the articles in a journal and the n columns represent various scientometric parameters. An effective dimensionality and rank reduction technique such as the singular value decomposition (SVD), applied on the original data matrix, not only helps to obtain a single ranking metric (based on the different evaluation parameters enlisted as the various columns) but also identifies patterns used for efficient analysis of the Big Data. Apache Mahout, Hadoop, Spark, R, Python, and Ruby are some tools that can be used to implement SVD and other similar dimensionality reduction techniques [2]. One notable characteristic of the scientometric data matrix is its sparsity. The matrix is almost always rectangular, and most metric fields (columns) do not apply to many of the articles (rows). For instance, a lot of journals may not have patent citations. Similarly, a number of other parameters might not apply to a journal as a whole. Usually, m and n differ from each other considerably. Thus, by virtue of this sparsity, the efficiency of the SVD algorithms can be enhanced when coupled with norms like the l1-norm, the l2-norm, or the group norms. In general, both sparsity and structural sparsity regularization methods utilize the assumption that the output Y can be described by a reduced number of input variables in the input space X that best describe the output. In addition to this, structured sparsity regularization methods can be extended to allow optimal selection over groups of input variables in X.

2 The Depths of Dimensionality Reduction

Dimensionality reduction has played a significant role in helping us ascertain the results of analysis for voluminous datasets [3]. The propensity to employ such methods comes from the phenomenal growth of data and the velocity at which it is generated.


Dimensionality reduction, such as singular value decomposition and principal component analysis, solves such Big Data problems by extracting the more prominent features and obtaining a better representation of the data. The reduced data tends to be much smaller to store and much easier to handle for further analysis. These dimensionality reduction methods are very often found in most of the tools which handle large datasets and perform rigorous data analysis. Such tools include Apache Mahout, Hadoop, Spark, R, Python, etc. The ease of employing such methods is directly dependent on the ability of these tools to compute and assess the results quickly and store them efficiently, all while managing the available resources optimally. The divergence in the methods these tools use to compute such algorithms gives us scope to study and evaluate such scenarios and helps us choose the right kind of tool to perform these tasks.

2.1 PCA

Principal Component Analysis (PCA) is a technique mostly used in statistics to transform a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components. These principal components are a representation of the underlying structure in the data, or the directions in which the variance is largest and where the data is most concentrated. The procedure lays emphasis on variation and the identification of strong patterns in the dataset. PCA extracts a low-dimensional set of features from a higher-dimensional dataset, simultaneously serving the objective of capturing as much useful information as possible. PCA is most commonly implemented in two ways (a comparative sketch of both routes follows this list):

• Eigenvalue Decomposition of a data covariance (or correlation) matrix into the canonical form of eigenvalues and eigenvectors. However, only square/diagonalizable matrices can be factorized this way, and hence it also takes the name matrix diagonalization.
• Singular Value Decomposition of the initial higher-dimensional matrix. This approach is relatively more suitable for the problem being discussed since it exists for all matrices: singular, non-singular, dense, sparse, square, or rectangular.
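To make the two routes concrete, here is a minimal NumPy sketch (on a made-up random matrix standing in for the scientometric data, with arbitrary dimensions) showing that the covariance eigendecomposition and the SVD of the centred data recover the same principal directions.

import numpy as np

# Toy stand-in for the scraped data: rows = articles/journals, columns = metrics.
rng = np.random.default_rng(0)
A = rng.normal(size=(200, 7))
A_centered = A - A.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix ("matrix diagonalization").
cov = np.cov(A_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)              # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
components_eig = eigvecs[:, order]

# Route 2: SVD of the centred data matrix; right singular vectors are the principal directions.
U, s, Vt = np.linalg.svd(A_centered, full_matrices=False)
components_svd = Vt.T

# Both routes span the same principal directions (columns may differ only in sign).
for k in range(3):
    assert np.allclose(np.abs(components_eig[:, k]), np.abs(components_svd[:, k]), atol=1e-8)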

2.2 Singular Value Decomposition

Singular value decomposition is the factorization of a real or complex matrix. Large-scale scientometric data is mined using suitable web scraping techniques and is modeled as a matrix in which the rows represent the articles in a journal published over the years, and the columns represent various scientometric indicators proposed by experts of evaluation agencies [4]. The original data matrix, say A, of dimension m × n and rank k, is factorized into three unique matrices U, V, and W^H:


• U—matrix of left singular vectors, of dimension m × r
• V—diagonal matrix of dimension r × r containing the singular values in decreasing order along the diagonal
• W^H—matrix of right singular vectors of dimension n × r. The Hermitian, or conjugate transpose, of W is taken, changing its dimension to r × n, and hence the original dimension of the matrix is maintained after the matrix multiplication. In this case of scientometrics, since the data is represented as a real matrix, the Hermitian transpose is simply the transpose of W.

r is a very small number numerically representing the approximate rank of the matrix, or the number of “concepts” in the data matrix A. Concepts refer to latent dimensions or latent factors showing the association between the singular values and individual components [4]. The choice of r plays a vital role in deciding the accuracy and computation time of the decomposition. If r is equal to k, then the SVD is said to be a full-rank decomposition of A. Truncated SVD, or the reduced-rank approximation of A, is obtained by setting all but the first r largest singular values equal to zero and using the first r columns of U and W [5]. Therefore, choosing a higher value of r, closer to k, would give a more accurate approximation, whereas a lower value would save a lot of computation time and increase efficiency.
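The truncated SVD just described can be sketched in a few lines of NumPy; the matrix below is a random stand-in for the scraped journal-by-metric data, and the choice r = 3 is arbitrary.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(1000, 7))                     # hypothetical m x n feature matrix

U, s, Wh = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Wh

r = 3                                              # approximate rank / number of "concepts"
A_r = U[:, :r] @ np.diag(s[:r]) @ Wh[:r, :]        # truncated (reduced-rank) approximation

# The Frobenius error of the rank-r approximation equals the energy in the discarded singular values.
err = np.linalg.norm(A - A_r, "fro")
print(err, np.sqrt((s[r:] ** 2).sum()))            # the two numbers agree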

2.3 Regularization Norms

In the case of Big Data, parsimony is central to variable and feature selection, which makes the data model more intelligible and less expensive in terms of processing. The $l_p$-norm of a matrix or vector x, represented as $\|x\|_p$, is defined as

$$\|x\|_p = \left( \sum_i |x_i|^p \right)^{1/p},$$

i.e., the pth root of the sum of all the elements raised to the power p. Hence, by definition, the l1 norm is $\|x\|_1 = \sum_i |x_i|$. Sparse approximation, inducing structural sparsity as well as regularization, is achieved by a number of norms, the most common ones being the l1 norm and the mixed group l1–lq norm. The relative structure and position of a variable in the input vector, and hence the interrelationship between the variables, is inconsequential, as each variable is chosen individually in l1 regularization. Prior knowledge aids in improving the efficacy of estimation through these techniques. The l1 norm obeys only the cardinality constraint and is unaware of any other information available about the patterns of nonzero coefficients [6].
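A literal reading of these definitions might look as follows in NumPy; the example vector, the groups and the unit weights d_g are arbitrary choices for illustration.

import numpy as np

def lp_norm(x, p):
    # ||x||_p = (sum_i |x_i|^p)^(1/p)
    return (np.abs(x) ** p).sum() ** (1.0 / p)

def group_norm(w, groups, d, q=2):
    # Mixed l1-lq norm: Omega(w) = sum_g d_g * ||w_g||_q,
    # where 'groups' is a list of index arrays partitioning w.
    return sum(d_g * lp_norm(w[g], q) for g, d_g in zip(groups, d))

w = np.array([0.0, -1.5, 2.0, 0.0, 0.5])
print(lp_norm(w, 1))                                          # l1 norm -> 4.0
print(lp_norm(w, 2))                                          # l2 norm -> ~2.55
print(group_norm(w, [np.arange(3), np.arange(3, 5)], d=[1.0, 1.0]))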


2.4 Sparsity Via the l1 Norm

Most variable or feature selection problems are presented as combinatorial optimization problems. Such problems focus on selecting the optimal solution from a discrete, finite set of feasible solutions. Additionally, the l1 norm turns these problems into convex problems after dropping certain constraints from the overall optimization problem. This is known as convex relaxation. Convex problems are the class of problems in which the constraints are convex functions and the objective function is convex if minimizing, or concave if maximizing. l1 regularization for sparsity through supervised learning involves predicting a vector y from a set of usually reduced values/observations constituting a vector in the original data matrix x. This mapping function is often known as the hypothesis $h : x \to y$. To achieve this, we assume there exists a joint probability distribution P(x, y) over x and y, which helps us model anomalies like noise in the predictions. In addition to this, another function, known as a loss function $L(y', y)$, is required to measure the difference of the prediction $y' = h(x)$ from the true result y. Consider the resulting vectors consisting of the predicted values and the true values to be y′ and y, respectively. A characteristic called Risk, R(h), associated with the loss function, and hence in turn with the hypothesis h(x), is defined as the expectation of the loss function:

$$R(h) = E[L(y', y)] = \int L(y', y)\, dP(x, y)$$

Thus, the hypothesis chosen for mapping should be such that the risk R(h) is minimum. This is referred to as risk minimization. However, in usual cases, the joint probability distribution of the problem at hand, P(x, y), is not known. So, an approximation called the empirical risk is computed by taking the average of the loss function over all the observations. The empirical risk is given by:

$$R_{emp}(h) = \frac{1}{n} \sum_{i=1}^{n} L(y'_i, y_i)$$

The empirical risk minimization principle states that the selected hypothesis h′ must be such that it minimizes the empirical risk $R_{emp}(h)$:

$$h' = \arg\min_{h} R_{emp}(h)$$
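As a small worked illustration of empirical risk minimization (with made-up data and two hand-picked hypotheses, not anything taken from the paper), the principle simply keeps whichever hypothesis has the smaller average loss:

import numpy as np

def square_loss(y_pred, y_true):
    # L(y', y) = 1/2 * (y' - y)^2
    return 0.5 * (y_pred - y_true) ** 2

def empirical_risk(h, X, y):
    # R_emp(h) = (1/n) * sum_i L(h(x_i), y_i)
    return float(np.mean(square_loss(h(X), y)))

# Two toy hypotheses over synthetic data; ERM keeps the one with the smaller empirical risk.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.1, 3.9, 6.2])
h1 = lambda X: 2.0 * X[:, 0]          # y' = 2x
h2 = lambda X: 1.0 + X[:, 0]          # y' = 1 + x
best = min((h1, h2), key=lambda h: empirical_risk(h, X, y))
print(empirical_risk(h1, X, y), empirical_risk(h2, X, y))   # h1 wins here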

While mapping observations x in the n-dimensional vector x to outputs y in vector y, we consider p pairs of data points $(x_i, y_i) \in \mathbb{R}^n \times \mathcal{Y}$, where i = 1, 2, …, p. Thus, the optimization problem for the data matrix in scientometrics takes the form:

$$\min_{w \in \mathbb{R}^n} \; \frac{1}{p} \sum_{i=1}^{p} L(y_i, w^T x_i) + \lambda \Omega(w)$$


L is a loss function, which can either be the square loss for least squares regression, $L(y', y) = \frac{1}{2}(y' - y)^2$, or a logistic loss function. Now, the problem thus takes the form:

$$\min_{w \in \mathbb{R}^n} \|y - Aw\|^2$$

Since the variables in the vector space/groups can overlap, it is ideal to choose $\Omega(w)$ to be a group norm for better predictive performance and structure. The m rows of the data matrix A are treated as vectors or groups (g) of these variables, forming a partition equal to the vector dimension, [1:n]. If G is the set of all these groups and $d_g$ is a scalar weight indexed by each group g, the norm is said to be an l1–lq norm, where $q \in [2, \infty)$ [6]:

$$\Omega(w) = \sum_{g \in G} d_g \|w_g\|_q$$

The choice of the indexed weight $d_g$ is critical because it is responsible for the discrepancies in size between the groups. It must also compensate for the possible penalization of parameters, which can increase due to high-dimensional scaling. The factors that affect the selection are the choice of q in the group norm and the consistency that is expected of the result. In addition to this, accuracy and efficiency can be enhanced by weighing each coefficient in a group rather than weighing the entire group as a whole. The initial sparse data matrix is first manipulated using the l1-norm [6].
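A minimal sketch of this l1-driven sparsity, assuming scikit-learn's Lasso as the solver and a synthetic regression target (the paper prescribes neither), is:

import numpy as np
from sklearn.linear_model import Lasso

# Stand-in for the scientometric matrix: rows = articles/journals, columns = metrics.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 7))
y = X @ rng.normal(size=7) + 0.1 * rng.normal(size=500)   # synthetic target, for illustration only

# l1-regularized least squares; the penalty shrinks coefficients and can set some exactly to zero.
model = Lasso(alpha=0.1).fit(X, y)
w = model.coef_
print(np.count_nonzero(w), "of", w.size, "coefficients survive the l1 penalty")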

3 Methodology

An estimate of a journal’s scholastic indices is necessary to judge its effective impact. The nuances of scientometric factors such as the total citation count and the self-citation count come into play when deciding the impact of a journal. However, these factors, unless considered in ideal circumstances, do not by themselves become a good indicator of the importance of a journal. Many anomalies arise when considering these indices directly, which may misrepresent or falsify a journal’s true influence. It is therefore imperative to use these indices in the context of a ranking algorithm. The l1-norm transformation gives rise to a row matrix whose length is equal to the number of features of the pristine scientometric data. This row matrix effectively represents the entire dataset at any given iteration. The application of the singular value decomposition operation on this row matrix is key in determining the necessary norm values to remove through a recursive approach (Algorithm 1).

The singval array contains the normalized singular values of all the individual l1-norm-transformed columns. These values act as scores while addressing the impact of any given journal. In the context of singular values, the one with the lowest singval score is the most influential journal. Utilizing these scores, we can formulate a list of journals which gives preference to subtle factors such as high or low citation counts and gives an appropriate ranking. Identifying the influential journals from a column norm and contrasting it with the singular values is equivalent to recursively eliminating the low-impact journal by comparing its singular value to its Frobenius norm. This allows the algorithm to repeatedly eliminate journals and find the scores simultaneously, giving a more judicious ranking system. Our method is different from the SCOPUS journal rank (SJR) algorithm. The SJR indicator computation uses an iterative algorithm that distributes prestige values among the journals until a steady-state solution is reached. That method is similar to the eigenfactor score [7], where the score is influenced by the size of the journal, so that the score doubles when the journal doubles in size. Our method, on the contrary, adopts a recursive approach and does not assume initial prestige values. Therefore, the eigenfactor approach may not be suitable for evaluating the short-term influence of peer-reviewed journals; in contrast, our method works well under such restrictions.

Algorithm 1 Recursive l1-norm SVD
1: A ← input transposed feature matrix A
2: procedure LASSO
3:    row_matrix ← coefficients of Lasso regression
4:    return row_matrix
5: procedure SVD
6:    U, Σ, V ← matrices of SVD
7:    return Σ
8: procedure NORMALIZE
9:    Norm_Data ← normalized using l1-norms
10:   return Norm_Data
11: procedure RECURSIVE
12:   L1_row ← LASSO(A)
13:   singval[] ← SVD(L1_row)
14:   Row_Norm ← NORMALIZE(L1_row)
15:   Col_Norm ← NORMALIZE(all columns of A)
16:   Col_i ← closest Col_Norm value to Row_Norm
17:   delete Col_i from A
18:   goto RECURSIVE
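A loose Python transcription of Algorithm 1 is sketched below, assuming scikit-learn's Lasso for the regression step; the orientation of A, the regression target y, and the way the per-iteration singular values are combined into final scores are not fully specified in the paper, so they are assumptions here rather than the authors' implementation.

import numpy as np
from sklearn.linear_model import Lasso

def recursive_l1_svd(A, y, alpha=0.1):
    # Assumptions: rows of A are observations, each column is an entity to be scored,
    # and y is the (unspecified) regression target used in the LASSO step.
    A = np.asarray(A, dtype=float)
    remaining = list(range(A.shape[1]))
    scores = {}
    while len(remaining) > 1:
        X = A[:, remaining]
        l1_row = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_   # the "row matrix"
        # SVD of a 1 x n row matrix: its only singular value is the row's Euclidean norm.
        singval = np.linalg.svd(l1_row.reshape(1, -1), compute_uv=False)[0]
        # l1 norms of the row matrix and of every remaining column of A.
        row_norm = np.abs(l1_row).sum()
        col_norms = np.abs(X).sum(axis=0)
        # Eliminate the column whose l1 norm is closest to the row norm.
        drop = remaining.pop(int(np.argmin(np.abs(col_norms - row_norm))))
        scores[drop] = singval          # score recorded at elimination time
    scores[remaining[0]] = 0.0          # last survivor gets the lowest score
    # "The one with the lowest singval score is the most influential": rank ascending by score.
    return sorted(scores, key=scores.get)

# Illustrative call on synthetic data (100 observations, 7 columns) with a synthetic target.
rng = np.random.default_rng(3)
print(recursive_l1_svd(rng.normal(size=(100, 7)), rng.normal(size=100)))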

4 The Big Data Landscape

The appeal of modern-day computing is its flexibility to handle volumes of data through coordination and integration. Advancements in Big Data frameworks and technologies have allowed us to break the barriers of memory constraints for computing and implement a more scalable approach to employing methods and algorithms [2]. The aforementioned journal ranking scheme is one such algorithm which thrives under the improvements made to scalability in Big Data. With optimized additions such as Apache Spark to the distributed computing family, the enactment of l1 regularization and singular value decomposition has reached an all new height. Implementing the SVD algorithm with the help of Spark can improve not only spatial efficiency but temporal efficiency as well. The l1-norm SVD scheme utilizes the SVD and regularization implementations of the ARPACK and LAPACK libraries along with a cluster setup to enhance the speed of execution by at least three times, depending on the configuration. Collecting data is also a very important aspect of the Big Data topography. A cluster-based system is rendered useless without the requisite data to substantiate it. Scientometric data usually deals with properties of the journals such as total citations and self-citations. This data can be collected using Web scraping methodologies, but it is also made available for open use by most journal ranking organizations, such as SCOPUS and SCIMAGO. For the l1-norm SVD scheme, we used SCOPUS as it had an eclectic set of features which were deemed appropriate to showcase the effectiveness of the algorithm. The inclusion of two important factors, the CiteScore and SJR indicators, gave a better enhancement over considering just one or the other. For more information about the data and code used to develop this algorithm, please refer to [8], the GitHub repository of the project.
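As an illustration of the cluster-side computation, a distributed SVD can be obtained through the RowMatrix API of pyspark.mllib (which delegates to ARPACK/LAPACK-backed routines); the tiny in-memory matrix below is only a placeholder for the real scraped data.

from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

spark = SparkSession.builder.appName("l1-svd-sketch").getOrCreate()
rows = spark.sparkContext.parallelize([
    Vectors.dense([1.0, 2.0, 3.0]),
    Vectors.dense([4.0, 5.0, 6.0]),
    Vectors.dense([7.0, 8.0, 9.0]),
])
mat = RowMatrix(rows)
svd = mat.computeSVD(2, computeU=True)   # top-2 singular values/vectors
print(svd.s)                              # singular values
print(svd.V)                              # right singular vectors (local matrix)
spark.stop()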

4.1 Case Study: Astronomy and Computing

SCOPUS and SCIMAGO hold some of the best journal ranking systems to this day, using their CiteScore and SJR indicators, respectively, to rank journals. However, due to the manner in which both these indicators are considered, it is often the case that the ranking might not display the true potential of a specific scientific journal. To demonstrate this, we considered the case of the journal Astronomy and Computing within the context of SCOPUS journals in the relevant domain of astronomy and astrophysics. The primary focus of this case study is to determine where the journal Astronomy and Computing stands with respect to other journals which were established prior to it. The algorithm also tests the validity of the ranking and suggests an alternative rank which uses a more holistic approach toward the features (Table 1).

Using the publicly available SCOPUS dataset, we implemented the aforementioned l1-norm SVD scheme to rank all its corresponding journals and simultaneously determine the potency of the algorithm. SCOPUS contains approximately 46 k journals listed in different domains. Discarding a few redundancies, SCOPUS effectively covers a large range of metrics and provides adequate resources for verification. For this demonstration, we have considered seven different SCOPUS metrics to be used as features in our algorithm: Citation Count, Scholarly Output, SNIP, SJR, CiteScore, Percentile and Percent Cited. To cross-verify the results of the algorithm, they were compared to the SJR-based ranking of SCIMAGO to articulate the discrepancies. The l1-norm SVD scheme worked brilliantly in rating the journals and approached the data in a more wholesome sense. The result was a ranking system which ranked Astronomy and Computing much higher than most older journals, while at the same time highlighting the niche prominence of this particular journal. Similarly, this method also highlighted the rise of other journals which were underrepresented due to the usage of the aforementioned SCOPUS and SCIMAGO indicators. This method was largely successful in rectifying the rank of such journals. The l1-norm SVD scheme can be extrapolated to other data entries as well. It can also be used to study the impact of individual articles: utilizing similar features such as total citations, self-citations, and NLIQ, the algorithm can be used to rank articles within a journal with great accuracy along with a holistic consideration.

Table 1 Case study: Astronomy and Computing, SJR, and L1-SVD ranks

Journal Name                                 | L1 scheme rank | SJR-based rank | Year
Astronomy and Computing                      | 39             | 31             | 2013
Astronomy and Astrophysics Review            | 40             | 5              | 1999
Radiophysics and Quantum Electronics         | 41             | 51             | 1969
Solar System Research                        | 42             | 48             | 1999
Living Reviews in Solar Physics              | 43             | 3              | 2005
Astrophysical Bulletin                       | 44             | 45             | 2010
Journal of Astrophysics and Astronomy        | 45             | 55             | 1999
Revista Mexicana de Astronomia y Astrofisica | 46             | 23             | 1999
Acta Astronomica                             | 47             | 20             | 1999
Journal of the Korean Astronomical Society   | 48             | 32             | 2009
Cosmic Research                              | 49             | 58             | 1968
Geophysical and Astrophysical Fluid Dynamics | 50             | 46             | 1999
New Astronomy Reviews                        | 51             | 12             | 1999
Kinematics and Physics of Celestial Bodies   | 52             | 65             | 2009
Astronomy and Geophysics                     | 53             | 67             | 1996
Chinese Astronomy and Astrophysics           | 54             | 72             | 1981

4.2 Contrasting Performances of l1 and l2 Norms

Being recursive in nature, the norm-based algorithms are subject to some lapses when parallelizing their execution. However, they can be improved by choosing the right kind of norm to enhance the running time. The decision to use the l1-norm over the l2-norm was a pragmatic choice for the following recursive scheme. The l1-norm’s use of a loss function, as opposed to the l2-norm’s squared-data approach, proves to be significantly better in structuring the data for a high-density computation. This type of method allows the overall dataset to be reduced to a row matrix the size of the smallest dimension of the original data. This gives the added benefit of a very consistent execution time, which scales accordingly with the increase in data size. The execution times mentioned in Table 2 give the time-based performance of the different norms; the difference will only become more significant with the increase in the size of the rows.

Table 2 Performance time for a row matrix of size 46 k

Norm    | Time per row (s)
l1 norm | 0.172
l2 norm | 0.188

This shortfall in parallelization can be compensated by the expected speed increase in the execution of the l1-norm and SVD routines in a cluster setup. Optimized settings like Apache Spark, which uses the aforementioned LAPACK and ARPACK libraries, are able to boost the speed even further. The biggest benefit of opting for such Big Data settings is that by increasing the size of the cluster, the overall speed of the algorithm also scales appropriately. Table 3 indicates the performance time for the SVD algorithms in different ecosystems.

Table 3 Performance time for SVD of size 100 k × 100 k

Data framework | Overall time
Python         | 2 h+
R              | 58 min
L1 SVD         | 15 min

The usage of the SVD function in the algorithm, to determine the individual singular values of the reduced row matrices of the columns, can also be enhanced by using the corresponding eigenvalue optimizations which are usually provided within the Big Data environment. Algorithms such as the Lanczos algorithm can not only enhance the speed of the operation but can also be very easily parallelized. Hence, this combination of the l1-norm and SVD can effectively make the best version of the algorithm, being fast in execution while at the same time delivering a holistic approach.
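For readers who want to repeat a Table 2-style measurement, a rough NumPy/timeit template is given below; the absolute numbers depend entirely on hardware and library builds and will not reproduce the figures above.

import timeit
import numpy as np

# Time the l1 and l2 norm of a single 46k-element row, averaged over 1000 repetitions.
row = np.random.default_rng(4).normal(size=46_000)
t_l1 = timeit.timeit(lambda: np.abs(row).sum(), number=1_000)
t_l2 = timeit.timeit(lambda: np.sqrt((row ** 2).sum()), number=1_000)
print(f"l1: {t_l1 / 1_000:.2e} s/row   l2: {t_l2 / 1_000:.2e} s/row")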

5 Knowledge Discovery from Big Data Computing: The Evolution of ASCOM

Even though Astronomy and Computing (ASCOM) has been in publication for only five years, its reputation has grown quickly, as seen from the ranking system proposed here. This is despite the fact that ASCOM is severely handicapped in size. ASCOM is ranked 39 according to our method, slightly lower than its rank of 31 in SCOPUS. This is due to the fact that we have not used “citations from more prestigious journals” as a feature. Nonetheless, it is ranked higher than many of its peers which have been in publication for over 20 years. This is also due to the fact that ASCOM is “one of its kind” and uniquely positioned in the scientific space, shepherded by top-notch editors. Such a qualitative feature, regrettably, is not visible from the Big Data landscape. There is another interesting observation to take note of. By ignoring the “size does matter” paradigm, the ranks of some journals (many years in publication with proportionate volumes and issues) suffered. A few examples include Living Reviews in Solar Physics, ranked 43 according to our scheme while it is ranked 3 in SCOPUS, and Astronomy and Astrophysics Review, ranked 40 in our scheme while it is ranked 5 according to SCOPUS. This is important, as our goal was to investigate the standing of a journal that is relatively new and in a niche area. It indicates that years in publication may sometimes dominate over other quality indicators and may not capture the growth of journals in “short time windows”. Our study also reveals that ASCOM is indeed a quality journal as far as early promise is concerned (Fig. 1).

Fig. 1 l1 Rank progression of ASCOM based on SCOPUS data computed by the proposed method. The steady ascendancy in the journal’s rank is unmistakable. It will be interesting to investigate the behavior of the journal rank in the long run once enough data is gathered

6 Conclusion

The Big Data abode adds a new dimension to the already existing domain of machine learning, where the computational aspect is as important as the algorithmic and operational facets. The l1-norm SVD scheme does just that, and it introduces a brand new way of ranking data by considering all the features in their entirety. The added benefit of optimizing the required norms and methodologies in terms of a Big Data domain suggests its vast flexibility in the area of Big Data mining. This article covered its application in the scientometric domain. However, it can be extended to any type of data, provided that the nuances are well understood. The aforementioned recursive methodology of the scheme allows us to carefully consider the important features of the dataset and make prudent decisions based on the outcome of an iteration. This allows us to take a more wholesome approach, very similar to the PageRank algorithm, which gives a specific importance to each one of the features under computation. In the context of scientometrics, this scheme is also applicable as a way to rank specific articles in a given journal, provided that their respective scholastic indices are available. We can conduct similar data experiments using indicators like total citations, self-citations, etc., to categorize articles by the various features available for them. We have also done some extensive studies based on the scholastic indices of the ACM journals, whose case study lies outside the scope of this article, and were able to successfully rank the corresponding journals and articles. The scheme proved to be successful in evaluating the parameters with their nuances intact. More often than not, many scientometric indicators do not apply to the journal being evaluated. As a consequence of this, the data matrix in which the rows represent the articles in the journal and the columns represent the different evaluation metrics is clearly sparse. Exploiting this sparsity, using certain structural-sparsity-inducing norms and applying recursive singular value decomposition to eliminate metrics can make the process more efficient. Sparse approximation is ideal in such cases because, although the data is represented as a matrix in a high-dimensional space, it can actually be captured in some lower-dimensional subspace due to its sparsity. With the ever-expanding necessity to process voluminous amounts of data, there is a need to provide solutions which can adapt to the fluctuating technological climate. The l1-norm SVD scheme tries to achieve such potency, and the usage of norm-based dimensionality reduction enhances the overall efficiency of how we interpret data. The usage of techniques like sparsity norms suppresses outliers and highlights only the most meaningful data in store. The evolution of such methods will prove to be an absolute prerequisite in the future for computing copious amounts of data. Moving forward, dimensionality reduction-based techniques will become the foundation of salient data identification, and the l1-norm SVD scheme is a step in that direction.

Acknowledgements We would like to thank the Science and Engineering Research Board (SERB), Department of Science and Technology (DST), Government of India, for supporting our research. The project reference number is: SERB-EMR/2016/005687.


References

1. G. Ginde, S. Saha, A. Mathur, S. Venkatagiri, S. Vadakkepat, A. Narasimhamurthy, B.S. Daya Sagar, ScientoBASE: a framework and model for computing scholastic indicators of non-local influence of journals via native data acquisition algorithms. J. Scientomet. 107(1), 1–51 (2016)
2. G. Ginde, R. Aedula, S. Saha, A. Mathur, S.R. Dey, G.S. Sampatrao, B. Sagar, Big data acquisition, preparation and analysis using apache software foundation projects, in Big Data Analytics, ed. by A. Somani, G. Deka (Chapman and Hall/CRC, New York, 2017)
3. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins University, Baltimore, MD, 2012)
4. D. Kalman, A singularly valuable decomposition: the SVD of a matrix. College Math. J. 27, 2–23 (1996)
5. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A.H. Byers, Big Data: The Next Frontier for Innovation, Competition, and Productivity (McKinsey Global Institute, 2011)
6. F. Bach, R. Jenatton, J. Mairal, G. Obozinski, Structured sparsity through convex optimization. Stat. Sci. 27, 450–468 (2011)
7. S. Ramin, A.S. Shirazi, Comparison between impact factor, SCImago journal rank indicator and eigenfactor score of nuclear medicine journals. Nucl. Med. Rev. (2012)
8. R. Aedula, rahul-aedula95/L1_Norm. https://github.com/rahul-aedula95/L1_Norm
9. K. Bora, S. Saha, S. Agrawal, M. Safonova, S. Routh, A. Narasimhamurthy, CD-HPF: new habitability score via data analytic modeling. Astronomy and Computing 17, 129–143 (2016)

Human Annotation and Emotion Recognition for Counseling System with Cloud Environment Using Deep Learning K. Arun Kumar, Mani Koushik, and Thangavel Senthil Kumar

Abstract As we progress in making computers understand humans, we as human beings have made a great advance in human–computer interaction. As a part of this research work, we have leveraged existing deep learning architectures to develop an artificially intelligent system that helps understand the facial emotion of a person undergoing a feedback session with carefully picked questions that help in identifying the deviation of the observed facial emotion from the expected facial emotion. This knowledge is used to filter out people who are in real need of counseling. In this project, we have made a comparison study on CNNs with differing depths [22] to find the architecture best suited to recognizing human facial emotion for the counseling system. The workflow starts with the detection of a face, proceeds with face recognition and annotation using the FaceNet architecture, and moves on to recognizing the facial emotion of the person using a trained convolutional neural network model. The models are trained on FER2013 for emotion recognition and on manually collected images for face recognition (FaceNet). The trained models are hosted on the server side and are connected to the client workstation through an intermediate proxy server to share connectivity over the LAN.

Keywords Face recognition · Emotion recognition · CNN · Counseling system

K. Arun Kumar · M. Koushik · T. Senthil Kumar (B) Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India e-mail: [email protected] K. Arun Kumar e-mail: [email protected] M. Koushik e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_3


1 Introduction

Humans have always had an innate ability to recognize and distinguish between faces. Now, computers are able to do the same, which opens up a multitude of applications. Facial expressions give us lots of useful information even when humans try to hide their emotions. Being able to interpret non-verbal communication has been the greatest task for professional counselors. With the ever-increasing demand for professional counseling, counselors find it challenging to prioritize people's need for counseling. Counselors oftentimes will not be able to predict the client's problem until they interact with the client directly. These limitations can be resolved by an artificially intelligent system that can filter out the people who are in real need of counseling through a well-curated questionnaire that clients can answer. The client's reaction to each question is noted, which helps in identifying the deviation of the observed emotion from the expected facial expression. This way a professional counselor is able to assess the current mental situation of their client, and it will also help them identify the key areas to focus on. In this research work, we present an approach based on convolutional neural networks (CNN) for both face recognition and facial emotion recognition [23]. The Web application hosts questions for the user to respond to, and the user's face is captured and sent to the server hosting the model, which takes in the face as input, recognizes the face based on the dataset it is trained on, annotates the person, and then predicts the facial emotion of the user. The input data is preprocessed using feature engineering techniques to reduce noise and perform illumination correction. The manually collected dataset for face recognition is converted to grayscale images to reduce the complexity involved in the prediction process. The model classifies the input images into the following seven classes: "Angry," "Disgust," "Fear," "Happy," "Sad," "Surprise," "Neutral." The rest of the paper is structured as follows. Section 2 describes the related work that other researchers in this field have contributed. Section 3 introduces various algorithms that have generally been used by others. Section 4 describes the dataset that we have used in order to train the model efficiently. Section 5 explains the model that we have developed using the FaceNet architecture for face recognition and annotation, and how the CNN architecture works to make an effective model that predicts the facial emotion from the input image. Section 6 then infers the results that we have achieved after implementation. Section 7 consists of the conclusion. Finally, we present the sources that we have referenced for our research.
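To make the emotion-classification stage concrete, a small Keras CNN of the kind described (seven output classes, 48 × 48 grayscale input as in FER2013) might look as follows; this is an illustrative architecture, not the exact model trained in this work.

from tensorflow import keras
from tensorflow.keras import layers

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

# Illustrative small CNN for 7-class emotion recognition on 48x48 grayscale crops.
model = keras.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),
    layers.Dense(len(EMOTIONS), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30)  # on FER2013 crops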

2 Related Work

In recent years, researchers have made considerable progress in developing face detection, annotation of the recognized face, and automatic expression classifiers.


2.1 Face Recognition and Annotation

Schroff et al. [1]: This paper presents a method that measures facial similarity between different pictures using a compact Euclidean space. The distances calculated from the mapping of facial images correlate with the similarity of the images, and these similarities help in facial recognition, verification, and clustering using feature vectors generated by standard FaceNet embeddings. FaceNet directly optimizes the embeddings using deep convolutional neural networks instead of optimizing an intermediate bottleneck layer. The approach uses a triplet mining method to train on roughly aligned matching or non-matching face patches. This method uses only 128 bytes per face and achieves state-of-the-art face recognition performance.

Taigman et al. [2]: There are four stages in performing face recognition; the conventional model goes through the detect, align, represent, and finally classify phases. The approach revisits this conventional pipeline by obtaining the face representation from a nine-layer neural network and explicitly employing a 3D model of the face for alignment. The network uses non-conventional convolution layers involving more than 120 million parameters. The representations learned with model-based alignment work remarkably well on challenging facial datasets even with a simple classifier.

Sun et al. [3]: In this paper, the authors posit that face identification and verification can be used as supervisory signals to handle intrapersonal and interpersonal variation. Using deep convolutional networks, the DeepID2 (Deep IDentification-verification) features are learned. By increasing interpersonal variation and reducing intrapersonal variation with the DeepID2 features, the model generalizes to identities unseen in the training dataset, which is essential when performing face recognition on untrained datasets.

2.2 Emotion Recognition

Correa et al. [4]: This research work presents a system capable of emotion recognition through facial expressions. Three neural network architectures were subjected to various classification tasks, and the best performing network was further optimized. The architectures were validated based on their accuracy in percentage.

Pramerdorfer [5]: This research work reviews image-based facial expression recognition using CNNs and highlights algorithmic differences and their performance impact. By identifying and overcoming one of the bottlenecks, the model was able to obtain a test accuracy of 75.2% on the FER2013 dataset by forming an ensemble of modern deep CNNs.

Alizadeh and Fazel [6]: This research work classifies facial emotion into seven different categories. By training existing CNN models of different depths and combining raw pixel data with HOG features, a novel CNN model was trained.


Arriaga et al. [7]: This research work implements a general convolutional neural network (CNN) to classify the emotion and gender of a person from facial details. A real-time enabled guided back-propagation visualization technique is used in order to uncover the dynamics of the weight changes and evaluate the learned features. The visualization of previously hidden features helps bridge the gap between slow performance and real-time architectures.

2.3 Counseling

Yang [8]: This research work analyzes the effect of psychotherapy on college students. A comparison study between different tools, based on measurements before and after treatment, is recorded. The method used for identifying the mental health of college students is imagery communication.

Shinozaki et al. [9]: The work discusses a counseling agent that could replace an actual counselor. The proposed approach is to create a domain-specific agent with which people can connect on a more empathetic level, and which is useful without requiring in-depth knowledge.

Kim et al. [10]: This paper proposes an intelligent approach to replace a conventional attendance system, identify the emotion of a person in the class, and increase interaction among students by providing the necessary counseling.

Wang et al. [11]: The proposed approach discusses the development of peer counseling through a Web application, where the user is assigned a peer helper who helps in resolving issues, rather than conventional face-to-face counseling.

Yun and Yuan [12]: This research work discusses a Web counseling system that people can use through multiple channels: visitor entry and personal entry. The visitor entry can address topics ranging from "Mental Health Knowledge" and "Developmental Knowledge" to "Case Introduction," while the latter covers the "Special Counseling Platform" and "In-time Counseling."

Hendra and Kusumadewi [13]: This research addresses a counseling system for college students using case-based reasoning. The proposed approach begins by analyzing the case-based reasoning, and the flow continues with data retrieval, data analysis, and the case-based reasoning model. This approach can be used to identify the problems that students are facing by changing different parameters and updating the knowledge base for further detection.

Horii et al. [14]: This work presents a comparison study between the performance of the virtual counseling agent (VCA) proposed by the authors and the already existing ELIZA and CRECA. The results show that the proposed approach performs much better than ELIZA and is more generalized in its responses rather than context narrowing. The VCA has many features, from using the Google cloud audio API to "independent counseling content fields."


3 Algorithms Used for Face Recognition and Facial Emotion Recognition

3.1 Algorithms Used for Face Recognition

3.1.1 DeepFace

DeepFace [15] algorithm uses a convolutional neural network with nine layers. The face processing is done using a 3D alignment of the face. The accuracy of this model on Labeled Faces in the Wild (LFW) was 97.35%. This was the first approach to use only nine layers and achieve great accuracy.

3.1.2 FaceNet

FaceNet [1] algorithm uses a triplet mining method to train more or less aligned matching or non-matching face patches. This method uses only 128 bytes per face and achieves state-of-the-art face recognition performance. It directly optimizes the embeddings using deep convolutional neural networks instead of optimizing the intermediate bottleneck layer. The accuracy of this model on a private dataset was 99.63%.

3.1.3 VGGFace

VGGFace [16] algorithm uses triplet loss functions like FaceNet and was trained on multiple large-scale datasets that were collected from the Internet. The model produced an accuracy of 98.95% on the different datasets that were tested.

3.1.4 SphereFace

SphereFace [17] algorithm learns discriminative features of the face with an angular margin using the angular softmax loss (A-Softmax). The proposed architecture is a 64-layer ResNet, which was able to produce an accuracy of 99.42% on multiple publicly available datasets.


3.2 Algorithm Used for Facial Emotion Recognition

3.2.1 Fully Convolutional Neural Network

The FCNN uses a rectified linear unit [18] as an activation function on top of the nine convolutional layers. The performance and stability of the neural network are improved using batch normalization [19] by adjusting and scaling the activations on the input layer. The total number of parameters in this model is approximately 500,000, and this approach achieved an accuracy of 66% on the FER2013 dataset. The dataset contains 35,887 grayscale images where each image belongs to one of the following classes {"angry," "disgust," "fear," "happy," "sad," "surprise," "neutral"}.

3.2.2 Xception

This approach is a combination of residual modules [20] and depth-wise separable convolutions [21]. Residual modules are used to transfer the memory components from the initial layers to the final layers. They are also used to modify the mapping between different layers to ensure the features are learned. Depth-wise separable convolutions ensure that the model can be deployed without much loss in accuracy and works well with fewer parameters.

4 Dataset Description

4.1 Labeled Faces in the Wild

Almost all algorithms developed for facial recognition are tested on the Labeled Faces in the Wild (LFW) dataset because of its robust collection of faces captured under challenging conditions such as varied lighting and occlusion. This dataset is used for one-to-one mapping of faces for recognition. Faces were collected from different sources to create the dataset, which so far contains more than 13,000 images. Every face in the dataset is labeled with the name of the person in the image, and more than 1600 people have multiple distinct images in the dataset. Different sets of the same images are included using different types of alignment; the funneled images give remarkable accuracy compared to non-funneled images with most face verification algorithms. An additional 300 manually collected images were used to validate the accuracy of the model in real time. The dataset used for training the facial emotion recognition model is FER2013. It consists of a total of 28,709 grayscale images, of which 3850 are chosen for the validation process and 3850 for the testing process, each of


which belongs to one of the following classes: "Angry," "Disgust," "Fear," "Happy," "Sad," "Surprise," "Neutral."

Sample questions from the question corpus and their associated expected emotions:

Sample question | Expected emotion
Describe the most irritating person you've known in your life | Angry
Have you ever touched someone after picking your nose without washing your hands? | Disgust
What do you fear the most? | Fear
What is the best moment you ever shared with your best friend? | Happy
What worries you most about the future? | Sad
If you had ten seconds for a wish, what would you ask? | Surprise
Did you have breakfast this morning? | Neutral

5 Proposed Methodology

The proposed methodology extracts features from the face in an image or video and converts them into vectors of 128 dimensions. Converting an image into such a vector is called a face embedding. The model used for training is a CNN that makes use of the triplet loss function. The approach calculates the distance between vectors to identify the similarity between images or video frames: the lower the distance, the greater the similarity, and the higher the distance, the lower the similarity. The distances are calculated using the Euclidean method and are then used for validation of the photographs. Once the face embeddings are obtained, a classification algorithm is needed to classify the images from these embeddings. The best-suited algorithm for this classification is a support vector machine, which classifies the face embeddings into the different identities. Once the face is identified, the next step is to identify the emotion from the facial expression. For this, we use convolutional neural networks. A typical convolutional neural network has multiple layers, including convolutional layers, pooling layers, fully connected layers, and hidden layers. The model produces several activation features based on the different sets of learnable filters applied to the image. The convolution operation has numerous advantages: the correlation among adjoining pixels of an image is learned through local connectivity, and the total number of parameters to be learned can be reduced by using a weight-sharing mechanism in the feature map. The next layer in the pipeline is the pooling layer, which reduces the size of the feature map and the computational cost of the network. The commonly used down-sampling strategies are average pooling and max pooling, which are nonlinear. The final layer in the pipeline is the fully connected layer. This layer is used to make sure that all the previous layers are connected entirely and the


final output is converted to one-dimensional feature maps for feature representation and classification.

Architecture Diagram
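As a concrete illustration of the face-recognition step described above, the following is a minimal sketch, not the authors' code: it assumes a hypothetical get_embedding(image) helper that returns the 128-dimensional FaceNet embedding of an aligned face crop, compares embeddings by Euclidean distance, and fits an SVM classifier over the embeddings.

```python
# Minimal sketch of the face-recognition step: Euclidean comparison of
# 128-D embeddings and an SVM fitted over them. get_embedding() is a
# hypothetical helper returning the FaceNet embedding of an aligned face.
import numpy as np
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

def euclidean_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    # Lower distance means the two faces are more likely the same identity.
    return float(np.linalg.norm(emb_a - emb_b))

def train_identity_classifier(embeddings: np.ndarray, labels: np.ndarray) -> SVC:
    # L2-normalize the embeddings, then fit a linear SVM over them,
    # mirroring the "SVM over face embeddings" classification step.
    embeddings = Normalizer(norm="l2").transform(embeddings)
    clf = SVC(kernel="linear", probability=True)
    clf.fit(embeddings, labels)
    return clf
```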

Trained model details:

Batch size | 32
No. of epochs | 10,000
No. of classes | 7
Learning rate | 0.01
Iterations | 900
Optimizer | ADAM
Patience | 50
Momentum | 0.9
Activation function | Rectified linear units (ReLU)
CNN model | MiniXception
Layers | Nine convolutional layers (two pooling layers)
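The hyperparameters in the table above could be wired together in Keras roughly as follows; this is a hedged sketch, not the authors' released code, and it assumes the MiniXception model and the preprocessed FER2013 arrays are built elsewhere.

```python
# Hedged sketch of the training configuration listed in the table above.
# The MiniXception model and the preprocessed data arrays are assumptions
# built elsewhere; only the optimizer, loss, batch size, epochs, and patience
# from the table are wired in here.
import tensorflow as tf

def compile_and_train(model, x_train, y_train, x_val, y_val):
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),  # learning rate 0.01
        loss="categorical_crossentropy",                          # 7 emotion classes
        metrics=["accuracy"],
    )
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=50, restore_best_weights=True  # patience 50
    )
    return model.fit(
        x_train, y_train,
        validation_data=(x_val, y_val),
        batch_size=32,        # batch size 32
        epochs=10_000,        # upper bound; early stopping usually ends sooner
        callbacks=[early_stop],
    )
```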

6 Results and Discussion

Accuracy-versus-epoch and loss-versus-epoch graphs were observed. As the number of epochs increases, the accuracy improvements become almost constant, and the loss decreases and likewise becomes almost constant. These observations help us identify the optimal number of epochs required for training the model.

The following is the observed normalized confusion matrix for emotion recognition using the proposed CNN architecture.


Sample output results: face recognition identified Person one as Koushik and Person two as Arun Kumar; the emotions predicted for the sample frames were happy, neutral, and surprise.

7 Conclusion

In this paper, we created a Web application that hosts a set of questions for people who are to take up a counseling session. The application records the person through the camera set up at the workstation and streams the video to the server hosting the service, where the trained deep learning model identifies and annotates the person. The facial emotion recognition service is triggered when the client starts with the first question. A consolidated analysis of the emotion classification is made based on the deviation of the observed facial emotion from the expected facial emotion.

References
1. F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in CVPR (2015), pp. 815–823
2. Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: closing the gap to human-level performance in face verification, in IEEE Conference on CVPR (2014)


3. Y. Sun, X. Wang, X. Tang, Deep learning face representation by joint identification-verification. CoRR, abs/1406.4773 (2014)
4. E. Correa, A. Jonker, M. Ozo, R. Stolk, Emotion recognition using deep convolutional neural networks. https://github.com/isseu/emotion-recognition-neural-networks/blob/master/paper/Report_NN.pdf (2016)
5. C. Pramerdorfer, Facial expression recognition using convolutional neural networks: state of the art, arXiv:1612.02903 (2016)
6. S. Alizadeh, A. Fazel, Convolutional neural networks for facial expression recognition, arXiv:1704.06756 (2017)
7. O. Arriaga, P.G. Plöger, M. Valdenegro, Real-time convolutional neural networks for emotion and gender classification, arXiv:1710.07557 (2017)
8. P. Yang, Effect of imagery communication psychotherapy-based group counseling on mental health and personality of college students (2012), pp. 800–803, https://doi.org/10.1109/ITiME.2012.6291424
9. T. Shinozaki, Y. Yamamoto, K. Takada, S. Tsuruta, Context-based reflection support counseling agent, in Proc. of the SITIS (2012), pp. 619–628
10. Y.-D. Kim, S. Kwon, J.K. Kim, W.Y. Jung, Customized attendance system for students' sensibility monitoring and counseling, in IIAI 4th International Congress on Advanced Applied Informatics, Okayama (2015), pp. 703–704
11. Z. Wang, Y. Chi, H. Chen, R. Xin, Web peer counseling system, in International Conference on Educational and Information Technology, Chongqing (2010), pp. V1-535–V1-537
12. Z. Yun, T. Yuan, The application of Web in mental counseling for college students, in 2010 Second International Conference on MultiMedia and Information Technology (2010), pp. 194–197
13. S. Hendra, S. Kusumadewi, Case-based system model for counseling students, in ICSITech, Yogyakarta (2015), pp. 213–218
14. T. Horii, Y. Sakurai, E. Sakurai, S. Tsuruta, A. Kutics, A. Nakagawa, Performance comparison of client-centered counseling agents, in SITIS (2018), pp. 601–608
15. Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: closing the gap to human-level performance in face verification, in CVPR (2014), pp. 1701–1708
16. O.M. Parkhi, A. Vedaldi, A. Zisserman, et al., Deep face recognition, in BMVC, vol. 1 (2015), p. 6
17. W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, L. Song, SphereFace: deep hypersphere embedding for face recognition, in CVPR, vol. 1 (2017)
18. X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (2011), pp. 315–323
19. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in International Conference on Machine Learning (2015), pp. 448–456
20. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778
21. A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861 (2017)
22. K.S. Gautam, T. Senthil Kumar, Video analytics-based intelligent surveillance system for smart buildings. Springer Soft Comput., 2813–2837 (2019)
23. K.S. Gautam, T. Senthil Kumar, Video analytics-based facial emotion recognition system for smart buildings. Int. J. Comput. Appl., 1–10 (2019) (Taylor and Francis)

Enhancing Intricate Details of Ultrasound PCOD Scan Images Using Tailored Anisotropic Diffusion Filter (TADF)

Suganya Ramamoorthy, Thangavel Senthil Kumar, S. Md. Mansoorroomi, and B. Premnath

Abstract Polycystic ovary disorder (PCOD) is a prevalent reproductive and endocrine disorder found in 6–10% of the female population. PCOD causes infertility in women if not detected early. PCOD is analyzed using ultrasound B-scan modality images, a low-cost and noninvasive modality, and one of the challenging tasks in analyzing them is the speckle noise reduction problem. To eradicate unwanted noise in the B-scan image, conventional methods, namely Gabor, Laplacian, median, and curvelet filters, have been adopted. These methods share a drawback: they cannot preserve the fragile details that hold the minute features of cysts. To overcome this problem, a new speckle reduction method, the tailored anisotropic diffusion filter (TADF), is proposed. The key purpose of this research work is to enhance intricate details of ultrasound PCOD scan images by using TADF. In the proposed method, a diffusivity function and a global gradient threshold are applied in all four directions of an image to eliminate noise while preserving intricate features. The approach has been tested on 120 test images out of the 400 images in a database. The PSNR and CNR parameters are evaluated and compared with existing speckle-reducing methods. The experimental outcomes indicate that the proposed TADF can successfully diminish speckle while enhancing image boundaries, preserving intricate types like benign PCOD, chocolate polycystics, and lesions at their early stages.

S. Ramamoorthy · B. Premnath Department of IT, Thiagarajar College of Engineering, Madurai 625015, India e-mail: [email protected] B. Premnath e-mail: [email protected] T. Senthil Kumar (B) Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India e-mail: [email protected] S. Md. Mansoorroomi Department of ECE, Thiagarajar College of Engineering, Madurai 625015, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_4


Keywords PCOD · Tailored anisotropic diffusion filter · Speckle noise · Ultrasound images · Global threshold

1 Introduction

The medical industry is one of the most prominent fields where engineering research can play a huge role. Much care should be taken in diagnosing medical images; otherwise, it leads to wrong treatment and endangers human life. Misdiagnosis is not related to research expertise or engineering practice, but rather to the inherent difficulty of analyzing medical images. With the rapid development of medical image acquisition and storage techniques, understanding the various modalities of medical images has become a demanding and challenging task for engineers and researchers. Medical datasets and radiology images grow abundantly each day, and each modality (CT/MRI/ultrasound/X-ray) occupies considerable storage space. So, basically two challenges are involved in handling medical images: (1) storage and (2) processing. Nowadays, storage is not a major concern, but processing medical images involves five phases, namely image preprocessing/noise removal, image registration/overlapping, image segmentation, image classification, and image retrieval. Of these five phases, noise removal is a major task for any kind of medical image modality.

Speckle noise is a vital problem in B-scan images and is modeled as a multiplicative signal; the signal obtained from the acquisition device and the noise statistically influence each other, and the noise raises the gray level of otherwise similar or identical regions. The unwanted noise called speckle is a rough pattern produced by the positive and negative interference of scattered signals reflected from inhomogeneous portions of the human body during image acquisition [1]. This unwanted noise further reduces picture contrast, creating vagueness in medical details. Hence, dealing with unwanted signal noise in B-scan images is a decisive and challenging job in the medical imaging arena.

Polycystic ovary disorder is a prevalent reproductive and endocrine disorder occurring in 6–10% of the female population. The size and count of follicles in each ovary play a major role in every woman; the occurrence of 11 or more tiny (immature) follicles, each about 2.3–9.6 mm in size and occupying more space in the ovary, is called PCOD. There are different types of polycystic ovaries, namely chocolate cyst, cyst, lesion masses, multiple follicles, and immature follicles, in terms of size and growth. The reasons for PCOD at an early stage include lack of awareness, food habits, stress, social media, culture, and, above all, lack of hygiene. PCOD causes infertility among women, which has given rise to many fertility research centers and hospitals in all big cities; care should therefore be taken of women's health to avoid PCOD. Even though several medical image modalities, such as CT and MRI, are available, the ultrasound B-scan modality is a competent aid for concluding PCOD in an effective manner. Capturing a B-scan image of the uterus again involves multiple challenges: the image acquisition devices used for capturing B-scan images involve high sound and laser


beam interference problems. Sometimes, this unwanted noise hides the cyst or the small immature follicles from investigation, and the images hold additional noise due to various factors involved in the capturing instruments. Many radiology centers use a transducer driving a 3–9 MHz phased array, which causes little harm to the human body. Conventional noise removal methods therefore find it hard to eradicate the speckle present in B-scan images. Generally, noise degrades the quality of B-scan mode pictures, reducing the physician's ability to distinguish the intricate details needed for investigation. Only a limited number of research works have been carried out on PCOD.

2 Background Work

The World Health Organization (WHO) estimates that PCOD had affected approximately 120 million women (4.4%) worldwide as of 2012. Internationally, the reported prevalence of PCOD is extremely variable, ranging from 2.3% to as high as 25%. In India, professionals estimate that about 10% of young girls are affected by PCOD, and so far no reliable published statistics on the occurrence of PCOD in India are available. A few statistical investigations of the prevalence of PCOD across India and foreign countries are shown in Table 1 (NIH [2]). Table 1 summarizes the statistical analysis of PCOD among the women population in Mumbai [3], Pondicherry, Andhra Pradesh, Kerala, and across south India, as surveyed and tabulated in [4]. The prevalence of PCOD in women across foreign locations such as the University of Alabama (USA), Spain, and Lesbos is also tabulated. At the same time, several researchers have carried out research on the early detection of PCOD through several image preprocessing techniques [2].

Table 1 Few statistical investigations of the dominance of PCOD across India and foreign countries

Dominance of PCOD across India:
Mumbai | 11.97% of women of reproductive age
Pondicherry | 11.76% (2014 study)
Andhra Pradesh | 9.13% among adolescents
South India | 36% among adolescent girls
Kerala | Male infertility, 2017 CIMAR study

Dominance of PCOD across foreign countries:
University of Alabama at Birmingham, USA | 400 samples, 6.6%
Caucasian women from Spain | 154 samples, 6.5%
Greek island Lesbos | 192 samples, 6.77%


Table 2 Literature review—speckle reduction methods

Author | Speckle reduction method
Gupta et al. [7] | Nonlinear estimator based on Bayesian theory
Abd-Elmoniem et al. [8] | Tensor-oriented nonlinear diffusion approach
Chen et al. [9] | Region-based spatial preprocessing approach
Djemal [10] | Minimization of total variation
Zhang et al. [11] | Laplacian-based triangular nonlinear approach and shock filter (LPNDSF)
Zhang et al. [12] | Multi-orient recognition for noise–signal partition in a nonlinear approach
Chao et al. [13] | Diffusion model incorporating both the local gradient and the gray-level variance
Suganya et al. [14] | Modified Laplacian pyramid nonlinear diffusion filter
Yeslem et al. [15] | Noise removal on B-scan modality images
Nair et al. [16] | A robust anisotropic filter with low arithmetic complexity for images

PCOD recognition is thus a significant research domain, and several investigators have explored solutions for it; a few of these are summarized in Table 2. Although a lot of work has been done on noise removal in B-scan images, the majority of the research in the literature relies on diffusion-based nonlinear filters, which are helpful for suppressing speckle. However, they fail to identify small intricate characteristics like multiple cysts and gash lesions, which can lead to unavoidable health issues. The literature in Table 2 shows that anisotropic nonlinear diffusion filters give superior improvement. In order to focus on delicate/intricate texture variations like lesions and multiple cysts and to recover boundary details for the B-scan modality, this work proposes the tailored anisotropic diffusion filter (TADF). The primary objective of this research work is to improve the intricate features of ultrasound PCOD scan images by using TADF. In the proposed method, a tailored diffusivity function and a global gradient threshold are computed in all four directions of the B-scan image to eliminate noise while retaining intricate details. The paper is organized as follows: introduction, literature survey, motivation, and the proposed TADF preprocessing methodology, followed by results and discussion, and finally the conclusion and limitations.


3 Proposed Work—Tailored Anisotropic Diffusion Filter (TADF)

3.1 Anisotropic Diffusion Filter

Image smoothing with edge preservation is a challenging task in ultrasound medical imaging. The nonlinear diffusion (Perona–Malik) filter [5, 17] is employed to eradicate speckle noise present in an image without removing important parts of the image, mainly boundaries, lines, and intricate portions, which carry significant information in medical images. Anthropometric data such as patient name, age, and gender present in an image are not under consideration. Mathematically, the filter is defined as

\frac{\partial I(x, y, t)}{\partial t} = \operatorname{div}\big(c(x, y, t)\,\nabla I\big) = \nabla c \cdot \nabla I + c(x, y, t)\,\Delta I \qquad (1)

Generally, c(x, y, t) is the diffusion coefficient, which controls the rate of diffusion; it is defined as a function of the image gradient and is what allows boundaries in the image to be preserved.

3.2 Tailored Anisotropic Diffusivity Approach

The subsequent phase of the research work is tailoring the level of the diffusivity function. It depends upon the nature of the modality images and the size and dimension of the image. An anisotropic diffusivity approach applied to every low-pass sheet of the Laplacian pyramid domain restrains noise while protecting boundaries [6]. Nonlinear diffusion filtering eliminates the speckle from a B-mode scan by adjusting the input image through a partial differential equation; Perona and Malik [5] stated the anisotropic diffusivity function as

\frac{\partial I(x, y, t)}{\partial t} = \nabla \cdot \big(c(x, y, t)\,\nabla I\big), \qquad I(t = 0) = I_0 \qquad (2)

where I(x, y, t) denotes the input image, t the time stamp, c(x, y, t) the diffusivity function, and I_0 the raw input image. For a linear diffusion method, the value of the diffusivity function c(x, y, t) is the same at all image locations. For the anisotropic approach, the diffusivity function in Eq. (3) decreases rapidly with the value of the local gradient [6]. Thus, nonlinear diffusion is diminished across texture areas with large gradient values and enhanced inside surface areas with lower gradient values:

C_2(\nabla I) = \exp\!\left(1 - \frac{\|\nabla I\|^{2}}{2\lambda^{2} + 1}\right) \qquad (3)


where λ is a constant, the global gradient threshold value; it plays a major role in deciding the level of smoothing in the anisotropic diffusivity approach. The two-dimensional anisotropic diffusion is discretized as

I_{i,j}^{n+1} = I_{i,j}^{n} + \Delta t \big[ c_{N}\big(\nabla_{N} I_{i,j}^{n}\big)\,\nabla_{N} I_{i,j}^{n} + c_{S}\big(\nabla_{S} I_{i,j}^{n}\big)\,\nabla_{S} I_{i,j}^{n} + c_{E}\big(\nabla_{E} I_{i,j}^{n}\big)\,\nabla_{E} I_{i,j}^{n} + c_{W}\big(\nabla_{W} I_{i,j}^{n}\big)\,\nabla_{W} I_{i,j}^{n} \big] \qquad (4)

where the subscripts N, S, E, and W denote the north, south, east, and west directions of the nearest-neighbour differences:

\nabla_{N} I_{i,j} = I_{i-1,j} - I_{i,j} \qquad (5)
\nabla_{S} I_{i,j} = I_{i+1,j} - I_{i,j} \qquad (6)
\nabla_{E} I_{i,j} = I_{i,j+1} - I_{i,j} \qquad (7)
\nabla_{W} I_{i,j} = I_{i,j-1} - I_{i,j} \qquad (8)

The gradient global value is measured mathematically by

\|\nabla I\| = 0.5 \times \sqrt{\|\nabla I_{N}\|^{2} + \|\nabla I_{S}\|^{2} + \|\nabla I_{W}\|^{2} + \|\nabla I_{E}\|^{2}} \qquad (10)

The proposed TADF computes a global threshold value at every subband using the tailored anisotropic diffusion with Gaussian equations. If the global gradient threshold is set high, the diffusivity approach behaves as a smoothing filter and the details of multiple cysts are lost as if they were speckle. If it is set low, fine details and boundaries are retrieved and preserved.
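For illustration, the discrete four-direction update of Eqs. (4)–(8) can be sketched in NumPy as below; this is an assumption-laden sketch that uses the classic exponential diffusivity (cf. Eq. (3)) and wrap-around borders, and it does not reproduce the paper's per-subband global-threshold selection.

```python
# Minimal NumPy sketch of the four-direction anisotropic diffusion update of
# Eqs. (4)-(8); the exact TADF threshold-selection rule is not reproduced.
import numpy as np

def anisotropic_diffusion(image, n_iter=15, dt=0.15, lam=30.0):
    """Four-direction Perona-Malik-style diffusion (cf. Eqs. (4)-(8))."""
    def c(g):
        # Gradient-dependent diffusivity (exponential form, cf. Eq. (3)):
        # large local gradients (edges) suppress smoothing.
        return np.exp(-(g / lam) ** 2)

    img = image.astype(np.float64)
    for _ in range(n_iter):
        # Nearest-neighbour differences in the four directions (Eqs. (5)-(8));
        # np.roll gives wrap-around borders, which is a simplification.
        grad_n = np.roll(img, 1, axis=0) - img
        grad_s = np.roll(img, -1, axis=0) - img
        grad_e = np.roll(img, -1, axis=1) - img
        grad_w = np.roll(img, 1, axis=1) - img
        # Explicit update of Eq. (4).
        img += dt * (c(grad_n) * grad_n + c(grad_s) * grad_s
                     + c(grad_e) * grad_e + c(grad_w) * grad_w)
    return img
```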

4 Experimental Results

The experimental dataset for the proposed work was collected from Meenakshi Mission Hospital, GEM Hospital, and Devaki Radiology Centre, Coimbatore, Tamil Nadu (Table 3). Noise removal is the major task in any medical imaging pipeline. In this paper, the TADF filter with global gradient threshold level 4 is used to eliminate speckle in a PCOD input, as shown in Fig. 1. Figure 2a, b shows PCOD scan images before and after preprocessing by TADF. From this experiment, it is noticed that TADF works directly on the raw ultrasound PCOD image dataset, and the outcome derived from TADF is very sharp in border preservation. Tables 4 and 5 present the assessment of contrast ratio and PSNR average values for the PCOD ultrasound input images using TADF along with existing nonlinear diffusion filters.

Table 3 Details of the dataset used for the proposed approach

Description | Value
Scanning mode | B-mode
Dataset | PCOD scan images
Transducer | 3–9 MHz phased array transducer
Image size | 800 × 600 pixels
Training set | 400 images
Testing set | 120 images

Fig. 1 Result of tailored anisotropic diffusion filter–global threshold level = 4

Fig. 2 a Before preprocessing. b After preprocessing


Table 4 Contrast (CNR) value ranges for TADF and conventional filters, by PCOD type

Filters | Endometriosis | Cystadenomas | PCOD | Dermoid cysts
Noisy | 6.11 | 3.34 | 4.50 | 6.10
ND | 7.53 | 7.01 | 6.07 | 7.09
SRAD | 7.41 | 6.63 | 8.02 | 9.14
LPND | 8.98 | 8.81 | 9.01 | 10.11
TADF | 9.02 | 11.31 | 10.81 | 11.09

Table 5 Peak signal-to-noise ratio (PSNR) ranges for TADF and other conventional filters, by PCOD type

Filters | Endometriosis | Cystadenomas | PCOD | Dermoid cysts
Noisy | 5.11 | 4.68 | 6.56 | 6.73
ND | 7.02 | 7.45 | 7.13 | 7.01
SRAD | 8.95 | 9.67 | 10.46 | 10.49
LPND | 10.31 | 11.19 | 11.16 | 13.98
TADF | 12.44 | 13.52 | 13.5 | 14.18

From Table 4, it can be concluded that the CNR values indicate that the proposed methodology provides better speckle suppression, protecting the outer boundary structure and improving the visibility of ultrasound scan images. For example, the CNR for the noisy PCOD image is 4.50, and ND and SRAD yield 6.07 and 8.02, respectively, for the tiny cyst, whereas TADF applied directly on the B-scan dataset gives a contrast value of 10.81, much higher than the conventional methods. Small intricate parts like multiple cysts are clearly identified through the TADF approach. Besides, TADF protects boundaries, maintaining minute structures while maximally eliminating noise during the preprocessing phase. Table 5 reports the peak signal-to-noise ratio and shows that TADF performs well compared with conventional techniques; PSNR is helpful for assessing edge preservation. The PSNR value for the noisy PCOD scan image is about 6.56. LPND produces better results than SRAD and ND, while TADF gives the highest PSNR values for PCOD. Thus, the proposed TADF method yields a better view of minute formations (i.e., delicate textures called lesions) in every image. The outcome of this project was examined by gynecologist Dr. Mahalakshmi Sivakumar, Meenakshi Mission Hospital, Madurai, who concluded that the proposed TADF removes only the speckle noise while leaving the intricate details of the images intact and preserving their edges.
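For reference, PSNR here is the standard peak signal-to-noise ratio; a minimal sketch of its computation is given below (the paper's exact CNR region definitions are not reproduced, and the 255 peak value is an assumption for 8-bit images).

```python
# Minimal sketch of the PSNR metric used to compare despeckling filters.
# PSNR follows the standard 10*log10(MAX^2 / MSE) form; the paper's CNR
# (contrast-to-noise ratio) region definitions are not reproduced here.
import numpy as np

def psnr(reference: np.ndarray, filtered: np.ndarray, max_value: float = 255.0) -> float:
    mse = np.mean((reference.astype(np.float64) - filtered.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_value ** 2) / mse)
```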


5 Conclusion

This approach efficiently eliminates the noise present in PCOD ultrasound images through TADF. A tailored diffusivity function and a global threshold are applied in all four directions of the input ultrasound scan image to eradicate noise while retaining intricate features. The TADF approach was tested on 120 test samples out of the 400 images in the database, and the results show that TADF can also be used to improve other modalities in the medical domain. The performance of the TADF approach was verified on the 400-image dataset by physician Dr. S. Mahalakshmi of MMH Hospital, Madurai, India, and the results were found to be about 98% speckle free. The accuracy of the noise reduction algorithm was assessed by the physician based on the CNR and PSNR results.

References
1. D. Dewailly et al., Diagnosis of polycystic ovary syndrome (PCOS): revisiting the threshold values of follicle count on ultrasound and of the serum AMH level for the definition of polycystic ovaries. 26(11), 3123–3129 (2017) (Oxford)
2. M.Y. Risvan, S. Suresh, K. Balagurusamy, Siddha elixir and aetiology of polycystic ovarian syndrome. Adv. Tech. Biol. Med. 5(4) (2017). ISSN 2379-1764
3. M. Kumara, R. Walavalkar, M. Shaikh, A. Harshal, Prevalence of polycystic ovary syndrome among women in Mumbai and association of its symptoms with work hours. Int. J. Innov. Res. Sci. Eng. Technol. 6(7) (2017)
4. R. VidyaBharathi, S. Swetha, J. Neerajaa, V. Madhavica, D.M. Janani, An epidemiological survey: effect of predisposing factors for PCOS in Indian urban and rural population. Middle East Fertil. Soc. J. 22(4), 313–316 (2017)
5. P. Perona, J. Malik, Scale-space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell. 12(7), 629–639 (1990)
6. P.J. Burt, E.A. Adelson, The Laplacian pyramid as a compact image code. IEEE Trans. Commun. 31(4), 532–540 (1983)
7. N. Gupta, M.N.S. Swamy, E. Plotkin, Despeckling of medical ultrasound images using data and rate adaptive lossy compression. IEEE Trans. Med. Imaging 24(6), 743–754 (2005)
8. K.Z. Abd-Elmoniem, A.M. Youssef, Y.M. Kadah, Real-time speckle reduction and coherence enhancement in ultrasound imaging via nonlinear anisotropic diffusion. IEEE Trans. Biomed. Eng. 49(9), 997–1014 (2002)
9. Y. Chen, R.M. Yin, P. Flynn, S. Broschat, Aggressive region growing for speckle reduction in ultrasound images. Pattern Recogn. Lett. 24(4–5), 677–691 (2003)
10. K. Djemal, Speckle reduction in ultrasound images by minimization of total variation, in IEEE International Conference on Image Processing, vol. 3 (2005), pp. 357–360
11. F. Zhang, Y.M. Yoo, L. Zhang, L.M. Koh, Y. Kim, Multiscale nonlinear diffusion and shock filter for ultrasound image enhancement, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2 (2006), pp. 1972–1977
12. F. Zhang, Y.M. Yoo, L.M. Koh, Y. Kim, Nonlinear diffusion in Laplacian pyramid domain for ultrasonic speckle reduction. IEEE Trans. Med. Imaging 26(2), 200–211 (2007)
13. S.-M. Chao, Y.-Z. Jungli, D.-M. Tsai, W.-Y. Chiu, W.-C. Li, Anisotropic diffusion-based detail preserving smoothing for image restoration, in 17th IEEE International Conference on Image Processing (2010), pp. 4145–4148


14. R. Suganya, S. Rajaram, D. Gandhi, An efficient method for speckle reduction in ultrasound liver images for e-health applications, in Distributed Computing and Internet Technology: 10th International Conference ICDCIT (2014), pp. 311–321
15. A.S. Yeslem Bin-Habtoor, S.S. Al-amri, Removal speckle noise from medical image using image processing techniques. Int. J. Comput. Sci. Inf. Technol. 7(1) (2016)
16. R.R. Nair, E. David, S. Rajagopal, A robust anisotropic diffusion filter with low arithmetic complexity for images. EURASIP J. Image Video Process. 1 (2019)
17. E.K. Barthelmess, R.K. Naiz, Polycystic ovary syndrome: current status and future perspective, vol. 6 (NIH Public Access, National Institutes of Health, 2015), pp. 101–119

Suganya Ramamoorthy has been an Associate Professor at Thiagarajar College of Engineering, Madurai, since 2006. She earned her doctorate in Information and Communication Engineering from Anna University, Chennai, in 2014. Her research concentrated on the role of classification and retrieval of ultrasound liver images using machine learning algorithms. Her areas of interest include Medical Image Processing, Big Data Analytics, Internet of Things, Theory of Computation, Compiler Design, and Software Engineering. She is a co-author of six book chapters in IGI Global Publishers. She has published 20 articles in peer-reviewed international journals and over 25 conference proceedings.

Dr. Thangavel Senthil Kumar is an Associate Professor at the Department of Computer Science and Engineering at Amrita School of Engineering and Amrita Vishwa Vidyapeetham, Coimbatore. He received doctorate in Information Communication Engineering from Anna University, Chennai. His areas of interest include Video Surveillance, Cloud Computing, Software Engineering, Video processing, Wireless Sensor Networks, Big Data Computing, Embedded Automation, and Deep learning. He is guiding scholars with Amrita Vishwa Vidyapeetham in the area of Video Analytics and Intrusion detection system. He is a reviewer for Elsevier Computers and Electrical Engineering Journal. He is associated with projects funded by Department of Science and Technology, IBM, Ministry of Tribal Affairs, IBM.

LSTM and GRU Deep Learning Architectures for Smoke Prediction System in Indoor Environment

S. Vejay Karthy, Thangavel Senthil Kumar, and Latha Parameswaran

Abstract In crowded places, it is difficult for people to escape during emergency situations, so the severity of smoke must be detected as early as possible, since smoke develops rapidly, and officials must be alerted about the situation immediately. In this paper, we present a method that makes use of camera feeds, taking images frame by frame to detect the presence and progress of smoke. We extract features of smoke from the images, and the features obtained are then fed into two models (LSTM and GRU). Both LSTM and GRU mainly use knowledge of previously obtained outputs over a period of time for prediction, rather than only the output of the most recent step as in other RNN models. We use these two models because the input data form a time series. We discuss the advantages and compare the accuracy of the results obtained by using both the LSTM and GRU methods. This system can be deployed in smart spaces to enable a quicker response from officials. Since we use cameras that are already installed in a building, the need for additional sensors is eliminated. Keywords LSTM · GRU · Smoke prediction · Smart spaces

S. Vejay Karthy · T. Senthil Kumar (B) · L. Parameswaran Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India e-mail: [email protected] S. Vejay Karthy e-mail: [email protected] L. Parameswaran e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_5

1 Introduction

Smoke and fire accidents happen due to various reasons. In situations like these, people are most likely to panic, and their immediate response would be to move out


of that place. So, alerting people about the situation at the earliest is very important. IoT devices primarily aim to reduce the burden and difficulties faced by us; their speciality is that they need not be carried by every person, as one device in an area can provide the required functionality. This has led to the concept of smart spaces. Smart spaces are places or environments that are made more secure, or easier and more efficient to access, with the help of IoT devices. A camera is one such device, found in many crowded places and wherever security is necessary. Since these cameras capture a live feed, they can be used to analyse the presence and severity of smoke in a place. Many researchers have produced remarkable results in smoke detection, which motivated us to design a model that can predict the severity of smoke in the near future. This would enable an alert signal to be sent to the nearby fire brigade so that they can arrive at the spot as soon as possible. We try to predict the severity, which is categorized as normal, high, and very high. For this, we make use of two deep learning algorithms, namely LSTM and GRU, both of which have been successful in prediction for time series data. Here, since the severity of smoke depends on its value in the previous second, the image sequence forms a time series. In this paper, we discuss the working of these models and how they can be used for this particular problem, and we compare the results of the predictions of both models.

2 Related Works

2.1 LSTM

The LSTM model has been used in predicting the next move of a person in crowded places. Plain LSTMs do not capture dependencies between multiple correlated sequences, so a new architecture was developed in which LSTMs are connected to the corresponding nearby sequences through a pooling layer, enabling the sharing of hidden states among LSTMs of spatially proximal sequences [1]. The data for this model were the previous movements of people, which form time series input data.

Recurrent neural networks (RNNs) have been used in automatic speech recognition (ASR) and have provided promising results, but they do not perform well in places where there is more noise. Results from the LSTM model developed in [2] provide significant improvements when compared to a simple RNN model. Since the audio data for speech recognition are time series data, LSTM was used.

In the area of precipitation nowcasting, the fully connected LSTM (FC-LSTM) has been extended to the convolutional LSTM (ConvLSTM) [3]. Here, all the inputs, cell outputs, hidden states, and gates of the ConvLSTM are 3D tensors with spatial dimensions, that is, rows and columns, as their last two dimensions. Results show


that ConvLSTM outperforms the FC-LSTM and ROVER algorithms. This model may be useful for spatiotemporal data.

For accurate wind speed prediction, researchers have used LSTM to model low-frequency sublayers [4]. An LSTM model with a single hidden layer was used for this prediction purpose, demonstrating the efficiency of LSTM; wind speed is a time series.

Various LSTM variants have been developed over recent years, including the coupled input and forget gate (CIFG), no forget gate (NFG), and no output activation function (NOAF). Their performance on different data sets has been analysed and summarized [5], showing that each variant of LSTM has its own merits and demerits depending on the data set on which it is used.

2.2 Smart Space

In recent years, terms such as digital city and intelligent city have also come into existence. The major differences between these widely used terms, from their origins, have been discussed [6, 7], which helps us get a clear idea of a smart space. Many aspects need to be taken into account when we design a device for any environment: explicit representation, context querying, and context reasoning play important roles in smart spaces [8]. A smart space can be improved with the introduction of the semantic web, which is then termed a semantic space. Mobile apps can also add to the advantage of devices in smart spaces. Meeting rooms in corporate offices can be turned into smart spaces [9]; apart from using technologies like video cameras and OpenCV software, an IoT-based approach may be implemented, which may improve efficiency and prove to be cost effective. Predictions about the movement of people in an environment are possible using sensors kept in different places in the environment [10]; LSTM models may be developed for this prediction purpose, and using these predictions in the smart space, necessary actions can be taken to avoid mishaps.

2.3 GRU

Different variants of the GRU model have been discussed and compared. They have been tested on the IMDB and MNIST datasets, and their results have been discussed to give insight into which variant performs better [11]. GRU has also been used in a variant of the RNN, the gated feedback RNN; the advantages of GRU and a comparison with LSTM have been made for the language modelling problem, and the results of the LSTM and GRU variants have been detailed [12].


Traffic flow prediction has been modelled using both LSTM and GRU. The benefits of using LSTM and GRU for such time series data have been discussed, and the results of both models have been analysed [13].

3 Methodology

The proposed smoke detection model architecture is presented in Fig. 1.

3.1 Process

For our proposed model, the input is obtained as a live feed from a video camera. Images are sampled from the input at a set frame rate (fps). These images are then passed into a noise reduction layer, where some noise is removed. Each preprocessed image is then fed into another layer for extracting the necessary features. The data obtained are used as input for both the LSTM and the GRU models, which are discussed later in this paper; the models then predict the result.

Fig. 1 Schematic diagram showcasing the workflow of the proposed model


3.2 LSTM

Long short-term memory: an LSTM cell contains four primary components, a memory cell, an input gate, a forget gate, and an output gate, with the sigmoid function as activation (Fig. 2).

The forget gate (f_t) regulates the flow of data by keeping only relevant and required information over time and discarding the irrelevant information. It does so by taking input from the previous hidden state and the current input and passing it into a sigmoid function.

The input gate (i_t) regulates the input to the LSTM cell. It takes the previous hidden state and the current input and passes them into a sigmoid function, and also passes the previous hidden state and the current input through a tanh function; the outputs of both are then multiplied. The cell state is point-wise multiplied by the forget vector, so some values may be dropped if they are multiplied by values near zero; then point-wise addition with the input gate's output updates the cell state with new values.

The output gate (o_t) produces the hidden state. The previous hidden state and the current input are given to a sigmoid function, the newly modified cell state is passed to a tanh function, and the tanh output is multiplied with the sigmoid output to produce the final output, which is the hidden state.

The recurrent connection between the current hidden layer and the previous hidden layer is W, the weight matrix mapping the inputs to the hidden layer is denoted by U, and the candidate state (C̃) is calculated from the previous hidden state and the current input.

Fig. 2 Diagram illustrating the gates and functioning of a LSTM cell


i_t = \sigma(x_t U^i + h_{t-1} W^i)
f_t = \sigma(x_t U^f + h_{t-1} W^f)
o_t = \sigma(x_t U^o + h_{t-1} W^o)
\tilde{C}_t = \tanh(x_t U^g + h_{t-1} W^g)
C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t
h_t = \tanh(C_t) \ast o_t
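A minimal NumPy sketch of one LSTM time step following these equations is given below; the weight dictionaries U and W are illustrative, and biases are omitted to match the equations above.

```python
# Minimal NumPy sketch of one LSTM time step following the equations above.
# U and W are dicts of weight matrices keyed by "i", "f", "o", "g";
# biases are omitted to match the equations as written.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, U, W):
    i_t = sigmoid(x_t @ U["i"] + h_prev @ W["i"])   # input gate
    f_t = sigmoid(x_t @ U["f"] + h_prev @ W["f"])   # forget gate
    o_t = sigmoid(x_t @ U["o"] + h_prev @ W["o"])   # output gate
    g_t = np.tanh(x_t @ U["g"] + h_prev @ W["g"])   # candidate cell state
    c_t = f_t * c_prev + i_t * g_t                  # new cell state
    h_t = np.tanh(c_t) * o_t                        # new hidden state
    return h_t, c_t
```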

3.3 The LSTM Model

The extracted features are given as input to the model. The data obtained are split into train and test data. From here, we use two different models; the training data are fed into a sequential model. One model contains only a single LSTM layer with a cell count of 128, and the other model contains four (stacked) LSTM layers with cell counts of 256, 128, 64, and 32. Each layer has a dropout of 30%. A dense layer is then added to them, and the model is compiled with binary cross-entropy loss and the "Adam" optimizer. It is presented in Fig. 3.

Fig. 3 Schematic diagram showcasing the workflow of both the LSTM models
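As an illustration of the stacked variant described above (cell counts 256, 128, 64, and 32, 30% dropout per layer, a dense output layer, binary cross-entropy, and the Adam optimizer), a hedged Keras sketch is shown below; the input shape is an assumption, not taken from the paper.

```python
# Sketch of the stacked LSTM variant described above (256/128/64/32 cells,
# 30% dropout per layer, dense output, binary cross-entropy with Adam).
# The input shape (timesteps, n_features) is an illustrative assumption.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_stacked_lstm(timesteps, n_features):
    model = Sequential([
        LSTM(256, return_sequences=True, input_shape=(timesteps, n_features)),
        Dropout(0.3),
        LSTM(128, return_sequences=True),
        Dropout(0.3),
        LSTM(64, return_sequences=True),
        Dropout(0.3),
        LSTM(32),
        Dropout(0.3),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```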


Fig. 4 Diagram illustrating the gates and functioning of a GRU cell

3.4 GRU

Gated recurrent unit: a GRU cell contains two basic gates, an update gate and a reset gate (Fig. 4). The update gate decides how much memory from the previous state to keep, while the reset gate combines the previous output with the new input. The gate equations are:

z_t = \sigma(W_z \cdot [h_{t-1}, x_t])
r_t = \sigma(W_r \cdot [h_{t-1}, x_t])
\tilde{h}_t = \tanh(W \cdot [r_t \ast h_{t-1}, x_t])
h_t = (1 - z_t) \ast h_{t-1} + z_t \ast \tilde{h}_t

3.5 The GRU Model

The extracted features are given as input to the model. The data obtained are split into train and test data. From here, we use two different models; the training data are fed into a sequential model. One model contains only a single GRU layer with a cell count of 128, and the other model contains four (stacked) GRU layers with cell counts of 256, 128, 64, and 32. Each layer has a dropout of 30%. A dense layer is then added to them, and the model is compiled with binary cross-entropy loss and the "Adam" optimizer. Optimization can also be explored in future work [14, 15]. It is presented in Fig. 5.


Fig. 5 Schematic diagram showcasing the workflow of both the GRU models
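The GRU variant mirrors the LSTM sketch given earlier, with GRU layers swapped in under the same assumptions:

```python
# Same architecture as the stacked LSTM sketch, with GRU layers swapped in
# (256/128/64/32 units, 30% dropout, binary cross-entropy, Adam).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dropout, Dense

def build_stacked_gru(timesteps, n_features):
    model = Sequential([
        GRU(256, return_sequences=True, input_shape=(timesteps, n_features)),
        Dropout(0.3),
        GRU(128, return_sequences=True),
        Dropout(0.3),
        GRU(64, return_sequences=True),
        Dropout(0.3),
        GRU(32),
        Dropout(0.3),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```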

4 Experiment and Analysis

We carried out our experiments using Google Colab, and the model was developed in Python. The input to the model was videos showing the development of smoke from its initial stages. These videos were chosen such that the place in which the smoke was present was an indoor environment, because we primarily focus on smart spaces. We used four videos of around two minutes each, which were fed as input to the model. The features of smoke, mainly the area and colour of the smoke, were extracted from the images using a semantic segmentation method with the U-Net architecture. The data thus obtained were given as input to both the LSTM and GRU models. As discussed earlier in this paper, we trained both models using four (stacked) layers of LSTM and GRU; in addition, we also tried using only a single layer of LSTM and GRU to compare their performance. We took the train and test split of the data to be 80% and 20%, respectively. On analysing the results, the accuracies of both LSTM and GRU are found to be around 84–86% in both models. The model accuracy graphs show that both models reach a significant accuracy value within the first 30 epochs. It is clear from Figs. 6 and 7 that the accuracy of the models during validation is better for the model with stacked layers than for the model with only a single layer. The model loss for LSTM and GRU is 30% in both models.


Fig. 6 Performance of LSTM and GRU models with stacked layers

Fig. 7 Performance of LSTM and GRU models with a single layer


Table 1 Comparison of metrics

Metrics | Accuracy | Validation accuracy | Loss | Validation loss
LSTM—single layer | 0.86 | 0.85 | 0.26 | 0.28
LSTM—stacked | 0.86 | 0.845 | 0.27 | 0.29
GRU—single layer | 0.835 | 0.825 | 0.31 | 0.29
GRU—stacked | 0.84 | 0.84 | 0.32 | 0.30

validation for the model with a single layer is not as good as the model with stacked layers which can be observed from Figs. 6 and 7. The summary of the observations after 100 epochs is shown in Table 1.

5 Result and Conclusion To conclude, we used two deep learning methods for a smoke prediction model in this paper. We can clearly see that there is no certain winner. Both LSTM and GRU gave good results. But from the observations, we can say that GRU did not perform any better than LSTM as throughout the epochs, and LSTM gave a slightly better prediction than GRU. We can also observe that during validation, model with stacked layers of both neural networks gave better predictions. Acknowledgements This proposed work is a part of the project supported by Department of Science and Technology (DST/TWF Division/AFW for EM/C/2017/121) project titled “A framework for event modelling and detection for Smart Buildings using Vision Systems”. The authors would also like to acknowledge Mrs. Anitha Vadivel B.E, MBA—Data Analyst Consultant for her suggestions on data preprocessing and deep learning approaches to be used.

References 1. A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, S. Savarese, Social LSTM: human trajectory prediction in crowded spaces, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016) pp. 961–971 2. F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J.R. Hershey, B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR, in International Conference on Latent Variable Analysis and Signal Separation (Springer, Cham, 2015), pp. 91–99 3. S.H.I. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, W. Woo, Convolutional LSTM network: a machine learning approach for precipitation nowcasting, in Advances in Neural Information Processing Systems (2015), pp. 802–810 4. H. Liu, X. Mi, Y. Li, Smart multi-step deep learning model for wind speed forecasting based on variational mode decomposition, singular spectrum analysis, LSTM network and ELM. Energy Convers. Manage. 159, 54–64 (2018)


5. K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, J. Schmidhuber, LSTM: a search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28(10), 2222–2232 (2016)
6. A. Cocchia, Smart and digital city: a systematic literature review, in Smart City (Springer, Cham, 2014), pp. 13–43
7. A. Visvizi, M.D. Lytras, Rescaling and refocusing smart cities research: from mega cities to smart villages. J. Sci. Technol. Policy Manage. 9(2), 134–145 (2018)
8. X. Wang, J.S. Dong, C.-Y. Chin, S.R. Hettiarachchi, D. Zhang, Semantic space: an infrastructure for smart spaces. IEEE Pervasive Comput. 3(3), 32–39 (2004)
9. J. Patel, G. Panchal, An IoT-based portable smart meeting space with real-time room occupancy, in Intelligent Communication and Computational Technologies (Springer, Singapore, 2018), pp. 35–42
10. Y. Kim, J. An, M. Lee, Y. Lee, An activity-embedding approach for next-activity prediction in a multi-user smart space, in 2017 IEEE International Conference on Smart Computing (SMARTCOMP) (IEEE, 2017), pp. 1–6
11. R. Dey, F.M. Salem, Gate-variants of gated recurrent unit (GRU) neural networks, in 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS) (IEEE, 2017), pp. 1597–1600
12. J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Gated feedback recurrent neural networks, in International Conference on Machine Learning (2015), pp. 2067–2075
13. R. Fu, Z. Zhang, L. Li, Using LSTM and GRU neural network methods for traffic flow prediction, in 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC) (IEEE, 2016), pp. 324–328
14. K.S. Gautam, T. Senthil Kumar, Video analytics-based intelligent surveillance system for smart buildings, in Springer Soft Computing (2019), pp. 2813–2837
15. K.S. Gautam, T. Senthil Kumar Thangavel, Video analytics-based facial emotion recognition system for smart buildings. Int. J. Comput. Appl. 1–10 (2019) (Taylor and Francis)

S. Vejay Karthy is currently pursuing his B.Tech in computer science and engineering in Amrita School of Engineering, Coimbatore. He is passionate and interested in areas of machine learning and data science. He is currently improving his expertise with active participation in data science.


Dr. Thangavel Senthil Kumar is Associate Professor at the Department of Computer Science and Engineering at Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore. He received his doctorate in Information Communication Engineering from Anna University, Chennai. His areas of interest include video surveillance, cloud computing, software engineering, video processing, wireless sensor networks, big data computing and embedded automation. He is guiding scholars with Amrita Vishwa Vidyapeetham in the area of video analytics and intrusion detection systems. He is a reviewer for the Elsevier Computers & Electrical Engineering Journal. He is associated with projects funded by the Department of Science and Technology, IBM and the Ministry of Tribal Affairs. Dr. Latha Parameswaran is currently Professor in the Department of Computer Science and Engineering. She completed her master’s degree from PSG College of Technology and her Ph.D. from Bharathiar University. She was in the software industry for ten years prior to joining Amrita. Her areas of research include image processing, information retrieval, image mining, information security and theoretical computer science. In addition to teaching and serving as Chair of the CSE department, she is currently guiding Ph.D. research scholars in Amrita, Bharathiar, and Anna Universities. She is in the doctoral committee for many Ph.D. scholars in various universities.

A Mobile-Based Framework for Detecting Objects Using SSD-MobileNet in Indoor Environment K. K. R. Sanjay Kumar, Goutham Subramani, Senthil Kumar Thangavel, and Latha Parameswaran

Abstract Object detection has a prominent role in image recognition and identification. Neural network approaches are increasingly applied to image processing, classification and detection as datasets grow larger and more complex. With the collection of large amounts of data, faster and more efficient GPUs and better algorithms, computers can be trained conveniently to detect and classify multiple objects within an image with high accuracy. Single-shot detector (SSD) with MobileNet is predominantly used as it is a gateway to other tasks/problems such as delineating object boundaries, classifying/categorizing the object, identifying sub-objects, tracking and estimating an object’s parameters and reconstructing the object. This research demonstrates an approach to train convolutional neural network (CNN) based multiclass as well as single-class object detection classifiers and then deploy the model to an Android device. SSD achieves a good balance between speed and accuracy. SSD runs a convolution network on the image, which is fed into the system only once, and produces a feature map. SSD on MobileNet has the highest mAP among the models targeted for real-time processing. This algorithm combines the SSD architecture and MobileNet for faster processing and a greater detection ratio. Keywords Object detection · Deep learning · SSD · MobileNet

K. K. R. Sanjay Kumar · G. Subramani · S. K. Thangavel (B) · L. Parameswaran Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India e-mail: [email protected] K. K. R. Sanjay Kumar e-mail: [email protected] G. Subramani e-mail: [email protected] L. Parameswaran e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_6


1 Introduction Object recognition and detection are crucial components in a vast range of applications in many fields of tracking and identification of objects. The model’s efficiency depends on several parameters, including the total number of weight parameters, the total count of training images, the model architecture and the amount of computing power required to test the model. Nonetheless, the above-mentioned techniques for improving the accuracy of the model do not automatically improve its efficiency in terms of speed and size. Recognition must be carried out efficiently on computationally constrained platforms for real-world applications such as augmented reality (AR), self-driving cars and robotics. Training a single-shot detector model to find key points requires a vast number of input images of the object. The coordinates of the key points must be designated and located in identical order. A dataset of several hundred images would not be enough for training the network; therefore, the dataset must be increased via augmentation of the available data. Before starting augmentation, the key points must be designated in the images, and the dataset should be divided into training and control parts. Detecting the objects someone carries when entering or leaving a room is very important and helpful for safety and security; it can also help people who might forget their personal belongings when leaving the building. There are two ways of training object detectors: online and offline. The online method trains the detector on a set of constructed images in the initial frame; the detector then finds the required objects in the next few frames and is updated online using the later frames. The offline method runs the detector on still images, and a filter correlates the objects across the video frames. The rest of the paper is structured as follows. Section 2 describes the related work that other researchers in this field have contributed. Section 3 introduces various algorithms that are generally used. Section 4 describes the dataset that we used to train the model efficiently. Section 5 explains the model that we developed using the cloud and how the SSD architecture works to make an effective model. Section 6 then presents the results that we achieved after implementation. Section 7 consists of the conclusion, followed by the acknowledgement and the sources referenced for our research.

2 Related Work Recent research has shown great interest in the construction of effective neural networks for tasks including, but not limited to, optical character recognition, object tracking and object extraction from video.


2.1 Object Detection and CNN Wong et al. [1]: This paper uses a comprehensive architecture, Tiny SSD, a compact neural network for real-time object identification. This SSD consists of two parts: a stack of fire sub-networks and an optimized SSD-based convolution. One of the major problems the authors faced was determining the standard microarchitecture that balances speed and performance. The dataset used consists of different types of pictures annotated thoroughly with bounding boxes for 20 object classes. The Tiny SSD work by Wong achieved 61.3% mAP, while the Tiny YOLO microarchitecture achieved 57.3% mAP. Wu and Zhang [2]: The authors work on an SSD model which uses the 16-layer VGG-16 architecture as the base network, with additional feature layers appended to the end of the base network. The authors report that the single-shot detector achieved the best metric for the problem. SSD has a higher frame rate, and it also identifies fire early while retaining high detection accuracy. When training ceases, the fire identification accuracy is found to be 1, except for small-scale fire, for which it is 0.57. For smoke identification, the detection efficiency on a number of test images is 0, but 13 of those 28 images contain no smoke, so the missed-detection rate amounts to 13%. For the remaining 73 images, the model detects smoke with an accuracy of up to 97.88%. Zhao and Li [3]: This paper proposes an efficient method for detecting objects at low resolution. The problem is tackled by their model, the residual super-resolution single-shot network (RSRSSN). There are three stages in this model: feature representation, feature mapping and feature combination. The features of SSD are used for this very low-resolution detection (VLRD). The RSRSSN is trained using both low-quality images and the corresponding high-resolution pictures. The results show that RSRSSN achieves 73.5 mAP, while the vanilla SSD achieves 68.6 mAP. He et al. [4]: The authors used Faster RCNN as the detection approach. They experimented with the improvements obtained by replacing VGG-16 with ResNet-101. The recognition is implemented using both models, so that the gains can be attributed solely to the better network. This paper has set the path for the automation of driving vehicles.

2.2 Deep Learning Wang and Yu [5]: This paper mainly examines the average prediction performance of vehicle-model detectors and their generalization ability on real road vehicle images. These detectors were used to assess whether the object detection algorithm can be utilized for vehicle identification and classification.


2.3 Single-Shot Detector Chen et al. [6]: This paper provides an intelligent approach to transportation by building traffic surveillance, vehicle detection and counting with Fast SSD. Unwanted objects are filtered out using virtual loop detectors. The amount of computation is greatly reduced, and efficiency is increased, by using loop identifiers. The model is more agile on other computing platforms, which gives it greater efficiency. Wang and Du [7]: The authors used an SSD based on a CNN for SAR object identification; training the convolutional neural network is crucial for the single-shot detector. Training an efficient CNN model requires a large quantity of images, and the available synthetic aperture radar training inputs are insufficient to learn an efficient CNN model. Cao et al. [8]: The goal of this paper is to provide fast detection for small objects using a feature-fused SSD. The paper introduces SSD with two multilevel feature fusion modules. Experimental results show that these two variants achieve higher mAP on the PASCAL Visual Object Classes Challenge 2007 than the single-shot detector baseline, by 1.5 and 1.6 points, respectively.

2.4 Deep Learning and Object Detection Zhou et al. [9]: This paper examines the effect of deep learning, its applications and the impact of deep learning datasets, using recent, faster datasets. The mean average precision (mAP) for each class used in their example is high for large-sized classes compared with the low mAP for small classes [10]. Galvez et al. [11]: This paper gives an introduction to convolutional neural networks used in object detection. The paper also discusses a robust technique in which previously trained models are utilized for feature extraction and fine-tuning. The SSD with MobileNetV1 has an mAP of 0.051, and Faster RCNN combined with Inception V2 has 0.440. Tang et al. [12]: This work first introduces various approaches to object detection, then discusses the relationships and differences among the methodologies and the deep learning strategies used in object detection. Finally, it makes a precise analysis of the challenges faced in deep learning-based object detection.

2.5 Android and Deep Learning Xu et al. [13]: The research depicts the deployment of deep learning in Android applications using various frameworks. The main goal and methodology include the research goals, a workflow overview and application analysis.


There are three major characteristics of the DL apps: the views of the application on smartphones, the deep learning aspect, and the analysis of frameworks. On the whole, this paper carried out the first study of the Android framework to understand the utilization of deep learning and the connection between analysis and practice [14]. Zou et al. [15]: The paper traces the evolution of object detection from traditional detectors such as the Viola–Jones detector, the HOG detector and the deformable part-based model (DPM) to CNN-based two-stage detectors, e.g., RCNN, SPPNet, Fast RCNN, Faster RCNN and Feature Pyramid Networks, and CNN-based one-stage detectors such as RetinaNet.

3 Algorithms Used for Object Detection 3.1 Object Detection Using HOG Features The histogram of oriented gradients (HOG) is a feature descriptor used in the domain of image processing for the purpose of object identification. The technique counts occurrences of gradient orientations in localized portions of an image. HOG converts a pixel-based representation into a gradient-based one and is often used with multi-scale pyramids and linear classification techniques. HOG features are easy to use, fast and can be customized.
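A minimal sketch of HOG feature extraction with scikit-image is given below; the sample image and the cell/block sizes are illustrative choices, not parameters from this paper.

```python
from skimage import color, data
from skimage.feature import hog

# Convert a sample image to grayscale; HOG operates on intensity gradients.
image = color.rgb2gray(data.astronaut())

# Compute the HOG descriptor: 9 orientation bins, 8x8-pixel cells,
# 2x2-cell blocks; visualize=True also returns a gradient image.
features, hog_image = hog(image, orientations=9, pixels_per_cell=(8, 8),
                          cells_per_block=(2, 2), visualize=True)

# The flattened feature vector can then be fed to a linear classifier (e.g. SVM).
print(features.shape)
```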

3.2 Region-Based Convolutional Neural Network In RCNN, the image is first partitioned into approximately 2000 region proposals, and then a CNN is applied to each region separately. The size of each region is determined, and the resized region is fed into the neural network. Time is the biggest obstacle to this method: since the CNN is applied to every region of the image independently, training time is about eighty-three hours and prediction time is about fifty-one seconds.

3.3 YOLO (You Only Look Once) The bounding boxes and the class probabilities are predicted by a single CNN. YOLO takes an image and divides it into an N × N grid. Each cell in the grid predicts m bounding boxes. A confidence score indicates how certain the method is that a bounding box surrounds an object of interest.


Table 1 Various classes utilized for object detection, resolution and the number of images

S. No.   Classes   Resolution (px)                        No. of images
1        Human     640 × 480, 1920 × 1080, 2304 × 1296    5981
2        CPU       640 × 480                              235
3        Monitor   640 × 480                              235
4        Chair     640 × 480, 4608 × 3456                 1360
5        Laptop    640 × 480                              1600

3.4 Single-Shot Multibox Detection (SSD) SSD primarily uses the VGG16 architecture. SSD basically comprises two steps: the first is to extract feature maps, and the next is to apply convolution filters to detect objects. Each prediction comprises a boundary box and class scores; the class with the highest score is taken as the class of the bounded object.

4 Dataset Description The dataset consists of 9400 images. The images were generated using our experimental setup, which consists of 4 MP and 1.3 MP IP cameras with a powerful 20× optical and 16× digital zoom. The dataset contains images ranging from lower to higher resolution, captured under different lighting conditions, different orientations and occlusion [16]. It consists of five classes, as in Table 1. It also includes 22 videos taken by the IP camera in the closed environment, with two different resolutions: 640 × 480 px and 2304 × 1296 px.

5 Proposed Model Single-shot multibox detector (SSD) is one of the prominent algorithmic approaches used in the field of object identification. In MobileNetV2, there are two types of blocks. One is the residual block with a stride of 1. The other is a block with a stride of 2, which is used for downsizing (Fig. 1). Both types of blocks have three layers. The first layer is a 1 × 1 convolution with ReLU6. The second layer is the depthwise convolution. The last layer is a 1 × 1 convolution without any nonlinearity.


Fig. 1 Building block of MobileNetV2

The first layer, the 1 × 1 convolution, is the key layer at the beginning of the block; its primary purpose is to broaden the number of channels in the image before it moves into the depthwise convolution. Because of this expansion, the depthwise layer always operates on more channels than the block receives as input. The default expansion factor is 6. The depthwise layer, unlike a regular convolution, does not combine the input channels; it operates on each channel separately. For a three-channel input, the depthwise convolution generates an output having three channels, each with its own set of weights. The depthwise convolution is mainly used to filter the input channels. Projection layer—it projects data with a large number of channels into a tensor with a much smaller number of channels. This layer is also called the bottleneck layer, as it decreases the amount of data that flows through the network (Fig. 2).


Fig. 2 SSD-MobileNet

The responsibility of the layers in the MobileNet backbone is to convert the pixels of the input image into features that describe the contents of the image and to pass these along to the subsequent layers. MobileNet thus serves as the feature-extraction architecture for the second network, the SSD detection head.
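A minimal Keras sketch of the expansion → depthwise → linear-projection block described above is shown below; the input shape, channel counts and layer parameters are illustrative assumptions rather than the exact configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual_block(x, filters, stride, expansion=6):
    in_channels = x.shape[-1]
    # 1x1 expansion convolution with ReLU6 widens the channel count.
    h = layers.Conv2D(in_channels * expansion, 1, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # 3x3 depthwise convolution filters each channel separately.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    # 1x1 linear projection (bottleneck): no nonlinearity after it.
    h = layers.Conv2D(filters, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # Residual connection only for stride-1 blocks with matching shapes.
    if stride == 1 and in_channels == filters:
        h = layers.Add()([x, h])
    return h

inputs = tf.keras.Input(shape=(224, 224, 32))
outputs = inverted_residual_block(inputs, filters=32, stride=1)
model = tf.keras.Model(inputs, outputs)
model.summary()
```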

6 Results and Discussion Many recent studies on object detection and extraction are developed on the Android framework using image recognition and processing algorithms. SSD-MobileNet uses shallow layers to predict small objects and deeper layers of the network to predict large objects, since small objects do not require large receptive fields, and overly large receptive fields can make small objects harder to detect. When an image is provided to the model, every prior is annotated with a bounding box and the corresponding class label. As a huge number of images is needed for training, we collected about 5000 images from various sources. For training, we considered several options and decided to use IBM’s Cloud Annotations CLI (CACLI), as it eliminates the need for a local GPU and excessive training time. We stored the training images in cloud object storage. For annotating the images, we used the cloud annotation tool to specify each object’s position manually. We trained the model using the above-mentioned command-line interface for 500 iterations. The output model was downloaded after training completed. We created an Android application which uses the device camera to detect objects with the help of the model we trained previously (Fig. 3). It can be noted that objects which are small in size are not detected well when using these algorithms; smaller targets have their detection efficiency slightly reduced, such as the chair which is far away from the object of focus (Fig. 4). After training, some results on the test set are shown in Fig. 5.
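For illustration, the sketch below runs a converted detection model with the TensorFlow Lite interpreter in Python; the model path, the input size and the output ordering are assumptions about a typical SSD-MobileNet export, not details taken from this paper (on Android the equivalent interpreter API is used from Java/Kotlin).

```python
import numpy as np
import tensorflow as tf

# Hypothetical path to an exported TFLite detection model.
interpreter = tf.lite.Interpreter(model_path="ssd_mobilenet.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy 300x300 RGB frame standing in for a camera capture.
frame = np.random.randint(0, 255, size=(1, 300, 300, 3), dtype=np.uint8)
interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

# Typical SSD exports return boxes, classes, scores and a detection count;
# the exact order and dtypes depend on the converted model.
outputs = [interpreter.get_tensor(d["index"]) for d in output_details]
print([o.shape for o in outputs])
```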


Fig. 3 Visualization of our model’s performance through the training process

Fig. 4 Accuracy versus MACs for various architectures

The image shows the detection of two distinct chairs present in the scene. The model annotates both chairs accurately and labels them correctly. This illustrates that the created model is marked by exactness in detecting the object. Another image depicts three objects, two of which overlap each other; the model correctly identifies both overlapped chairs as well as the separate one (Table 2).


Fig. 5 Detection of chair

Table 2 Type of images that we have trained and obtained in the model

Type       Category         # of images
Positive   Chair            991
Positive   Rolling chair    724
Negative   Desk             341
Negative   Door             765
Negative   Phone            913
Negative   Computer mouse   472
Negative   Notebook         291

7 Conclusion In this paper, we created an Android application which uses a model trained on our images using cloud services. The application detects the objects on which it was trained. Objects are detected and annotated using our model, and the results are displayed in our Android application. The user interface has been designed to integrate with the TensorFlow model. Objects are effectively detected and extracted in varied situations, and both foreground and background objects are detected efficiently. Acknowledgements This proposed work is a part of the project supported by DST (DST/TWF Division/AFW for EM/C/2017/121) project titled “A framework for event modeling and detection for Smart Buildings using Vision Systems”.


References

1. A. Wong, M.J. Shafiee, F. Li, B. Chwyl, Tiny SSD: a tiny single-shot detection deep convolutional neural network for real-time embedded object detection (Department of Systems Design Engineering, University of Waterloo, DarwinAI, 2018)
2. S. Wu, L. Zhang, R-CNN for small object detection, in Asian Conference on Computer Vision (2016), pp. 214–230
3. X. Zhao, W. Li, Residual super-resolution single shot network for low-resolution object detection (2016)
4. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition (2016)
5. H. Wang, Y. Yu, A vehicle recognition algorithm based on deep transfer learning with a multiple feature subspace distribution (2018)
6. L. Chen, Z. Zhang, L. Peng, Fast single shot multibox detector and its application on vehicle counting system (2018)
7. Z. Wang, L. Du, SAR target detection based on SSD with data augmentation and transfer learning (2019)
8. G. Cao, X. Xie, W. Yang, Q. Liao, G. Shi, J. Wu, Feature-fused SSD: fast detection for small objects (2018)
9. X. Zhou, W. Gong, W. Fu, F. Du, Application of deep learning in object detection (2017)
10. K.S. Gautam, S.K. Thangavel, Video analytics-based intelligent surveillance system for smart buildings. Soft Comput. 2813–2837 (2019)
11. R.L. Galvez, A.A. Bandala, E.P. Dadios, Application of deep learning in object detection (2017)
12. C. Tang, Y. Feng, X. Yang, C. Zheng, Y. Zhou, The object detection based on deep learning (2017)
13. M. Xu, J. Liu, Y. Liu, F.X. Lin, Y. Liu, X. Liu, A first look at deep learning apps on smartphones (2019)
14. K.S. Gautam, S.K. Thangavel, Video analytics-based facial emotion recognition system for smart buildings. Int. J. Comput. Appl. 1–10 (2019)
15. Z. Zou, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years (2019)
16. S. Frizzi, R. Kaabi, M. Bouchouicha, J.-M. Ginoux, E. Moreau, F. Fnaiech, Convolutional neural network for video fire and smoke detection, in IECON 2016—42nd Annual Conference of the IEEE Industrial Electronics Society (IEEE, 2016), pp. 877–882

K. K. R. Sanjay Kumar is pursuing his B.Tech in Computer Science and Engineering at Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore. His areas of interest are cloud computing, image processing and deep learning.


Goutham Subramani is pursuing his B.Tech in Computer Science and Engineering at Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore. His areas of interest are image processing, cloud computing and deep learning.

Dr. Thangavel Senthil Kumar is an Associate Professor at the Department of Computer Science and Engineering at Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore. He received a doctorate in Information Communication Engineering from Anna University, Chennai. His areas of interest include video surveillance, cloud computing, software engineering, video processing, wireless sensor networks, big data computing and embedded automation. He is guiding scholars with Amrita Vishwa Vidyapeetham in the area of video analytics and intrusion detection systems. He is a reviewer for the Elsevier Computers and Electrical Engineering Journal. He is associated with projects funded by the Department of Science and Technology, IBM and the Ministry of Tribal Affairs. Dr. Latha Parameswaran is currently the Professor and Chairperson of the Department of Computer Science and Engineering. She completed her Master’s Degree from PSG College of Technology and her Ph.D. from Bharathiar University. She was in the software industry for ten years prior to joining Amrita. Her areas of research include image processing, information retrieval, image mining, information security and theoretical computer science. In addition to teaching and serving as Chair of the CSE department, she is currently guiding Ph.D. research scholars in Amrita, Bharathiar and Anna Universities. She is in the doctoral committee for many Ph.D. scholars in various universities.

Privacy-Preserving Big Data Publication: (K, L) Anonymity J. Andrew and J. Karthikeyan

Abstract The explosion in the variety and volume of information in the public domain provides an enormous opportunity for analysis and business purposes. Availability of private information is of particular interest in enabling highly tailored services tuned to individual needs. Though this is highly favorable to individuals, conventional anonymization techniques still pose threats to the privacy of individuals through reidentification attacks. The focus of this paper is to propose a privacy-preserving approach called (K, L) Anonymity that combines k-anonymity and Laplace differential privacy techniques. This coherent model guards against linkage attacks, and the experimental results show that the risk is mitigated. The proposed model also addresses the shortcomings of other traditional privacy-preserving mechanisms and is validated with publicly available datasets. Keywords k-anonymity · Privacy-preserving · Differential privacy · Data privacy · Data publication

1 Introduction In the present information-driven world, securing the privacy of people’s data is absolutely critical to information caretakers, both as a moral consideration and as a legal necessity. The information gathered by governments and companies potentially holds incredible insights for scientific research and business purposes, yet to realize its maximum potential it must be possible to share the information.

1 Introduction In the present information-driven world, securing the privacy of people’s data is absolutely critical to information caretakers, both as a moral thought and as a legitimate necessity. The information gathered by governments and companies conceivably holds incredible insights for logical research and business purposes, yet to release its maximum capacity it must be conceivable to share information. Sharing or distributing information while at the same time safeguarding individuals’ privacy has J. Andrew (B) Department of Computer Science and Engineering, Karunya Institute of Technology and Sciences, Coimbatore, India e-mail: [email protected] School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India J. Karthikeyan School of Information Technology, Vellore Institute of Technology, Vellore, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_7


Sharing or distributing information while at the same time safeguarding individuals’ privacy has attracted the attention of researchers in statistics, computer science and legal fields for a long time. Data anonymization is one of the prevalent privacy-preserving mechanisms for data publication. With anonymization as the base, various approaches have evolved to deal with data privacy. The k-anonymity approach was first introduced by Sweeney [1]. This approach generalizes the individual’s data in the dataset and makes each record indistinguishable from at least k − 1 other records. Differential privacy is another type of privacy-preserving mechanism which adds random noise to the records to prevent the actual data from leaking. Despite the fact that these systems have protected privacy to some extent, they have numerous drawbacks. A significant drawback of the k-anonymity approach is linkage attacks, which lead to privacy leaks. Differential privacy suffers from poor data utility, as the records are anonymized with random noise. In this paper, we propose (K, L) Anonymity, a privacy-preserving framework that reaps the benefits of k-anonymity and Laplace differential privacy techniques. The proposed model benefits from the deterministic indistinguishability of k-anonymity and the stochastic indistinguishability of Laplace differential privacy. This model addresses the drawbacks of traditional privacy-preserving mechanisms such as multiple sensitive attributes, reidentification attacks and data utility. The k-anonymity technique is utilized to anonymize the quasi-identifiers and to create clusters. The differential privacy mechanism is utilized to add noise to the clusters to further reduce the risk of privacy breach and to provide better utility of the data. The structure of the article is as follows. The related works on privacy-preserving techniques are presented in Sect. 2. Section 3 discusses the proposed algorithms. The experimental analysis and performance analysis are presented in Sect. 4. Finally, the conclusion and future scope are given in Sect. 5.

2 Background 2.1 Related Works Sweeney first proposed the k-anonymity [2] model for tabular data publication. Since then, different versions of the k-anonymity technique have been proposed by various researchers. Some noteworthy techniques are l-diversity [3] and t-closeness [4]: where k-anonymity generalizes the individual records and makes each indistinguishable from at least k − 1 records, l-diversity and t-closeness bring in diversity among the sensitive attributes in the records to prevent the privacy risk. However, these techniques are prone to reidentification attacks and inference attacks. Sei et al. [5] proposed an anonymization method by combining l-diversity and t-closeness. This method does not separate the quasi-identifiers and sensitive identifiers but considers them as sensitive quasi-identifiers.


An anonymization algorithm based on generalization and a reconstruction algorithm that reduces the errors in the reconstruction of data for analytical purposes are used in this method. This method maintains the quality of data while preserving privacy. Wong et al. [6] proposed an (α, k)-anonymity model to address reidentification and sensitive-relationship privacy attacks. This model ensures the k-anonymity property is satisfied along with α-disassociation on the sensitive attributes. Global and local recoding based algorithms are proposed to optimize the data transformation in order to satisfy the (α, k)-anonymity property. The proposed model also addressed the time complexity of the model. The attribute disclosure issue of the k-anonymity model is addressed in [7]. The author introduced a privacy model called p-sensitive k-anonymity. The model property is said to be satisfied if there are at least k − 1 distinct quasi-identifier groups and each group contains at least p different sensitive attribute values. This privacy model is not suitable for every dataset, as it expects the sensitive attributes to be uniformly distributed in the dataset. This drawback is addressed by the p+ -sensitive k-anonymity model proposed in [8]. In this model, the k-anonymity property has to be satisfied and each set of quasi-identifiers should have at least p sensitive attribute values. The model uses a top-down local recoding algorithm to transform the datasets and mainly uses generalization and suppression techniques. Andrew et al. [9] proposed a Mondrian based multidimensional anonymization technique. The author addressed the multidimensionality of the dataset, where the traditional k-anonymity model fails to achieve good utility. The Mondrian based technique brings dimensionality into the anonymization groups, and thus the data utility is enhanced. A (k, ε) anonymity model is proposed by Holohan et al. [10]. This privacy model comprises the traditional k-anonymity and differential privacy techniques. In this model, the quasi-identifiers are divided into two groups based on their relationship to the sensitive attributes. The k-anonymity technique is applied to one group to anonymize the data, and the differential privacy technique is applied to the other group to reduce the relationship with the sensitive attributes. The combination of the two techniques helps to control the information loss during anonymization. Soria-Comas et al. [11] also combined the k-anonymity and differential privacy techniques to address the data utility issue of differential privacy. In this model, all the attributes are considered as quasi-identifiers and they are anonymized using the k-anonymity technique; then ε-differential privacy is applied on the anonymized dataset to further limit the privacy breach and also to enhance the data utility. The microaggregated dataset which is the output of k-anonymity is given as the input for differential privacy to reduce the information loss [12]. δ-DOCA is a privacy-preserving approach based on the differential privacy mechanism, proposed by Leal et al. [13] to protect the privacy of data streams. This approach takes microaggregated data as input. δ controls the sensitivity of the noise to be added to the anonymized data, and thus it regulates the data privacy and data utility in data streams. The workflow of this approach is as follows: the incoming data streams are clustered online based on their attributes, and an anonymization technique like k-anonymity is applied to the clusters to generate the microaggregated data.


Then, based on the value of δ, Laplace noise is added, and the privacy-preserved data can be published for research and other purposes. Li et al. [14] proposed a random sampling technique to bridge the gap between k-anonymity and differential privacy. Differential privacy techniques are generally applied to outputs to protect the original result, which limits data utility. So, the authors proposed a random sampling method to be added at the beginning of the data perturbation. This provides an opportunity to remove the sensitive attributes that do not satisfy the k-anonymity property. The authors also addressed the problem of (ε, δ)-differential privacy, which requires a minimum δ value to provide privacy protection; this is solved through the random sampling method. The (k, δ)-anonymity model [15] claims to preserve the privacy of trajectory data through k-anonymity and a positive value for δ. However, this model has been criticized because the value of δ is inversely proportional to the trajectory k-anonymity [16]. K-aggregation [17] is a method based on k-anonymity and differential privacy techniques. This method focuses on reducing the error rates which occur while adding noise. It does so by dividing the attributes into continuous and discrete ones, to which Laplace noise and exponential noise are added, respectively. Thus, it produces a synthetic dataset with improved precision.

2.2 k-Anonymity k-anonymity is a popular anonymization technique for releasing tabular data. The attributes of a table can be classified into identifiers, sensitive attributes, quasi attributes, and non-sensitive attributes. Identifiers should be removed before release, as they directly identify the individual (e.g., SSN, PAN, E-mail). Non-sensitive attributes do not need any protection because they hold common values for almost all records. The attributes that require protection are the sensitive and quasi attributes. Sensitive attributes such as disease details, salary information, etc., identify an individual and reveal their private information. Quasi attributes do not directly identify an individual, but combined with other attributes they can reveal private information. The property of the k-anonymity technique is to make each record indistinguishable from at least k − 1 other records. This property is achieved through generalization and suppression techniques. In generalization, numerical and categorical attributes are generalized to a specific range. In suppression, the attribute values are masked partially or completely. A table that satisfies the k-anonymity property is said to be anonymized or microaggregated data.
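As a small illustration of generalization and suppression (not the exact algorithm used in this paper), the sketch below generalizes age into bands and suppresses the trailing digits of a ZIP code with pandas; the toy records and the choice of k are hypothetical.

```python
import pandas as pd

# Toy private table: age and zipcode are quasi attributes, salary is sensitive.
df = pd.DataFrame({
    "age": [23, 27, 34, 38, 45, 49],
    "zipcode": ["641101", "641102", "641201", "641202", "641301", "641302"],
    "salary": [30000, 32000, 50000, 52000, 70000, 71000],
})

# Generalization: replace exact ages with 10-year bands.
df["age"] = pd.cut(df["age"], bins=[20, 30, 40, 50],
                   labels=["20-30", "30-40", "40-50"])

# Suppression: mask the last two digits of the zipcode.
df["zipcode"] = df["zipcode"].str[:4] + "**"

# Each quasi-identifier group now contains at least k = 2 records.
k = df.groupby(["age", "zipcode"], observed=True).size().min()
print(df, "\nk =", k)
```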


2.3 Differential Privacy While k-anonymity is dependent on the dataset, Cynthia Dwork introduced the differential privacy [18] mechanism, which is independent of the dataset. Differential privacy perturbs the dataset by adding noise. It is widely used in query processing systems, where the differential noise is added to the output of the query. This mechanism was later used in privacy-preserving data mining (PPDM) and data collection. Differential privacy for privacy preservation adds either Laplace noise or Gaussian noise to the dataset. The ε-differential privacy guarantee satisfied by the Laplace mechanism is given below:

$$\Pr\left[A_{q,d} \in r\right] \le e^{\varepsilon}\,\Pr\left[A_{q,d'} \in r\right]$$

where d and d' represent two neighbouring datasets differing in a single record, and r denotes the response. The Gaussian differential privacy equation is given below:

$$\sigma \ge \frac{\Delta f}{\varepsilon}\sqrt{2\log\left(\frac{1.25}{\delta}\right)}$$

where Δf denotes the sensitivity of the query. The Gaussian mechanism provides randomness in the data to protect privacy. The amount of noise added to the data is controlled by the parameter ε: the smaller the value of ε, the stronger the privacy protection but the lower the data utility. So the ε value should be chosen carefully based on the privacy requirement.
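A small numeric sketch of the two mechanisms just described is given below, assuming a query sensitivity Δf of 1; the Laplace scale Δf/ε and the Gaussian σ mirror the formulas above, while the concrete ε and δ values are only examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_noise(sensitivity, epsilon, size=1):
    # Laplace mechanism: noise drawn with scale b = sensitivity / epsilon.
    return rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=size)

def gaussian_sigma(sensitivity, epsilon, delta):
    # Gaussian mechanism: sigma >= (sensitivity / epsilon) * sqrt(2 log(1.25 / delta)).
    return (sensitivity / epsilon) * np.sqrt(2.0 * np.log(1.25 / delta))

print(laplace_noise(sensitivity=1.0, epsilon=0.5, size=3))
print(gaussian_sigma(sensitivity=1.0, epsilon=0.5, delta=1e-5))
```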

3 Proposed Model In this section, the proposed anonymity model, (K, L) Anonymity, is explained. The algorithms and methods used in the model are presented, and we outline the workflow of the proposed framework. The architecture of the proposed privacy-preserving framework is shown in Fig. 1.

Fig. 1 Privacy-preserving (K, L) anonymity model


At first, the data source has to be determined, from which we can collect personal information such as healthcare information, personal details, bank details, etc. Such data can be used for analytic purposes, so it is essential to protect its privacy before releasing it for analytics. Preprocessing of the data should be done to identify the numerical, categorical, and text data. The proposed model is designed to work with numerical and categorical data. The next step is the anonymization process, where the k-anonymity technique is used to generalize the table. The heuristic algorithm to generalize the tabular data is presented in Algorithm 1. The generalized table is then fed as the input to the differential privacy module, where Laplace noise is added to further limit the privacy breach. The heuristic generalization algorithm is used to generalize the quasi attributes in the dataset. First, the quasi attributes and sensitive attributes are identified based on their impact on privacy. Generalization and suppression techniques are used to anonymize numerical and categorical data. The private user information table, the identified quasi attributes and the k value are provided as the input. For all distinct sequences of the private table, the frequency of the QA values is identified. Then, based on the value of k, the attributes are either generalized or suppressed. Finally, the generalized table is given as the output.

Algorithm 1 Heuristic
Input: PT—Private Table; QA—Quasi Attributes {A1, A2, A3, …, An}; k; Hierarchies
Output: Generalized Table PT[QA] w.r.t. k
Initialization: |PT| ≥ k
1. freq ← ∀ distinct sequence S of PT[QA], all QA in each sequence
2. while (∃ S in freq < k) do
   2.1 let Aj be the attribute in freq with the most distinct values
   2.2 freq ← Generalize Aj
3. freq ← suppress S in freq < k
4. freq ← enforce k requirement on suppressed tuples in freq
5. return Generalized table ← construct from freq

Algorithm 2 shows the steps involved in the proposed (K, L) Anonymity model. The output of Algorithm 1, i.e. the k-anonymized table of equivalence classes, is given as the input, together with the ε parameter that determines the level of noise to be added to the dataset. Then Laplace noise and exponential noise are added to every equivalence class of the generalized table. The distortion and precision are calculated at every step to check the accuracy and indistinguishability of the records. Finally, the (K, L) anonymized table is generated, which can be released for analytics and research purposes.


Algorithm 2 (K, L) Anonymity
Input: Generalized Table GT, ε
Output: (K, L) anonymized table
Initialization: eps = [0.5, 1]
1. for each equivalence class of GT do
   1.1 Apply the Laplace mechanism to each record of GT
   1.2 for e in eps
       1.2.1 calculate distortion to check indistinguishability
   1.3 Apply the exponential mechanism to each record of GT
   1.4 for e in eps
       1.4.1 calculate precision
2. join GT and the differentially private records
3. return (K, L) anonymized table
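A minimal sketch of the idea behind Algorithm 2—adding Laplace noise within each equivalence class of the generalized table—is shown below; the toy table, the ε value and the choice of the per-class value range as the sensitivity are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Generalized table: quasi attributes already k-anonymized; salary is sensitive.
gt = pd.DataFrame({
    "age_band": ["20-30", "20-30", "30-40", "30-40"],
    "education": ["Bachelors", "Bachelors", "Masters", "Masters"],
    "salary": [30000.0, 32000.0, 50000.0, 52000.0],
})

epsilon = 0.5

def perturb(group):
    # Assumed sensitivity: the value range inside the equivalence class.
    sensitivity = group["salary"].max() - group["salary"].min()
    scale = max(sensitivity, 1.0) / epsilon
    group = group.copy()
    group["salary"] += rng.laplace(0.0, scale, size=len(group))
    return group

# Apply the Laplace mechanism per equivalence class (age_band, education).
kl_table = gt.groupby(["age_band", "education"], group_keys=False).apply(perturb)
print(kl_table)
```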

4 Experimental Analysis This section discusses the implementation, the dataset and the experimental results. The proposed (K, L) model is also evaluated through the distortion, precision, and NCP metrics.

4.1 Dataset The (K, L) anonymity model is experimented on the de facto standard Adult dataset [19]. The dataset contains 14 attributes and 48,842 instances. The dataset contains numerical and categorical attributes. Attributes like gender, nationality, etc., are transformed into categorical attributes. Table 1 shows the different types of attributes considered from the dataset.

Table 1 Attribute classification

Attributes                 Age, workclass, weight, education, marital_status, occupation, relationship, race, sex, native, salary, name, and SSN
Sensitive attributes       Salary
Non-sensitive attributes   Relationship, workclass, weight
Quasi attributes           Age, education, marital_status, race, occupation
Identifiers                SSN, name


4.2 Performance Metrics The distortion metric is used to measure the amount of noise added to the dataset and its impact on privacy and utility. Distortion is generally calculated to eliminate the noise from the original dataset, but in this paper the distortion is calculated to regulate the noise to be added to the dataset:

$$\text{Distortion} = \frac{\sum_{h=2}^{n} V_h^2}{V_1^2}$$

Precision is calculated from the results and the ground truth of the records from the original space. Precision is used to measure the distortion: higher precision implies lower distortion. The distortion and precision values are calculated for different values of ε.

$$\text{Prec}(RT) = 1 - \frac{\sum_{i=1}^{N_A}\sum_{j=1}^{N} \dfrac{h}{|DGH_{A_i}|}}{|PT|\,|N_A|}$$

The Normalized Certainty Penalty (NCP) [20] metric is used to calculate the information loss of the proposed model. NCP_{A_Num}(G) calculates the information loss of a numerical attribute, whereas NCP_{A_Cat}(G) calculates the information loss of a categorical attribute. NCP(G) gives the loss over all the quasi attributes. The parameters G, A, u, d and w represent the class, the attribute domain, the lowest common ancestor, the total number of attributes, and the weight, respectively.

$$NCP_{A_{Num}}(G) = \frac{\max_G^{A_{Num}} - \min_G^{A_{Num}}}{\max^{A_{Num}} - \min^{A_{Num}}}$$

$$NCP_{A_{Cat}}(G) = \begin{cases} 0, & \text{card}(u) = 1 \\ \dfrac{\text{card}(u)}{|A_{Cat}|}, & \text{otherwise} \end{cases}$$

$$NCP(G) = \sum_{i=1}^{d} w_i \cdot NCP_{A_i}(G)$$
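The sketch below computes these information-loss quantities for a single anonymization group; the sample values, domain bounds and equal weights are hypothetical and only mirror the three formulas above.

```python
import numpy as np

def ncp_numeric(group_values, domain_min, domain_max):
    # Range of the group divided by the range of the whole attribute domain.
    return (max(group_values) - min(group_values)) / (domain_max - domain_min)

def ncp_categorical(card_u, domain_size):
    # 0 when the group maps to a single leaf value, otherwise the size of the
    # lowest common ancestor's subtree over the size of the categorical domain.
    return 0.0 if card_u == 1 else card_u / domain_size

def ncp_group(per_attribute_ncp, weights=None):
    # Weighted sum over the d quasi attributes (equal weights by default).
    d = len(per_attribute_ncp)
    weights = weights if weights is not None else [1.0 / d] * d
    return float(np.dot(weights, per_attribute_ncp))

# Example: an age group 25-32 within an assumed age domain of 17-90, plus a
# categorical attribute generalized to a 3-leaf ancestor out of 16 values.
print(ncp_group([ncp_numeric([25, 32], 17, 90), ncp_categorical(3, 16)]))
```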

5 Results This section discusses the results obtained from the proposed (K, L) anonymity model. Figure 2 shows the precision and distortion graph for the generalization process.


Fig. 2 Precision and distortion calculation of k-anonymity model

The graph shows that the precision increases with every increase in the k value and becomes stable after k = 50. Similarly, the distortion increases with the k value and becomes stable after k = 50. Figures 3 and 4 show the Laplace noise and exponential noise added to the generalized table. First, the Laplace noise is added based on the ε values 0.5 and 1.


Fig. 3 Laplace noise (ε) versus indistinguishability


Fig. 4 Exponential noise (ε) versus indistinguishability


Table 2 Distortion and precision calculation for differential privacy

Performance calculation   RMSE (ε = 0.5)   RMSE (ε = 1)   Precision (ε = 0.5)   Precision (ε = 0.1)
Laplace mechanism         1.420244724      1.369691936    0.942155642           0.719604531
Exponential mechanism     1.059399751      1.045548452    0.71175               0.71175


Fig. 5 Performance evaluation on NCP

It is noticed that the records are indistinguishable, but the noise lacks uniformity. Later, exponential noise is introduced to the dataset to regularize the noise, and it is found that the records are indistinguishable for both values of ε. Table 2 contains the performance evaluation details of the differential privacy mechanism module used in the proposed model: the Root Mean Squared Error value is taken as the distortion, and the precision values are calculated. The performance evaluation of the proposed model is shown in Fig. 5. The model is compared with other state-of-the-art privacy-preserving techniques against the information loss metric NCP. The techniques compared are multidimensional anonymization [21], (k, k^m) anonymity [22] and the proposed (K, L) anonymity model. It is observed that the proposed model performs better in terms of information loss. The NCP value denotes the amount of information loss during anonymization; the proposed approach outperforms the other two approaches.

6 Conclusion Privacy has become a significant issue recently. Releasing personal information for analysis without removing personally identifiable information can lead to a serious privacy breach. In this paper, a novel (K, L) anonymity model is presented.


This model is developed based on the traditional privacy-preserving approaches of k-anonymity and differential privacy, and it benefits from both approaches to preserve privacy and to provide better data utility. The Laplace and exponential mechanisms are utilized to add differential noise to the data. Experimental analysis shows that the proposed system provides better data utility compared to other approaches. Through the differential privacy mechanism, the proposed model is capable of resisting reidentification and linkage attacks. It is believed this model will enable data publication without privacy breach and with less information loss.

References

1. L. Sweeney, k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(05), 557–570 (2002)
2. L. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 571–588 (2002)
3. A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, L-diversity: privacy beyond k-anonymity, in 22nd International Conference on Data Engineering (ICDE’06) (2006), p. 24
4. N. Li, T. Li, S. Venkatasubramanian, t-Closeness: privacy beyond k-anonymity and l-diversity, in Proceedings—International Conference on Data Engineering (2007), pp. 106–115
5. Y. Sei, H. Okumura, T. Takenouchi, A. Ohsuga, Anonymization of sensitive quasi-identifiers for l-diversity and t-closeness. IEEE Trans. Dependable Secur. Comput. 16(4), 580–593 (2019)
6. R.C.-W. Wong, J. Li, A.W.-C. Fu, K. Wang, (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing, in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 Aug 2006, pp. 754–759
7. T.M. Truta, B. Vinay, Privacy protection: p-sensitive k-anonymity property, in ICDEW 2006—Proceedings of the 22nd International Conference on Data Engineering Workshops (2006)
8. X. Sun, L. Sun, H. Wang, Extended k-anonymity models against sensitive attribute disclosure. Comput. Commun. 34(4), 526–535 (2011)
9. J. Andrew, J. Karthikeyan, J. Jebastin, Privacy preserving big data publication on cloud using Mondrian anonymization techniques and deep neural networks, in 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS) (2019), pp. 722–727
10. N. Holohan, S. Antonatos, S. Braghin, P. Mac Aonghusa, k-anonymity with epsilon-differential privacy
11. J. Soria-Comas, J. Domingo-Ferrer, D. Sánchez, S. Martínez, Enhancing data utility in differential privacy via microaggregation-based k-anonymity. VLDB J. 23(5), 771–794 (2014)
12. J. Soria-Comas, J. Domingo-Ferrer, Differentially private data sets based on microaggregation and record perturbation, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10571 LNAI (2017), pp. 119–131
13. B.C. Leal, I.C. Vidal, F.T. Brito, J.S. Nobre, J.C. Machado, δ-DOCA: achieving privacy in data streams, in Data Privacy Management, Cryptocurrencies and Blockchain Technology (2018), pp. 279–295
14. N. Li, W.H. Qardaji, D. Su, Provably private data anonymization: or, k-anonymity meets differential privacy. CoRR, abs/1101.2604, 49, 55 (2011)
15. O. Abul, F. Bonchi, M. Nanni, Never walk alone: uncertainty for anonymity in moving objects databases, in Proceedings—International Conference on Data Engineering (2008), pp. 376–385


16. R. Trujillo-Rasua, J. Domingo-Ferrer, On the privacy offered by (k, δ)-anonymity. Inf. Syst. 38(4), 491–494 (2013)
17. B.C. Tai, S.C. Li, Y. Huang, K-aggregation: improving accuracy for differential privacy synthetic dataset by utilizing k-anonymity algorithm, in Proceedings—International Conference on Advanced Information Networking and Applications, AINA (2017), pp. 772–779
18. C. Dwork, Differential privacy, in Proceedings of 33rd International Colloquium on Automata, Languages and Programming (2006), pp. 1–12
19. UCI Machine Learning Repository: Adult Data Set. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/adult. Accessed 02 Mar 2019
20. G. Ghinita, P. Karras, P. Kalnis, N. Mamoulis, A framework for efficient data anonymization under privacy and accuracy constraints. ACM Trans. Database Syst. 34(2) (2009)
21. T. Takahashi, K. Sobataka, T. Takenouchi, Y. Toyoda, T. Mori, T. Kohro, Top-down itemset recoding for releasing private complex data, in 2013 11th Annual Conference on Privacy, Security and Trust, PST 2013 (2013), pp. 373–376
22. G. Poulis, G. Loukides, A. Gkoulalas-Divanis, S. Skiadopoulos, Anonymizing data with relational and transaction attributes, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8190 LNAI, no. PART 3 (2013), pp. 353–369

Comparative Analysis of the Efficacy of the EEG-Based Machine Learning Method for the Screening and Diagnosing of Alcohol Use Disorder (AUD) Susma Grace Varghese, Oshin R. Jacob, P. Subha Hency Jose, and R. Jegan

Abstract Identification of alcohol use disorder (AUD) in patients has always been strenuous owing to several prejudice involved in the diagnosis method. Unbiased and infallible methods would help to make the assessment of AUD much more reliable and efficient. In this paper, electroencephalography (EEG) derived features are analyzed and machine learning is performed on the data to automatically identify AUD patients. The EEG data were preprocessed and converted into frequency domain and further, the EEG standard bandwidths were extracted. The most discerning features were extracted using a feature selection technique that categorizes based on rank. A compact set of most discriminant features was extracted and used for identifying patients with AUD. This work could assist in automatically detecting alcoholism from EEG data in a reliable and objective manner. Two machine learning methods have been compared and it is observed that random forest classifier gives an accuracy of 64.5% and decision tree classifier gives an accuracy of 70.96%. Analyzing the results, it is concluded that using this method, Delta and Alpha power could be used as the objective parameters in brain–computer interface in identifying an AUD patient. Keywords Decision tree classifier · Alcohol use disorder · Brain signals · Alcohol abuse · Electroencephalography · EEG · Machine learning · Random forest algorithm

S. G. Varghese (B) Department of Biomedical Instrumentation, Karunya Institute of Technology and Sciences, Coimbatore, India e-mail: [email protected] P. S. H. Jose Department on Instrumentation Engineering, Karunya Institute of Technology and Sciences, Coimbatore, India e-mail: [email protected] R. Jegan Purdue University, West Lafayette, IN, USA e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_8


1 Introduction Alcoholism is a burning issue globally and it tops the list of addictions in several countries. AUD is the condition of heavy alcohol consumption. As per the National Institute on Alcohol Abuse and Alcoholism (NIAAA), approximately 17 million adults above 18 years of age in the USA suffered from AUD in the year 2012 [1]. Alcohol consumption below 48 grams per day is deemed to be the safe limit [2]. Alcohol intake above the safe limit eventually leads to three stages, namely alcohol dependency (AD), alcohol use disorder (AUD), and alcohol abuse (AA); alcohol abuse leads to alcohol dependency, which is the severe form. The human body, especially the brain, suffers extensively due to alcohol abuse. The brain holds several systems that cumulatively support its overall functioning. Neurons act as the communication pathway between all these systems, and neurotransmitters act as the message carriers between these neurons. Overall cognition and mood depend upon the neurotransmitters, whose communication speed is regulated by the healthy brain. Alcohol delays the neurotransmitters, which delays responses and in turn hampers the efficiency of cognitive functions. The alcohol use disorder identification test (AUDIT) is the traditional questionnaire-based method to screen AUD patients [3]. However, this method is highly biased, as most AUD patients lack the ability to fairly answer and judge their alcohol intake, which in turn could lead to incorrect assessment. Electroencephalography is the most common technique adopted to measure the electrical activity of the brain. EEG records the electrical activities of the brain and measures its functional state. EEG data has been used successfully by experts to diagnose several disorders and conditions of the brain. However, trying to analyze raw EEG to identify AUD is strenuous and unreliable; an automatic classification system to identify AUD patients would be a game changer in this regard. Literature reveals that resting-state EEG (REEG) data shows drastic differences in neuronal activity among various regions of the brain between AUD patients and healthy subjects. AUD is closely associated with the EEG power bands Alpha, Beta, Theta, Delta, and Gamma. Medication and a family history of alcohol abuse lead to higher Beta-band power [4]. In a study conducted on people with a family history of alcohol abuse, Alpha voltage was seen to be very low in alcoholics [5]. Literature also shows that in the case of heavy drinkers the left hemisphere shows low synchronization of Alpha and Beta waves [6]. Several machine learning (ML) techniques have displayed reliable results [7, 8] in venturing into clinical applications of AUD screening. Implementing this in the clinical setup requires more optimization and proof that EEG could be used as a robust parameter in AUD patient classification. The frequency range of spontaneous activity in the brain spans from 0.5 to 30 Hz and is divided as follows, as per an international convention [9]. Table 1 shows the frequency range of different brain signals and the corresponding characteristics. This study aims to find the most relevant differences in brain signals between AUD patients and healthy subjects and also to put forth a simpler ML-based method compared to the existing ones in the literature.

Table 1 Brain signals and corresponding characteristics

Brain wave   Frequency range (Hz)   State
Gamma        >30                    Profound attention
Beta         12.5–30                Attention, anxiety
Alpha        7.5–12.5               Comfortable and awake with closed eyes
Theta        3.5–7.5                Sleep, dreaming
Delta        0.5–3.5                Profound sleep

used as the parameters. The proposed ML method involves preprocessing, feature extraction, dimensionality reduction, split of train and test data, and finally, the result is validated using confusion matrix.

2 Method

2.1 Dataset

The dataset used here is from a study that deals with the connection of genetic susceptibility to alcoholism. A 64-electrode EEG sampled at 256 Hz was used to procure the data. A total of 122 subjects is divided into alcoholic and control groups. Every subject underwent 120 trials, with both visual and verbal stimulation of varying duration. The data is from the Henri Begleiter Neurodynamics Laboratory of the State University of New York, New York. The dataset is available on Kaggle and poses no usage constraint. A detailed description of the dataset is found in the literature [10]. A standard 10–20 system was used to locate the electrodes; the positioning and spacing of the electrodes are detailed in the literature [11]. The subjects were exposed to several stimuli in order to gauge brain activity. The dataset uses the Snodgrass and Vanderwart image sets, and either a single stimulus or two stimuli were shown to the participants: subjects were shown images back to back and had to register whether the first image was identical to the second, and the activity of the brain during these stimuli was recorded.

2.2 Noise Removal

A major issue with EEG datasets is the presence of noise and artifacts. The scalp electrodes intended to capture optimum brain signals also end up fetching unwanted noise from blinking of the eyes, the heartbeat, and muscle movement. Hence, it is imperative to clean the dataset before we proceed. Here, Bell and Sejnowski's linear decomposition method is used to eradicate noise from the EEG data. Independent


component analysis (ICA) lets us select the independent components in the data. EEG data are linear and independent, since unwanted noise such as that from blinking of the eyes is not coupled to the source. EEG data can also be modeled by numerical simulation and thus meet all the criteria for implementing ICA [12]. Hence, ICA is used to clean the data before preprocessing.
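The paper does not name a specific implementation; a minimal sketch of ICA-based artifact suppression, assuming the recording is held as a NumPy array of shape (samples, channels) and using scikit-learn's FastICA (the component indices treated as artifacts are illustrative assumptions), might look as follows:

# Minimal ICA artifact-removal sketch; FastICA and the artifact indices are assumptions.
import numpy as np
from sklearn.decomposition import FastICA

def remove_artifact_components(eeg, n_components=20, artifact_indices=(0,)):
    """Decompose the EEG into independent components, zero out those judged
    to be artifacts (e.g., eye blinks), and reconstruct the cleaned signal."""
    ica = FastICA(n_components=n_components, random_state=0)
    sources = ica.fit_transform(eeg)           # shape: (n_samples, n_components)
    for idx in artifact_indices:
        sources[:, idx] = 0.0                  # suppress the selected components
    return ica.inverse_transform(sources)      # back to (n_samples, n_channels)

# Synthetic stand-in for a 64-channel recording at 256 Hz
eeg = np.random.randn(256 * 10, 64)
cleaned = remove_artifact_components(eeg, n_components=20, artifact_indices=(0, 1))
print(cleaned.shape)

In practice the artifact components would be chosen by inspecting their topographies and time courses rather than by fixed indices.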

2.3 Preprocessing

2.3.1 Converting from Time to Frequency Domain

After cleaning the dataset using ICA, the time-domain signals must be converted into the frequency domain. Here, it is achieved using the fast Fourier transform. The brain signals are extremely complex, involve multiple frequency bands and phases, and a single signal in itself contains several frequencies. It is imperative to convert to the frequency domain so that we can extract the relevant frequencies. The Fourier transform of a function f(x) is the function F(ω),

F(\omega) = \int_{-\infty}^{\infty} f(x)\, e^{-i\omega x}\, dx

The FFT is a quicker algorithm for computing the DFT, as it reduces the computations essential for N points from 2N^2 to 2N \log_2 N.

2.3.2 Characterizing Frequency Bands

Power spectral density (PSD) is used to characterize the brain signals into the desired frequency ranges. This aids in eliminating irrelevant frequencies and establishes the frequency range contained in a particular wave. PSD lets us find out the presence of the frequency bands characterized as Alpha, Beta, Theta, Delta, and Gamma in the brain signal. As per the formula, the magnitude squared of the cross-spectrum of two EEG electrodes is calculated and then divided by the product of the PSDs of the individual signals:

C_{xy}(f) = |S_{xy}(f)|^2 / (S_x(f) \, S_y(f))

Here, f is the frequency, S_y is the power spectral density of y, S_x is the power spectral density of x, and S_{xy} denotes the cross-spectral density of the two EEG electrodes.


The value thus obtained from the area under the curve is then averaged to get the EEG bandwidths for each subject for all 64 electrodes.
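As a concrete illustration of the band-power step, a minimal sketch using Welch's PSD estimate is given below; the single-channel input, the sampling rate of 256 Hz, the helper names, and the use of SciPy are assumptions, with the band limits taken from Table 1:

# Minimal band-power sketch per channel; Welch's method and names are assumptions.
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0.5, 3.5), "theta": (3.5, 7.5), "alpha": (7.5, 12.5),
         "beta": (12.5, 30.0), "gamma": (30.0, 45.0)}

def band_powers(signal, fs=256):
    freqs, psd = welch(signal, fs=fs, nperseg=fs * 2)     # PSD estimate
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = np.trapz(psd[mask], freqs[mask])    # area under the PSD curve
    return powers

channel = np.random.randn(256 * 60)    # one minute of synthetic data as a stand-in
print(band_powers(channel))

Repeating this per electrode and averaging, as described above, yields one band-power feature vector per subject.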

2.4 Feature Selection

Feature selection is performed to boost the accuracy and to negate the possibility of overfitting. Here, we have used correlation and backward elimination to select the most relevant features to pass through the classifier. Features with high correlation are said to be linearly dependent and have the same effect on the dependent variable; this lets us eradicate one of the variables when a pair shows high correlation. The formula is

r_{xy} = Cov(x, y) / (s_x s_y)

where r is the correlation coefficient, Cov is the covariance, and s_x and s_y are the standard deviations of x and y. Backward elimination is one of the significant data mining tasks: it takes the least relevant feature and eradicates it at each iteration, eventually leaving only the features that are highly relevant. The iteration is terminated based on a conditioned stopping criterion. The two highly relevant features identified by the model are used further as the parameters in screening AUD patients.
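A minimal sketch of this two-step selection is given below, assuming the band powers are held in a pandas DataFrame X with a binary label y; the 0.9 correlation cut-off and the use of scikit-learn's RFE as a stand-in for backward elimination are assumptions:

# Correlation filter followed by recursive (backward) elimination; thresholds assumed.
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def drop_correlated(X: pd.DataFrame, threshold=0.9) -> pd.DataFrame:
    corr = X.corr().abs()
    cols = corr.columns
    to_drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])          # drop one feature of each correlated pair
    return X.drop(columns=sorted(to_drop))

def backward_select(X: pd.DataFrame, y, n_features=2):
    selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=n_features)
    selector.fit(X, y)
    return list(X.columns[selector.support_])  # the surviving feature names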

2.5 Classification

Here, classification is carried out after selecting the most prominent features. We have used the random forest and decision tree classifiers and split the data roughly 20–80% between the test and train stages. The two classifiers used here are essentially similar, and we aim to identify the one that gives greater accuracy. A decision tree is built upon the whole dataset and uses every feature and variable of importance. Random forest, on the other hand, randomly selects subsets of features to create several decision trees and later averages the results obtained. The random forest process is defined in the literature [13]: it draws n-tree bootstrap samples from the original data; for each sample, an unpruned regression or classification tree is grown, and at each node, instead of selecting the best split among all predictors, a random sample of the predictors is taken and the best split from those variables is selected. Later, the predictions of the n-tree trees are combined in order to predict new data. The decision tree algorithm aims to build a training model which can be used to predict variables by learning decision rules inferred from the training data. The root of the decision tree holds the finest attribute, and upon splitting the data, it should be ensured that each subset holds data with similar attribute values.


These two classifiers were used on the selected features, and their accuracies have been compared.
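A minimal sketch of this comparison is shown below; the synthetic placeholder data, the split seed, and the hyper-parameters are assumptions, and in practice X and y would be the selected band-power features and the alcoholic/control labels:

# Train and compare random forest and decision tree on an 80-20 split (sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder standing in for the selected EEG band-power features of 122 subjects
X, y = make_classification(n_samples=122, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [("Random forest", RandomForestClassifier(n_estimators=100, random_state=42)),
                    ("Decision tree", DecisionTreeClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name, accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))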

2.6 Validation

A confusion matrix is used to obtain an overview of the prediction results of the model. It measures the performance of a machine learning classifier whose output is two or more classes, giving a table with the four combinations of actual and predicted values. The output gives us the accuracy value as well as the counts of true positives, true negatives, false positives, and false negatives. Confusion matrix values could further be used in the calculation of accuracy, precision, recall, and specificity; however, we only use the accuracy value for the screening of AUD patients.

3 Results

Based on the selected features of the EEG power bands, the random forest classifier (RFC) and decision tree classifier (DTC) were trained on 97 participants and tested on 25 participants. The results post classification are depicted in Tables 2 and 3. The accuracy score from the confusion matrix of the random forest classifier for our model is 64.51%. Figure 1 shows the graphical representation of the test results from the random forest classifier. The decision tree shows a higher accuracy of 70.96%. Figure 2 shows the graphical representation of the test results from the decision tree classifier. The frequency bands 0.5–3.5 Hz and 3.5–7.5 Hz show the highest correlation in the model, i.e., the Delta and Theta bands show the highest correlation. Backward elimination extracted the bandwidths 7.5–12.5 Hz and 0.5–3.5 Hz as the two most prominent features. Hence,

Table 2 RFC confusion matrix

Predicted value     Actual value
                    Alcoholic    Nonalcoholic
Alcoholic           16           6
Non alcoholic       5            4

Table 3 DTC confusion matrix

Predicted value     Actual value
                    Alcoholic    Nonalcoholic
Alcoholic           16           6
Non alcoholic       3            6


Fig. 1 Test result of RFC

Fig. 2 Test result of DTC

Alpha and Delta waves were used by the model in the classification of train and test data.

4 Conclusion

This work demonstrates that machine learning methods could be reliably used in the screening and diagnosis of patients with AUD. This machine learning method reduces human error and the error in judgment by the patients themselves while answering AUDIT. The results could be improved further by dividing the EEG frequency bands into smaller subdivisions. Furthermore, in the future, we will implement more classifiers and also perform k-fold cross-validation and grid search to boost the accuracy.


If even the feeblest of frequencies could be picked up with EEG electrodes, then it could be possible to ascertain the locations of the brain that are majorly affected and hence predict the damage in the long run. This stream of work could be extended to other realms of EEG diagnosis. Dyslexia and pain analysis will be the two areas focused on for further work based on this model.

References

1. Alcoholism NIAAA, Alcohol use disorder (2012). http://www.niaaa.nih.gov/alcohol-health/overview-alcohol-consumption/alcoholuse-disorders
2. O.A. Parsons, S.J. Nixon, Cognitive functioning in sober social drinkers: A review of the research since 1986. J. Stud. Alcohol 59(2), 180–190 (1998)
3. S.A. Maisto, R. Saitz, Alcohol use disorders: Screening and diagnosis. Am. J. Addict. 12(s1), s12–s25 (2003)
4. P. Coutin-Churchman, R. Moreno, Y. Añez, F. Vergara, Clinical correlates of quantitative EEG alterations in alcoholic patients. Clin. Neurophysiol. 117, 740–751 (2006)
5. C.L. Ehlers, E. Phillips, M.A. Schuckit, EEG alpha variants and alpha power in Hispanic American and white non-Hispanic American young adults with a family history of alcohol dependence. Alcohol 33(2), 99–106 (2004)
6. E.A. de Bruin, C.J. Stam, S. Bijl, M.N. Verbaten, J.L. Kenemans, Moderate-to-heavy alcohol intake is associated with differences in synchronization of brain activity during rest and mental rehearsal. Int. J. Psychophysiol. 60(3), 304–314 (2006)
7. V. Bajaj, Y. Guo, A. Sengur, S. Siuly, O.F. Alcin, A hybrid method based on time–frequency images for classification of alcohol and control EEG signals. Neural Comput. Appl. 27, 1–7 (2016)
8. W. Mumtaz, P.L. Vuong, L. Xia, A.S. Malik, R.B.A. Rashid, Automatic diagnosis of alcohol use disorder using EEG features. Knowl. Based Syst. 105, 48–59 (2016)
9. M. Teplan, Fundamentals of EEG measurement. IEEE Meas. Sci. Rev. 2, 1–11 (2002)
10. X.L. Zhang, H. Begleiter, B. Porjesz, W. Wang, A. Litke, Event related potentials during object recognition tasks. Brain Res. Bull. 38(6), 531–538 (1995)
11. G. Deuschl, A. Eisen, Recommendations for the practice of clinical neurophysiology: Guidelines of the international federation of clinical neurophysiology (Elsevier, Amsterdam, 1999)
12. T. Nakada, Integrated human brain science: Theory, method, applications. Tech. Rep. INC9606 (Institute for Neural Computation, San Diego, CA, 1996)
13. A. Liaw, M. Wiener, Classification and regression by random forest. R News 2(2/3), 18–22 (2002)

Smart Solution for Waste Management: A Coherent Framework Based on IoT and Big Data Analytics E. Grace Mary Kanaga and Lidiya Rachel Jacob

Abstract It is expected that 70% of the world's population, over six billion people, will live in cities and surrounding regions by 2050. The dearth of efficient waste management systems is one of the major problems of society, and there is an urgent need to address it. Waste management is important mainly because scrapped waste can cause health, safety, economic, and environmental problems. IoT-based companies and governments are taking up new inventive steps for better waste management. In this paper, we propose an efficient framework for smart waste management (SWM) based on IoT using Big Data Analytics. The proposed framework involves several steps that start with the data generated from IoT-based smart dustbins, aggregation and cleaning of the data, and generation of an optimized schedule for waste collection based on the sensor data. The proposed system will be implemented using the Hadoop framework and IBM InfoSphere Streams. In the Indian context, most researchers and industries are working with IoT-based smart trash bins, but such a full-featured system does not exist. Since a huge volume of data will be generated from the sensors, handling these data with traditional techniques will not be efficient, and Big Data Analytics is essential in this context. Hence, the proposed system will be more scalable and efficient.

Keywords Smart waste management · Smart bins · IoT · Big Data Analytics · Machine learning · Optimized schedule

1 Introduction

A smart city can be defined as a city with basic infrastructure to provide a clean and sustainable environment through smart solution applications. Marques et al. [1] have

E. Grace Mary Kanaga (B) · L. R. Jacob Karunya Institute of Technology and Sciences, Karunya Nagar, Coimbatore, Tamil Nadu 641114, India e-mail: [email protected]
L. R. Jacob e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_9


proposed a multi-level IoT-based smart city architecture, with waste management taken as a case study; they have dealt with managing both indoor and outdoor bins. 'Swachh Bharat Abhiyan' is India's biggest cleanliness campaign and aims to accomplish the vision of 'Clean India' [2]. The budding smart cities in India are also giving prime importance to smart waste management as the need of the day. Smart waste management systems using IoT as well as Big Data Analytics can be implemented in order to achieve the target of a clean and smart India [3].

1.1 Waste Management

There are mainly two aspects of dealing with waste. The first is waste minimization, which involves various steps for reducing the waste that is produced. The second is the effective management of the waste that is produced. Here, the focus is on processing waste after it is created rather than on waste minimization. Esmaeilian et al. [4] have reviewed the future of waste management in smart cities and also discussed the success factors for implementing smart waste management [5]. Specifically, the problem of electrical and electronic waste management is addressed by Gu et al. [6].

1.2 High Priority Regions Regions with high urban priorities can be defined as areas that are mostly affected by waste, especially when the collection is not done properly. This includes places with specific facilities such as schools, hospitals, fuel stations, and factories [7]. Trash boxes placed in such areas are called trash boxes with high priority. Hence, high importance is given to high priority regions during the collection and processing of waste, using the smart waste management system [8].

2 Motivation

Uttar Pradesh generates 19,180 TPD of waste and tops the list of waste-generating states, while the total waste generated across all states and union territories in India is 133,760 tonnes per day (TPD). Maharashtra (17,000) and Tamil Nadu (14,532) rank next in waste generation. As per the survey posted by Chaitanya Mallapur, of the total waste generated, only 91,152 TPD was collected and only 25,884 TPD was processed. If this problem is not dealt with efficiently through better policies and practices for waste management, the total waste generation will be 165 million tons by 2031, and by 2050 it is projected to be 436 million tons [2]. The main


issue is that the disposal of garbage in an unhygienic way leads to health problems and environmental degradation.

2.1 Challenges in Waste Management in India

Apart from the above-mentioned ones, many more challenges are faced in India in all the phases of waste management [8, 9]. The phases of waste management are collection, segregation, transportation, processing, and disposal. In India, smart waste management activities mostly concentrate on waste disposal as well as waste treatment and processing, whereas lesser attention is given to the collection, segregation, and transportation phases [10]. But addressing the below challenges at the initial phases itself would largely simplify the efforts and cost incurred in the later stages:

• Collecting waste has been done using fixed routes and schedules that require a lot of manual planning. Containers are collected on a set schedule whether they are full or not. This causes unnecessary costs, poor equipment utilization, wear and tear on the roads, and excessive emissions.
• Waste bins in high-traffic areas get filled up more quickly than the other ones.
• Sudden temperature changes in waste bins can cause hazards like fire.
• The waste container can be moved out of the assigned area or tipped over.
• The waste should not be collected too early, so that resources are not wasted, but also not too late, when unsightly over-filling can occur.
• Proper segregation of waste is required for proper treatment of waste.

As a solution to these problems, we propose a smart waste management system based on the Internet of Things using Big Data Analytics. This approach would address the waste management challenges in the collection, segregation, and transportation phases.

3 Existing Techniques

Some of the existing waste management techniques across the world that concentrate on smart collection and transportation are summarized in Table 1. However, these systems are still not fully equipped with smart collection, segregation, and transportation techniques. Our proposal for a smart waste management system individually addresses the challenges in each of these phases.


Table 1 Existing waste management techniques

Existing smart solutions      Sensors used                                                  Network connection                                        Intelligence
Enevo [11]                    Ultrasonic sensors, acceleration sensor, temperature sensor   E-sim card with 3G connectivity (power: Li-ion battery)   Easy to use Web services, analytics and reports
U-dump M2M [12]               Ultrasonic sensors, temperature sensor                        Sim card and cellular network                             Cloud services
TS wasTe [13]                 Ultrasonic sensor, temperature sensor                         GPRS, Sigfox, WiFi or ZigBee                              Web user interface—device config, statistics records, alarms, and user management
Corio waste management [14]   Ultrasonic level sensors—by SmartBins                         Wireless technology                                       OnePlus universal monitoring system

4 The Proposed Framework Based on the needs of smart waste management, we propose an efficient framework to analyze IoT Big Data to establish smart waste management. The basic functionalities required in such a framework are sensing, monitoring, control, storage, and backup. An efficient framework for analyzing the huge amount of data generated by the sensors in the IoT-based smart dust bins using Big Data Analytics is proposed in this work. The main intelligence in the proposed smart waste management system is to harness the power of Big Data Analytics to provide solution for the waste management and to generate the optimal schedule for waste collection. The stakeholders communicate with each other or with the central system through an easy to use Web interface, from where the reports and statistics of the waste collection can also be obtained. Predictions regarding the management can also be done via analytics. The prioritization of areas for waste collection adds to the efficiency in the implementation of the system. The system architecture is defined in Fig. 1. Each component of the architecture is described as follows.

4.1 Data Generation and Collection This is the first phase of the waste management system. It involves the waste bins in which initially the waste is deposited. From there, the waste collection trucks help in transporting the waste across to local waste collection points. The ultrasonic sensors are attached in the waste bins to sense the level of garbage in the bin, and it alerts


Fig. 1 Smart waste management proposed framework

the corresponding stakeholders when the threshold is reached. It enables the smart trucks to collect the waste from such bins immediately.
Collection trucks: Based on the bin level, an optimized schedule will be generated, and this real-time schedule will be communicated to the drivers of the collection trucks. Hence, the waste collection truck follows an optimum path to collect the waste from the bins, which is then transported for further processing.
Local collection points: This is where the waste collected from the local smart bins is stored for further processing. Any unsegregated or mixed waste that arrives at the collection point is further segregated.

4.2 Communicator

A high-speed wireless transmission medium is used to transmit the data generated by the IoT devices to the central processing system. The most commonly used communication technologies are WiFi, 3G, LTE, and WiMAX.


4.3 Data Management and Processing This is performed by the Big Data Analytics platform. Hadoop ecosystem may be employed for implementing the data management and processing system for smart waste management. It receives the data generated by the sources and processes them according to the rules and criteria predefined. It helps in visualizing the data and in real-time analysis and produces optimized schedules according to which the trucks should collect the waste from the bins. The schedule is generated, following the specified algorithm. It also keeps track the route and location of trucks as well as bins, and dynamically generates optimized routes, which are a combination of the shortest distance as well as real-time location information of the truck.

4.4 Data Interpretation

The generated and analyzed data is interpreted to be used in the various stages of waste management for decision-making purposes, as well as for viewing and for the communication of the stakeholders involved with the system. The stakeholders involved in smart waste management are [15] the city administrators, the waste truck drivers, sanitation specialists, managers of dumps and recycling factories, and the citizens. The entire smart waste management system assists the waste management procedures in every phase. The smart trucks allow the dynamic routing and transportation of waste. The waste is sent for processing from the local collection points. The priority-based waste collection involves the immediate response to alerts related to waste bins located in high-priority areas. The bins are assigned a certain priority based on the regions in which they are located. We propose a model for the management of the bins based on their priority as well.

Let bi denote a waste bin, where i ranges over 1, 2 … n and n is the number of bins. Tj represents each truck, where j ranges over 1, 2 … m, m being the number of trucks. θb denotes the capacity threshold of the bins, which is measured in the form of levels, and θt denotes the maximum capacity load of a truck. Each truck is assigned priority levels from low to high. When a bin reaches its capacity threshold, an alert is generated for that bin. According to the alerts, a route r is generated for waste collection from the bins, based on the priority of the bins. If there is more than one bin with the same priority on the route, then routing is done based on the distance of the bin from the truck. This route, composed of the order in which the bins are emptied, is specified by the function routing(tj), where tj denotes the truck. The algorithm is depicted in Fig. 2. The input of the algorithm is the set of smart bins (bi) and trucks (tj). The output is the route r for each truck devoted to the collection of bins. A route r is the sequence of bins that a particular truck visits.

Input: bi, tj, n, m                 // bins, trucks, no. of bins, no. of trucks
Output: rtj                         // route of each truck, composed of the order in which waste bins are emptied

If (cbi > θb) then                  // the capacity threshold θb is reached for bin bi; it is almost full
    ai ← alert(bi)                  // an alert is generated by the bin bi
EndIf
rtj ← routing(bi)                   // Routing: get the route of truck tj for the bin i which generated the alert
vj ← visited(rtj)                   // store the emptied bins for the truck tj
If (ak = true) and (priority(bk) = high) then    // Scheduling: an alert ak is generated by the bin bk, which is a high-priority bin
    f ← nearest(bk, tj)             // find the truck f nearest to the bin
    If (cf < θt) then               // the capacity threshold θt of the truck is not reached for truck f
        rf ← routing(init(bk), rf − vf)   // Routing: truck f collects waste from the (rf − vf) bins, including the bin bk
    EndIf
EndIf
return r

Fig. 2 Scheduling algorithm

The algorithm employs a routing function routing(), which is used to generate the initial route for visiting the bins that generated alerts. The visited() function returns the bins that have been visited by truck tj according to the route r, and is depicted by vj. When a new bin becomes full during the collection process, the algorithm excludes the visited bins and performs the rerouting process starting over for the remaining bins [depicted by (r − v)]. The algorithm also incorporates a nearest() function, which supports the routing process by selecting the truck whose location is nearest to the bin that generated an alert. An init() function for initialization has also been incorporated in the algorithm. This model minimizes the effort of each truck, accompanied by an on-time service for high-priority bins. The algorithm for routing() is represented in Fig. 3. The input of the algorithm is each truck tj, and it outputs the route r for the truck. Combining the bins bi which generated alerts, a route is generated depending on their priority. The function sort_priority() sorts the bins according to their priority, and bins with the same priority are further sorted according to their distance from the truck.
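To make the scheduling and routing logic of Figs. 2 and 3 concrete, a minimal Python sketch is given below; the class names, fields, fill levels, and the Euclidean distance measure are illustrative assumptions rather than the paper's implementation:

# Priority- then distance-based routing of alert-generating bins (sketch; names assumed).
from dataclasses import dataclass
from math import hypot

@dataclass
class Bin:
    bin_id: int
    location: tuple          # (x, y)
    level: float             # current fill level
    threshold: float         # capacity threshold (theta_b)
    priority: int            # higher value = higher priority

    def alert(self) -> bool:
        return self.level >= self.threshold

@dataclass
class Truck:
    truck_id: int
    location: tuple
    load: float = 0.0
    capacity: float = 100.0  # capacity threshold (theta_t)

def routing(truck: Truck, bins):
    """Order the alert-generating bins by priority, then by distance to the truck."""
    alerted = [b for b in bins if b.alert()]
    return sorted(alerted,
                  key=lambda b: (-b.priority,
                                 hypot(b.location[0] - truck.location[0],
                                       b.location[1] - truck.location[1])))

bins = [Bin(1, (0, 2), 80, 75, priority=2), Bin(2, (5, 5), 90, 75, priority=1),
        Bin(3, (1, 1), 95, 75, priority=2), Bin(4, (3, 0), 40, 75, priority=3)]
truck = Truck(1, (0, 0))
print([b.bin_id for b in routing(truck, bins)])   # [3, 1, 2]; bin 4 has not raised an alert

Rerouting after each collection round would simply call routing() again on the bins not yet visited, mirroring the (r − v) step of Fig. 2.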

5 Simulation Details The proposed framework has been simulated using IBM Bluemix cloud. Bluemix is a cloud platform as a service (PaaS), which is developed by IBM. It supports several


Routing()
Input: tj                           // truck tj
Output: r                           // route of truck tj, composed of the order in which waste bins are emptied

For each bin bi where alert(bi) do          // for each bin bi which generated an alert
    r ← sort_priority(bi)                   // the bins are sorted according to their priority
    If (count(bins with same priority) > 1) then   // if more than one bin has the same priority
        r ← sort_dist(bi)                   // those bins are sorted according to their distance from the truck
    EndIf
EndFor
return r

Fig. 3 Priority-based routing algorithm

programming languages and services, using which we can build, deploy, manage, and run applications on the cloud. The services employed to simulate the framework [19] are the Watson IoT platform, Cloudant NoSQL DB, Availability Monitoring, Node-RED, and the IBM SDK for Node.js. The flow of data in the application through these services is represented in Fig. 4. The smart bins have been simulated in the Watson IoT platform. They were named BD and NBD for biodegradable and non-biodegradable bins, respectively, were given unique identification numbers and level metrics, and were also provided with an authentication facility for security. The level metric specifies the filling level of the bin and ranges from 0 to 3, representing the immediate collection requirement of the bin from low to high.

Fig. 4 Simulation framework


The bins were also provided with priority specification, representing the priority of the region in which the bins belong to. The proposed work has been successfully simulated, and this real-time information from these devices were received and stored in the IBM cloud dashboard. The regions were specified for the application, within which the devices were connected to the dashboard. Various cards were added for usage, devices, and analytics. The device information was successfully simulated dynamically and visualized using real-time charts, bar charts, donut charts, and gauges. The work for further real-time analytics of the generated sensor information and implementation of the scheduling and routing algorithm is in progress. The proposed scheduling algorithm will be implemented using the data collected from the bins, and a dynamic routing will be generated. This will be implemented on a Hadoop framework to further increase the efficiency.

6 Conclusion and Future Scope The automation of systems will definitely solve the shortcomings involved in manual labor and provide an efficient and accurate system. The use of cutting-edge technologies also opens an arena for a larger scope of improvement and enhancements in the future as well. The smart waste management system also comes with certain drawbacks and challenges which are yet to be addressed. One such unaddressed issue includes the physical security of the bins and trucks and any damage to the truck. Another challenge is about how to tackle any unforeseen events and accidents on the road where the truck is being routed. These issues can be addressed by providing additional security features to the physical components, as well as by providing surveillance mechanisms for the same.

References

1. P. Marques, D. Manfroi, E. Deitos, J. Cegoni, R. Castilhos, J. Rochol, E. Pignaton, R. Kunst, An IoT-based smart cities infrastructure architecture applied to a waste management scenario. Ad Hoc Netw. 87, 200–208 (2019)
2. S.K. Ghosh, Swachhaa Bharat Mission (SBM)—a paradigm shift in waste management and cleanliness in India, in International Conference on Solid Waste Management, 5IconSWM 2015. Proc. Environ. Sci. 35, 15–27 (2016)
3. D. Vij, Urbanization and solid waste management in India: present practices and future challenges, in International Conference on Emerging Economies—Prospects and Challenges (ICEE-2012). Proc. Soc. Behav. Sci. 37, 437–447 (2012)
4. B. Esmaeilian, B. Wang, K. Lewis, F. Duarte, C. Ratti, S. Behdad, The future of waste management in smart and sustainable cities: a review and concept paper. Waste Manag. 81, 177–195 (2018)
5. A. Bashir, S.A.B. Ab, R. Khan, M. Shafi, Concept, design, and implementation of automatic waste management system. Int. J. Recent Innov. Trends Comput. Commun. 1, 604–609 (2013)
6. F. Gu, B. Ma, J. Guo, P.A. Summers, P. Hall, Internet of Things and Big Data as potential solutions to the problems in waste electrical and electronic equipment management: an exploratory study. Waste Manag. 68, 434–448 (2017)
7. M.M. Rathore, A. Ahmad, A. Paul, S. Rho, Urban planning and building smart cities based on the Internet of Things using Big Data Analytics. Comput. Netw. 101, 63–80 (2016)
8. A. Medvedev, P. Fedchenkov, A. Zaslavsky, T. Anagnostopoulos, S. Khoruzhnikov, Waste management as an IoT enabled service in smart cities, in International Conference on Internet of Things and Smart Space. https://doi.org/10.1007/978-3-319-23126-6_10
9. T. Anagnostopoulos, K. Kolomvatsos, C. Anagnostopoulos, A. Zaslavsky, S. Hadjiefthymiades, Assessing dynamic models for high priority waste collection in smart cities. J. Syst. Softw. 110, 178–192 (2015)
10. T.K. Ghatak, Municipal solid waste management in India: a few unaddressed issues, in International Conference on Solid Waste Management, 5IconSWM 2015. Proc. Environ. Sci. 35, 169–175 (2016)
11. http://postscapes.com/waste-management-sensor-company-enevo-collects-158m-in-funding/
12. http://www.urbiotica.com/en/product/u-dump-m2m-2/
13. http://www.tst-sistemas.es/en/products/tswaste/
14. http://ecubelabs.com/blog/bin-level-sensors-5-reasons-why-every-city-should-track-theirwaste-bins-remotely/
15. K.A. Monika, N. Rao, S.B. Prapulla, G. Shobha, Smart dustbin—an efficient garbage monitoring system. Int. J. Eng. Sci. Comput. 6, 1173–1176 (2016)

Early Detection of Diabetes from Daily Routine Activities: Predictive Modeling Based on Machine Learning Techniques R. Abilash and B. S. Charulatha

Abstract Diabetes is a condition that affects the capacity of the body to absorb blood glucose, otherwise referred to as blood sugar. Despite careful observation and management, diabetes could raise sugar level in the blood, which can increase the risk of serious complications, including heart-related disease and many more. The aim of this study is to develop a system which can predict whether a person has a chance of getting affected by diabetes from their daily routine activities using two machine learning algorithms, namely random forest and Naive Bayes classification and finally comparing the accuracy metrics of the prediction. Keywords Classification · Diabetes · Machine learning · Naïve Bayes · Prediction · Random forest

1 Introduction

Most research on diabetes says that improper food habits are one of the main sources of diabetes; unfortunately, most people do not know that their daily food habits and routine activities can lead to diabetes, which in turn becomes a root cause for many other diseases. However, a good diet and exercise will lead to a healthy life. The world is also filled with diverse food cultures spread across regions, and a human body's behavior typically changes with food habits, climatic conditions, and physical movement. Considering the above factors, this research started with the aim of predicting the possibility of diabetes occurring, which in turn helps a person to change his or her daily routine activities and food habits. So the research is a proof to the real world that predicting diabetes is possible before it occurs.

R. Abilash (B) Department of IT, Jawahar Engineering College, Chennai, India e-mail: [email protected]
B. S. Charulatha Department of CSE, Rajalakshmi Engineering College, Chennai, India e-mail: [email protected]

© Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_10


This will help other research in the early prediction of a variety of diseases by training and predicting with similar kinds of datasets.

2 Literature Survey

Globally, by 2030, it is projected that diabetes will be the 7th leading cause of death [1]. Recently, numerous algorithms have been used to predict diabetes, including traditional machine learning methods (Kavakiotis et al. [2]) such as support vector machine (SVM), decision tree (DT), logistic regression, and so on. Polat and Güneş [3] distinguished diabetics from normal people by using C4.5 and neuro-fuzzy inference [3]. Yue et al. [4] used the quantum particle swarm optimization (QPSO) algorithm and a weighted least squares support vector machine (WLS-SVM) to predict type 2 diabetes [4]. Duygu and Esin [5] proposed a system to predict diabetes, called LDA-MWSVM; in this system, the authors used linear discriminant analysis (LDA) to reduce the dimensions and extract the features [5]. In order to deal with high-dimensional datasets, Zou et al. used principal component analysis (PCA) and minimum redundancy maximum relevance (mRMR) to reduce the dimensionality [6]. Georga et al. [7] focused on glucose and used support vector regression (SVR) to predict diabetes as a multivariate regression problem [7]. Moreover, more and more studies have used SVM methods to improve the accuracy (Kavakiotis et al. [2]). Ozcift and Gulten [8] proposed a novel ensemble approach, namely rotation forest, which combines 30 machine learning methods [9]. Han et al. [9] proposed a machine learning method that changes the SVM prediction rules, transforming the basis of SVM decisions into comprehensible and transparent rules, and used it to solve the class imbalance problem. Machine learning methods are widely used in predicting diabetes, and they give preferable results. The decision tree is one of the popular machine learning methods in the medical field and has great classification power, while random forest generates many decision trees. The neural network is a recently popular machine learning method which has better performance in many aspects. So in this study, we used decision tree, random forest (RF), and Naïve Bayes to predict diabetes.

3 Model Creation

3.1 Dataset Description

For this experiment, the authors collected a dataset from the southern part of India, where people are diverse in nature and food habits. The 725 samples collected were based on observations made on people of different age groups and genders, both diabetic and non-diabetic. The detailed description of the dataset is

Table 1 Characteristics of collected samples

Data set characteristics        Multivariate
Number of instances             725
Attribute characteristics       Characters, integer
Number of attributes            29
Associated tasks                Classification
Missing values?                 Yes
Data collection started date    04.01.2019
Data collection ended date      26.03.2019
Area                            Medical

published in https://archive.ics.uci.edu/ml/datasets.php. Observations are made on 29 attributes which are grouped under the following categories: personal details, food habits, common food habits, daily routine activities, sleep, work culture, clinical questions about health, and family medical history (Table 1).

3.2 Pre-processing

The multivariate data samples collected have missing values, and the observed datasets are for classification purposes. The missing values were filled using a statistical imputation method. Mean replacement is pursued to fill the missing values: each missing value is replaced by the mean value of the corresponding variable.
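A minimal sketch of this mean-replacement step with pandas is shown below; the file name 'diabetes_survey.csv' is an assumption standing in for the collected survey data:

# Mean imputation of missing numeric values (sketch; file name assumed).
import pandas as pd

df = pd.read_csv("diabetes_survey.csv")
numeric_cols = df.select_dtypes(include="number").columns
# Replace each missing numeric value with the mean of its column
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())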

3.3 Feature Selection

The selection of the subset of features is based on information theory with the univariate selection method chi-squared. The goal is to discard a feature if it provides little or no additional information beyond that subsumed by the other features. The χ² algorithm is based on the χ² statistic and consists of two phases. In the first phase, it begins with a high significance level. In the second phase, it merges the pair of adjacent intervals with the lowest χ² value. Merging continues until all pairs of intervals have χ² values exceeding the parameter determined by sigLevel. The empirical result on the data is shown below: among the 28 features, 7 features are removed from the original dataset due to very low scores after algorithm implementation (Table 2).


Table 2 Feature selection using univariate selection method chi-squared

Column   Feature name                                                               Score
29*      Do you have diabetes?                                                      39.67563
24       Are you peeing more often?                                                 39.12735
1        Age                                                                        31.96918
20       Do you have high blood pressure?                                           21.43962
26       Do you have blurred vision?                                                18.02559
27       Do you experience tingling or numbness in your feet, hands and fingers?    9.267698
10       Do you smoke?                                                              9.006207
23       Do you feel extreme hungry?                                                6.110004
9        Will you drink soft drinks?                                                4.704717
22       Are you reducing weight?                                                   3.211272
3        Are you a vegetarian?                                                      1.537422
13       Will you do exercise?                                                      1.416841
7        Do you eat fast food?                                                      1.225441
21       Do you have thyroid?                                                       1.180661
11       Do you consume alcohol?                                                    1.136515
8        Will you drink fruit juice?                                                0.978106
15       Sleeping time                                                              0.698379
12       Will you get up early in the morning?                                      0.686154
14       Will you do household work other than cooking?                             0.563781
25       Does your wound healing slowly?                                            0.544991
16       Late night sleeping habit                                                  0.402666

Note: * Refers to the output classification
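The scores in Table 2 come from the authors' data; a minimal sketch of how such chi-squared scores can be computed is given below, assuming the encoded survey attributes are in a DataFrame X with the diabetes label in y (names are assumptions):

# Chi-squared univariate scoring of features (sketch; X and y are assumed inputs).
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

def chi2_scores(X: pd.DataFrame, y) -> pd.Series:
    selector = SelectKBest(score_func=chi2, k="all")   # score every feature
    selector.fit(X, y)                                  # chi2 requires non-negative features
    return pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

# e.g., keep the 21 best-scoring features, as in the paper:
# top_features = chi2_scores(X, y).head(21).index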

3.4 Construction of Training and Testing Datasets

The dataset collected from diabetic and non-diabetic patients comprises 29 attributes and 725 instances. The dataset is randomly divided into a training dataset and a testing dataset consisting of 400 and 325 instances, respectively. Among the 29 attributes, the 29th attribute is considered the output (either diabetic or non-diabetic).

3.5 Random Forest Random forest (RF) is a compilation of decision trees for classification and regression based on the bagging theory. Due to their state-of-the-art reliability, they are very popular when handling complex data while avoiding over-fitting due to their existence of bootstrapping. Essentially, the RF predictions are the average prediction of the combined CARTs used to train the system. The most important parameters


for designing an RF model are the number of independent trees to grow and the number of randomly sampled characteristics used at each tree decision node. For the analysis, 12 separate trees with a Gini index are used for tree construction.

3.6 Gaussian Naïve Bayes The Gaussian Naïve Bayes algorithm is a special type of Naïve Bayes (NB) algorithm. It is used when there are continuous values for the features. It is also assumed that all the features follow the Gaussian distribution, i.e., the normal distribution.

4 Experimental Result

4.1 Model Implementation

This section summarizes the results obtained using the Gaussian Naïve Bayes and random forest algorithms on the diabetes dataset. It was observed that, on the given dataset, random forest performs better than Gaussian Naïve Bayes. For all the model building used in this paper, the number of features taken into consideration after feature selection is 21 among the 28 features. A total of 400 instances was considered for training and 325 instances for testing.
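A minimal sketch of the training stage is shown below; the 400/325 split and the 12-tree Gini forest mirror the paper, while the synthetic placeholder data, random seed, and stratification are assumptions (in practice X and y would be the 21 selected survey features and the diabetes label):

# Train random forest and Gaussian Naive Bayes on a 400/325 split (sketch).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Placeholder standing in for the 725-sample, 21-feature survey matrix
X, y = make_classification(n_samples=725, n_features=21, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=400, test_size=325, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=12, criterion="gini", random_state=42).fit(X_train, y_train)
gnb = GaussianNB().fit(X_train, y_train)

rf_pred, gnb_pred = rf.predict(X_test), gnb.predict(X_test)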

4.2 Predicting the Test Data

After building the models, 325 instances are considered for testing, and predictions for these 325 instances are obtained from both the RF model and Gaussian Naïve Bayes. The results are given below (Table 3).

Table 3 Confusion matrix of both GNB and RF

Gaussian Naïve Bayes:  array([[ 94, 22 ], [ 69, 140 ]])
Random forest:         array([[ 113, 3 ], [ 4, 205 ]])


4.3 Evaluation

A classification report is generated which includes evaluation metrics such as precision, recall, and f1-score (Figs. 1 and 2; Tables 4 and 5).
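The paper's own evaluation code is not reproduced in the text; a minimal sketch of how such a report can be produced with scikit-learn, continuing the training sketch above (y_test, rf_pred, and gnb_pred are assumed to be available), is:

# Produce confusion matrices and classification reports for both models (sketch).
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

for name, pred in [("Gaussian Naive Bayes", gnb_pred), ("Random forest", rf_pred)]:
    print(name)
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))
    print("accuracy:", accuracy_score(y_test, pred))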

Fig. 1 Comparison of the accuracies of the two algorithms depicted by a bar graph

Fig. 2 ROC curve indicating that the TPR is greater for random forest


Table 4 Metric matrix for Naïve Bayes

               Precision   Recall   f1-score   Support
0.0            0.58        0.81     0.67       116
1.0            0.86        0.67     0.75       209
Micro avg      0.72        0.72     0.72       325
Macro avg      0.72        0.74     0.71       325
Weighted avg   0.76        0.72     0.73       325

Table 5 Metric matrix for random forest

               Precision   Recall   f1-score   Support
0.0            0.97        0.97     0.97       116
1.0            0.99        0.98     0.98       209
Micro avg      0.98        0.98     0.98       325
Macro avg      0.98        0.98     0.98       325
Weighted avg   0.98        0.98     0.98       325

4.4 Performance Comparison of Random Forest and Naive Bayes Algorithm

The aim of the predictive algorithms is to create a model which has high performance in predicting unseen data, and the model must be reliable so that the performance figure is a reliable estimate. It is evident from the accuracy graph and the ROC curve that random forest outperforms the Naïve Bayes classifier on this small dataset. It is also evident from the ROC curve that the true positive rate is greater for random forest, which indicates its performance-wise superiority over the Naïve Bayes algorithm.

5 Conclusion and Future Enhancement

In this paper, the performance of two different classifiers, namely random forest and Naive Bayes, is studied and evaluated using different performance metrics on the dataset collected. Similar predictions could identify the possibility of diseases like cancer, food-related diseases, and thyroid disorders at an early stage, without a biopsy of any organ of the body. For further accuracy, the algorithms can be applied to a larger dataset and compared with different algorithms, and the results can be supplemented with clinical findings.


References

1. S.K. Sen, S. Dash, Application of meta learning algorithms for the prediction of diabetes disease. Int. J. Adv. Res. Comput. Sci. Manage. Stud. 2(12), 1–6 (2014)
2. I. Kavakiotis, O. Tsave, A. S, N. Maglaveras, I. Vlahavas, I. Chouvarda, Machine learning and data mining methods in diabetes research. Comput. Struct. Biotechnol. J. 15, 104–116 (2017)
3. K. Polat, S. Güneş, A hybrid approach to medical decision support systems: combining feature selection, fuzzy weighted pre-processing and AIRS. Comput. Methods Programs Biomed. 88(2), 164–174 (2007)
4. C. Yue, L. Xin, X. Kewen, S. Chang, An intelligent diagnosis to type 2 diabetes based on QPSO algorithm and WLS-SVM, in IITAW'08 International Symposium on Intelligent Information Technology Application Workshops (2008), pp. 117–121
5. D. Çalişir, E. Doğantekin, An automatic diabetes diagnosis system based on LDA-wavelet support vector machine classifier. Expert Syst. Appl. 38(7), 8311–8315 (2011)
6. Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, H. Tang, Predicting diabetes mellitus with machine learning techniques. Front. Genet. 9 (2018). https://doi.org/10.3389/fgene.2018.00515
7. E.I. Georga, V.C. Protopappas, D. Ardigò, D. Polyzos, D.I. Fotiadis, A glucose model based on support vector regression for the prediction of hypoglycemic events under free-living conditions. Diabet. Technol. Therapeut. 34–43 (2013). https://doi.org/10.1089/dia.2012.0285
8. A. Ozcift, A. Gulten, Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms. Comput. Methods Programs Biomed. 104(3), 443–451 (2011)
9. L. Han, S. Luo, J. Yu, L. Pan, S. Chen, Rule extraction from support vector machines using ensemble learning approach: an application for diagnosis of diabetes. IEEE J. Biomed. Health Inform. 19(2) (2015)

Classification of Gender from Face Images and Voice S. Poornima, N. Sripriya, S. Preethi, and Saanjana Harish

Abstract Automated acknowledgement of a human's gender is pivotal for various frameworks, like data recovery and human–machine collaboration, that process human source data. Automatic gender classification has become relevant to an increasing number of applications, particularly since the rise of social platforms and social media. Gender identification has been a widely researched area, but face attribute recognition from facial images still remains challenging. We propose a methodology for automatic gender classification based on feature extraction from facial images. Our methodology includes three main stages: preprocessing, feature extraction, and classification. Texture features, geometric moments, and histogram values of facial images are used to train the system. The eminent and efficient features are selected and trained using suitable machine learning classifiers, namely SVM, AdaBoost, and random forest. The developed model is used to identify the gender of an individual with an accuracy of approximately 95% and can be deployed in various scenarios, like the hospital registration process, the sports selection process, and airports, to identify the gender of a person.

Keywords Face · Gender · Feature extraction · Local binary pattern · Moments · Classification · SVM · AdaBoost · Random forest

1 Introduction Gender classification has gained popularity because of its varied applications. In surveillance systems, the prior identification of gender can substantially reduce the search space required to find a specific person among many in a video feed [1]. This

S. Poornima (B) · N. Sripriya · S. Preethi · S. Harish Department of Information Technology, SSN College of Engineering, Chennai, India e-mail: [email protected] N. Sripriya e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_11


automation could be carried out with different biometrics. Soft and hard biometrics include the face, fingerprint, hand, palm, periocular region, lips, iris, gait features, etc. Of those, face and voice are the two most common and effective ones, and they have been used to develop the proposed system based on the texture and geometric differences of primary features in males and females. The knowledge of one's gender can enable natural human–computer interaction. Due to erroneous reports, various levels of fraud have been reported in fields like sports and in places like hospitals or airports. Thus, automating gender identification is essential in the current industry.

2 Related Works

Various methods have been proposed for classifying gender on several controlled or uncontrolled datasets. The gender of an individual can be identified from facial images, or by considering only the periocular region [2, 3], which refers to the facial region in the vicinity of the eye, including eyelids, lashes, and eyebrow, or even by considering the iris portion [4] and the lips [5]. Researchers have gradually narrowed the region of interest of the face to explore the feasibility of using smaller, eye-centered regions for building a robust gender classifier around the periocular region alone. Gender differences are even found in the back of the hand (BH), forearm (FR), and palm (PL) of the upper limb [6], and in the muscles involved in a smile [7]. In order to enhance the ability to estimate or classify these attributes from face images, many methods have been put forward in the past years. An individual's gender can be predicted even from a voice sample. Many methods were suggested to investigate the spatiotemporal patterns of the gait sequence [1, 8] to infer the gender of the associated subject.

In order to isolate our regions of interest (ROIs), we first normalize the given image to achieve accurate results. Filters are applied to remove the unwanted noise. After the region refinement, each crop is resized to the same resolution for each respective region. All images were resized using bilinear interpolation with no anti-aliasing, other pre-filtering, or dithering. The intent of applying these methods for scaling [9] lies in their computational simplicity and proven performance, even when using low-quality images. For feature extraction, normalized pixel grayscale intensities of the image are used as a baseline feature in almost all research papers. The features considered in research papers are histograms of oriented gradients (HOG), local binary patterns (LBP), a discrete cosine transform (DCT) of the LBP features, geometric features [4] such as distX, distY, distCenter, iris area, pupil area, diffArea, and area ratio, and texture features [6] such as BGM, BSIF, CRBM, force field transform, Gabor filters, and GIST. Various classifiers used to identify the gender of an individual are linear discriminant analysis (LDA) with a rank-1 nearest neighbor (NN) classifier, principal component analysis (PCA) with a rank-1 NN classifier, support vector machines [3, 10, 11], Bayes rule [8], artificial neural networks [2], random forest [12], AdaBoost [13], etc.


The paper is organized as follows: Section 3 describes the overall flow of the proposed system, Sect. 4 discusses the results obtained from the different classifiers with various feature sets, and Sect. 5 concludes with the future work plan.

3 Proposed System The proposed system is modeled to determine the gender of an individual using biometric features. The system uses real-time dataset containing face and voice samples. The collected data was preprocessed to extract the facial region and converted to grayscale for equalization of RGB values. The identified face parts are cropped out of the original image and stored in a separate directory. Data is preprocessed to extract the required features such as texture features, geometric features, local binary pattern (LBP), and moment descriptors. Preprocessed data is divided into training datasets and testing datasets of different ratios, namely 80:20 or 70:30 and is given to different classifiers like support vector machine, AdaBoost, and random forest as shown in Fig. 1. Performance analysis of the classifiers is measured using accuracy, recall, f -measure, and precision.

3.1 Dataset

The dataset consists of approximately 250 real-time audio samples and facial images of different individuals at different resolutions (5 MP and 13 MP), from 100 males and 100 females. The dataset collection includes both zoomed-in and zoomed-out images taken at various distances, and voice recorded for 2.5 to 4 s in a predefined environment. Samples are shown in Fig. 2.

Fig. 1 Flowchart of gender detection system: Dataset Collection (Face and Voice) → Preprocessing (Noise removal and Face Detection) → Feature Extraction (Texture, LBP, Moments, Pitch) → Gender Classification


Fig. 2 Dataset samples of face images, female and male audio

3.2 Preprocessing

The preprocessing stage involves two main steps, namely noise removal using a median filter [14] and face detection using the Viola–Jones algorithm [15]. Preprocessed outputs are shown in Fig. 3.

Pseudocode for Preprocessing
Input: Original image
Output: Preprocessed image
For i ← 1 to number of images do
  1. Apply the Viola–Jones algorithm to detect the face
  2. Apply the Haar cascade to the image
     2.1 Perform integral image calculation
     2.2 Apply the AdaBoost algorithm
     2.3 Perform cascading on the image
  3. Crop and save the detected face from the image
End

Fig. 3 Preprocessed sample outputs—face detection
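The paper does not name a library for this step; a minimal sketch of Viola–Jones face detection using OpenCV's pre-trained Haar cascade is given below, where the input path 'dataset/face01.jpg' and the filter kernel size are assumptions:

# Median filtering followed by Haar-cascade face detection and cropping (sketch).
import cv2

cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_and_crop(path):
    image = cv2.imread(path)
    image = cv2.medianBlur(image, 3)                         # median filter for noise removal
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [gray[y:y + h, x:x + w] for (x, y, w, h) in faces]  # cropped face regions

crops = detect_and_crop("dataset/face01.jpg")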


3.3 Feature Extraction

Different and essential features have been extracted from the preprocessed face images and voice samples. Features such as texture features (mean, variance) [6], GLCM features (entropy, energy, ASM, contrast, dissimilarity, homogeneity, correlation), histogram values of LBP, and moment descriptors (area, center of mass, object features, projection skewness) [16] are extracted from the preprocessed face region, and pitch is extracted from the audio sample, in order to identify the essential attributes for the classification of gender. These features are obtained as decimal values and are stored in a csv file for training and testing.

Pseudocode for Feature Extraction
Input: The cropped face images
Output: Numeric values which represent the facial features
For i ← 1 to number of faces do
  Create an empty list
  Extract the texture features such as contrast, dissimilarity, etc., for the image
  Append the texture features to the list
  Compute the LBP values for the image
  Compute the histogram for the image using the LBP features
  Append the histogram values to the list
  Calculate raw moments of orders zero, one, two, and three
  Append the raw moments to the list
  Write the list values to a csv file
End
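A minimal sketch of these face features for one cropped image is given below, assuming an 8-bit grayscale NumPy array; scikit-image for GLCM/LBP, OpenCV for the raw moments, and the LBP parameters (P=8, R=1) are assumptions rather than the paper's exact settings:

# GLCM texture, LBP histogram, and raw moments for one cropped face (sketch).
import numpy as np
import cv2
from skimage.feature import graycomatrix, graycoprops, local_binary_pattern

def face_features(gray):
    feats = []
    glcm = graycomatrix(gray, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)
    for prop in ("contrast", "dissimilarity", "homogeneity", "energy", "correlation", "ASM"):
        feats.append(float(graycoprops(glcm, prop)[0, 0]))    # GLCM texture features
    lbp = local_binary_pattern(gray, P=8, R=1, method="uniform")
    hist, _ = np.histogram(lbp, bins=10, range=(0, 10))       # LBP histogram
    feats.extend(hist.astype(float))
    m = cv2.moments(gray)                                      # raw moments up to order three
    feats.extend([m["m00"], m["m10"], m["m01"], m["m20"], m["m02"], m["m30"], m["m03"]])
    return feats

Writing one such row per face (plus the voice pitch value) to a csv file yields the training table described above.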

3.4 Classification

The features extracted are stored in a csv file and given as input to various classifiers, namely linear support vector machine (SVM), AdaBoost, and random forest, and the results are compared for combinations of the extracted features. SVM is shown to be superior to traditional pattern classifiers (linear, quadratic, Fisher linear discriminant, nearest neighbor) as well as more modern techniques such as radial basis function (RBF) classifiers and large ensemble-RBF networks. Random forest works better as an ensemble technique when there are many redundant features to discriminate. The AdaBoost classifier works in an iterative process of tweaking parameters to make the classification strong, but it is sensitive to noisy data.
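A minimal sketch of this classifier comparison is shown below; the file name 'gender_features.csv', the 'gender' label column, the 80:20 split, and the hyper-parameters are assumptions:

# Compare linear SVM, AdaBoost, and random forest on the extracted features (sketch).
import pandas as pd
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("gender_features.csv")
X, y = df.drop(columns=["gender"]), df["gender"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

models = {"SVM (linear)": SVC(kernel="linear"),
          "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=1),
          "Random forest": RandomForestClassifier(n_estimators=100, random_state=1)}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))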


4 Results and Performance Measures

Standard performance measures such as accuracy, precision, recall, F-score, and false-positive rate are used to analyze the proposed system results obtained with the train set and test set. Features extracted from face and voice, such as texture, LBP histogram, moments, and pitch, are given to each classifier for training, and the test results are compared to find the more suitable gender classification attributes and classifier among SVM, AdaBoost, and random forest (Tables 1, 2, 3, and 4). When the real-time dataset is trained with image and voice features, considerable improvement in accuracy and reduction of the error rate is identified in the SVM classifier. The same process is repeated for the AdaBoost and random forest classifiers. SVM obtained an accuracy of around 80%, above 85% is obtained with AdaBoost, and above 95% with random forest classification. Overall, random forest proves to be the better classifier for gender identification with these image and voice features. Also, the error rate is very low in random forest compared to SVM and AdaBoost. When AdaBoost

Table 1 Performance measures obtained with SVM classifier

                     Texture + LBP   Moments   Pitch   Texture + LBP + Moments   Texture + LBP + Moments + Pitch
Accuracy (%)         75              62.5      70      75                        81.25
Precision   Male     0.71            0.62      0.65    0.71                      0.80
            Female   1.00            0.00      0.74    1.00                      0.83
Error rate           0.25            0.375     0.30    0.25                      0.18
Recall      Male     1.00            1.00      0.65    1.00                      0.89
            Female   0.33            0.33      0.74    0.33                      0.71
f1-score    Male     0.83            0.77      0.65    0.83                      0.84
            Female   0.50            0.00      0.74    0.50                      0.77

Table 2 Performance measures obtained with AdaBoost classifier Texture + LBP

Moments

Pitch

Texture + LBP + Moments

Texture + LBP + Moments + Pitch

79.16

95.83

62.5

75

87.5

Male

0.87

0.94

0.73

0.92

Female

0.67

1.00

0.44

0.58

0.86

0.208

0.041

0.375

0.25

0.125

Male

0.81

1.00

0.69

0.69

0.94

Female

0.75

0.88

0.50

0.88

0.75

Male

0.84

0.97

0.71

0.79

0.91

Female

0.71

0.93

0.47

0.70

0.80

Accuracy (%) Precision Error rate Recall f 1 -score

0.88

Classification of Gender from Face Images and Voice

121

Table 3 Performance measures obtained with Random forest classifier Texture + LBP Moments Pitch Accuracy (%)

95.83

93.75

0.94

0.91

0.75

0.94

0.95

Female

1.00

1.00

1.00

1.00

1.00

0.041

0.0625

0.18

0.04

0.03

Male

1.00

1.00

1.00

1.00

1.00

Female

0.88

0.83

0.57

0.88

0.92

Male

0.97

0.95

0.86

0.96

0.98

Female

0.93

0.91

0.73

0.95

0.96

Precision Male Error rate Recall f 1 -score

Texture + LBP + Texture + LBP + Moments Moments + Pitch

81.25 95.83

96.87

Table 4 Overall performance comparison in accuracy SVM

AdaBoost

Random forest

Texture features

75

75

95

Moments

62.5

95.83

93.75

Pitch

70

62.5

81.25

Texture + LBP

75

79.16

95.83

Texture + LBP + Moments

75

75

95.83

Texture + LBP + Moments + Pitch

81.25

87.5

96.87

is involved in classification, there is a huge difference in accuracy with moment descriptor features reducing the error rate. There is no consistency in accuracy when varying the train set. SVM and AdaBoost are better with pitch features compared to random forest. The comparison of different classifiers is represented using bar graph in Fig. 4. The ROC curve for SVM, AdaBoost, and random forest algorithm is represented in Fig. 5.

5 Conclusion

The proposed system identifies the gender of an individual using face and voice characteristics. A real-time dataset is created containing face images of males and females, captured at different resolutions between 5 MP and 13 MP cameras at varying distances. Audio of males and females is recorded for 2.5 to 4 s in a predefined environment. The system removes unwanted parts and noise from the input image and detects the face region using the Viola–Jones algorithm. The different features, namely texture from GLCM, moments, and LBP histogram, are extracted from the preprocessed face, and pitch is extracted from the voice.



Fig. 4 Graph representing comparison of accuracy of different classifiers

Fig. 5 ROC curve for SVM, AdaBoost, and random forest


The system is trained with these extracted features in different combinations using different classifiers, namely SVM, AdaBoost, and random forest. The system response for the test set with all classifiers is compared and analyzed for every feature combination involved. It is noted that random forest provides the highest performance, with an accuracy of approximately 95%. Furthermore, the dataset size could be extended with images captured at different angles and having different emotions. For voice, other acoustic factors like mean frequency, mindom, maxdom, and skew could be considered. A major focus of extension is to identify the third category of gender. Similar features could be used to train a set of transgender images and voices, so that the system works for all sorts of data in gender identification. Large datasets could be analyzed computationally in the next phase of work, to reveal patterns of gender identification using machine learning algorithms and automatic feature representations and extractions using deep learning networks.

References

1. S. Yu, T. Tan, K. Huang, K. Jia, X. Wu, A study on gait-based gender classification. IEEE Trans. Image Process. 18(8), 1905–1910 (2009)
2. J.R. Lyle, P.E. Miller, S.J. Pundlik, D.L. Woodard, Soft biometric classification using local appearance periocular region features. Pattern Recogn. 45(11), 3877–3885 (2012)
3. F. Alonso-Fernandez, J. Bigun, A survey on periocular biometrics research. Pattern Recogn. Lett. 82, 92–105 (2016)
4. V. Thomas, N.V. Chawla, K.W. Bowyer, P.J. Flynn, Learning to predict gender from iris images, in First IEEE International Conference on Biometrics: Theory, Applications, and Systems (IEEE, 2007), pp. 1–5
5. D. Stewart, A. Pass, J. Zhang, Gender classification via lips: static and dynamic features. IET Biomet. 2(1), 28–34 (2013)
6. F. Bianconi, F. Smeraldi, M. Abdollahyan, P. Xiao, On the use of skin texture features for gender recognition: an experimental evaluation, in Sixth International Conference on Image Processing Theory, Tools and Applications (IPTA) (IEEE, 2015), pp. 1–6
7. H. Ugail, Secrets of a smile? Your gender and perhaps your biometric identity. Biomet. Technol. Today 6, 5–7 (2018)
8. E.R. Isaac, S. Elias, S. Rajagopalan, K.S. Easwarakumar, Multiview gait-based gender classification through pose-based voting. Pattern Recogn. Lett. 126, 41–50 (2019)
9. J. Merkow, B. Jou, M. Savvides, An exploration of gender identification using only the periocular region, in Fourth IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS) (IEEE, 2010), pp. 1–5
10. Z. Stawska, P. Milczarski, Gender recognition methods useful in mobile authentication applications. Inf. Syst. Manage. 5(2), 248–259 (2016)
11. B. Moghaddam, M.H. Yang, Gender classification with support vector machines. U.S. Patent No. 6,990,217. U.S. Patent and Trademark Office, Washington, DC, 2006
12. E. Kremic, A. Subasi, Performance of random forest and SVM in face recognition. Int. Arab J. Inf. Technol. 13(2), 287–293 (2016)
13. S. Buchala, N. Davey, R.J. Frank, T.M. Gale, M. Loomes, W. Kanargard, Gender classification of face images: the role of global and feature-based information, in International Conference on Neural Information Processing (Springer, Berlin, Heidelberg, 2004), pp. 763–768
14. P.P. Acharjya, S. Mukherjee, D. Ghoshal, Digital image segmentation using median filtering and morphological approach. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 4(1) (2014)
15. D. Yadav, S. Shashwat, B. Hazela, Gender classification using face image and voice. IOSR J. Comput. Eng. 17(5), 20–25 (2015)
16. W. Chen, P. Lee, L. Hsieh, Gender classification of face with moment descriptors, in The Eighth International Multi-Conference on Computing in the Global Information Technology, ICCGI (2013)

An Outlier Detection Approach on Credit Card Fraud Detection Using Machine Learning: A Comparative Analysis on Supervised and Unsupervised Learning

P. Caroline Cynthia and S. Thomas George

Abstract Credit card fraud is a socially relevant problem that majorly faces a lot of ethical issues and poses a great threat to businesses all around the world. In order to detect fraudulent transactions made by the wrongdoer, machine learning algorithms are applied. The purpose of this paper is to identify the best-suited algorithm which accurately finds out fraud or outliers using supervised and unsupervised machine learning algorithms. The challenge lies in identifying and understanding them accurately. In this paper, an outlier detection approach is put forward to resolve this issue using supervised and unsupervised machine learning algorithms. The effectiveness of four different algorithms, namely local outlier factor, isolation forest, support vector machine, and logistic regression, is measured by obtaining scores of evaluation metrics such as accuracy, precision, recall score, F1-score, support, and confusion matrix along with three different averages such as micro, macro, and weighted averages. The implementation of the local outlier factor provides an accuracy of 99.7 and isolation forest provides an accuracy of 99.6 under unsupervised learning. Similarly, in supervised learning, the implementation of the support vector machine provides an accuracy of 97.2 and logistic regression provides an accuracy of 99.8. Based on the experimental analysis, both the algorithms used in unsupervised machine learning acquire a high accuracy. An overall good, as well as a balanced performance, is achieved in the evaluation metrics scores of unsupervised learning. Hence, it is concluded that the implementation of unsupervised machine learning algorithms is relatively more suitable for practical applications of fraud and spam identification.

Keywords Outlier detection · Accuracy · Precision · Recall · F1-score · Support

P. Caroline Cynthia Electronics and Communication Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu 641114, India S. Thomas George (B) Biomedical Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu 641114, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_12


1 Introduction

A culture of a cashless society has been growing at a very fast pace. Usage of bank cards (credit or debit cards) has become one of the most preferred options for people to pay their bills and manage their finances using their laptops or smartphones. People all around the world are used to performing most of their daily tasks online, such as booking cabs, shopping on e-commerce Web sites, and even recharging phones [4]. The Nilson Report, one of the most respected sources for news and analysis of the global card and mobile payment industry, estimates that for every $100 spent, seven cents are stolen. It estimates the total credit card fraud market to be around $30 billion [2]. Detecting whether a transaction is fraudulent or not using an efficient algorithm is a very impactful data science problem [3]. In this work, a comparative analysis of supervised and unsupervised machine learning is performed on a real-time credit card transaction dataset of more than 2 lakh transactions done by European cardholders in September 2013. In the first half of the work, unsupervised learning algorithms, namely local outlier factor and isolation forest, are deployed after exploring the data using clustering. Histograms and a correlation matrix are plotted in this technique. This provides clarity on the distribution of outliers and the correlation existing within the parameters of the dataset. In the second part of the work, supervised learning algorithms, namely support vector machine (SVM) and logistic regression, are implemented. In this technique, a certain amount of data has been trained, tested, and cross-validated. In both the machine learning techniques, a set of evaluation metrics, namely precision, recall, F1-score, and support, are found, and finally, three different average scores (micro, macro, and weighted average) were generated to understand the performance of the various methods.

2 Related Work

The challenge associated with problems like fraud and spam detection is that the fraudulent pattern of data keeps varying at different instants of time, and mostly fraud transactions are observed at a very sparse rate. Thus, there is a necessity of identifying fraudulent data lucidly using effective methods. In [1], the authors proposed a method to identify outliers by means of the receiver operating characteristic curve (ROC curve) and precision–recall curves. It is reported that precision–recall curves are more suitable since precision gives a comparison between false positives and true positives (outliers) [1]. This reduces the problem of imbalance in the class. Credit card fraud occurs in several ways. It is stated that, based on the kind of fraud faced by banks or credit card companies, various machine learning techniques can be applied to resolve it innovatively [2, 3]. A hidden Markov model was trained and applied to a set of credit card transactions to detect fraudulent transactions. The effectiveness of this method was compared with a few other commonly used methods. It was revealed that the accuracy of the system was up to 80% in HMM, and it had the capability to handle large data at a good pace [5]. But among all the existing techniques, it was observed that the class of ensemble methods yields the best results for practical problems due to its straightforward implementation. It was demonstrated that a bagging classifier based on decision trees was excellent, and applying the bagging ensemble algorithm is more advantageous on real-time fraud detection datasets [5, 6].

3 Materials and Methods

3.1 Description of the Work

The work has been carried out using two different techniques. In the first part, an unsupervised machine learning technique is used, where clustering of data is done by plotting histograms and a correlation matrix, after which outlier detection algorithms are deployed on the credit card dataset (Fig. 1). In the second part, the supervised machine learning technique is used. The data is explored and reduced by scaling features and plotting bar graphs for fraud data, and finally, algorithms such as support vector machine and logistic regression are deployed (Fig. 2). In the initial part, by importing the necessary libraries and packages, the data is loaded and explored. Principal component analysis (PCA) is applied to 28 parameters out of the 31 parameters in the dataset in order to safeguard the users' identities and sensitive features. Hence, the 28 transformed features, namely V1 to V28, contain numerical values. 10% of the entire data is considered to save time and computational requirements. The shape of the data is printed in terms of count, mean, standard deviation, minimum, and maximum deviation. The histograms are plotted for every feature, for which the x-axis denotes the attributes and the y-axis denotes the number of fraudulent data points. It is inferred that when histograms are aligned with zero, then there are no outliers. When a histogram is spread along the two sides of zero, it has many outliers. Thus, the number of fraud and valid cases in the dataset is determined. The outlier fraction is found using Eq. (1), given as

Outlier fraction = len(Fraud) / float(len(Valid))     (1)
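A small pandas sketch of this computation is shown below; the file name and column name follow the dataset description later in the paper and are assumptions here.

import pandas as pd

df = pd.read_csv("creditcard.csv").sample(frac=0.1, random_state=1)  # 10% of the data, as described
fraud = df[df["Class"] == 1]
valid = df[df["Class"] == 0]
outlier_fraction = len(fraud) / float(len(valid))                    # Eq. (1)
print(len(fraud), len(valid), outlier_fraction)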

Therefore, the number of fraud and valid data points along with the outlier fraction are found and displayed at the same time. Next, a correlation matrix is drafted which depicts correlation coefficients between variables, which enables us to summarize the data crisply. This serves as an input for the advanced analysis of the data, enabling it to be further reduced into an ordered, corrected, and simplified form. The new shape of the data is printed, after which it is filtered. Now the data has been completely


processed to deploy the machine learning techniques, namely local outlier factor (LOF) and isolation forest (IF), and to run the classification metrics to obtain the scores of the evaluation metrics, namely accuracy, precision, recall, F1-score, and support. In the second half of the work, supervised machine learning techniques are implemented. Under the supervised learning technique, training and testing of data in order to create a model are implemented. Initially, libraries such as NumPy and Pandas are imported. The data is explored and reduced differently for both algorithms, support vector machine and logistic regression. In support vector machine, StandardScaler is imported from the sklearn package and is used to standardize the features. It can be mathematically represented as follows:

Z = (x − u) / s     (2)

where Z is the standardized score, x is the sample, u is the mean of the training samples, and s is the standard deviation of the training samples. After this, the 'Time' and 'Amount' features are removed, and a new column named 'Net Amount' is included, which contains the new standardized values. In logistic regression, statistical measures such as count, mean, standard deviation, minimum, and maximum deviation are obtained. A fraud class histogram is plotted where the x label is 'Class' and the y label is 'Frequency'. In both algorithms, thirty percent of the entire class 1 (fraudulent transaction) and class 0 (valid transaction) data is selected in order to train the dataset. Subsequently, a training dataset is built and fitted as a model, and then both algorithms are deployed. The model is further evaluated using grid search cross-validation, K-fold cross-validation, and train_test_split to ensure its effectiveness. Finally, a classification report consisting of accuracy, recall, F1-score, and support, along with a confusion matrix, is obtained.
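A condensed sketch of this supervised pipeline is given below, assuming scikit-learn; the exact model parameters, split sizes, and cross-validation grids used by the authors are not reported, so the values here are placeholders.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv("creditcard.csv")
# Standardize 'Amount' into a new 'Net Amount' column and drop the raw columns
df["Net Amount"] = StandardScaler().fit_transform(df[["Amount"]])
df = df.drop(columns=["Time", "Amount"])

X, y = df.drop(columns=["Class"]), df["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))   # precision, recall, F1-score, support, averages
print(confusion_matrix(y_test, y_pred))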

3.1.1

Flow Diagram for Unsupervised Machine Learning Algorithm

3.1.2

Flow Diagram for Supervised Machine Learning Algorithms

3.2 Data Collection

The dataset was obtained from a data science and AI workbench public platform that was chosen for a research collaboration of Worldline and a Machine Learning Group on big data mining and fraud detection. The credit card transactions in the dataset were made by European cardholders in the year 2013. The total number of transactions in the dataset is 2,84,807, out of which 0.172% accounts for fraud. There are a total of 31 features, namely V1, V2, V3 till V28, along with the Time, Amount, and Class features. Features V1 to V28 are the principal components obtained with principal component analysis (PCA), and the only features which have not been transformed by PCA are 'Time' and 'Amount'. The principal components are input parameters


Fig. 1 Unsupervised learning methodology

Fig. 2 Supervised learning methodology

which are numerical values, and due to confidentiality issues the original features and essential information about the dataset were not revealed. The 'Time' feature consists of the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the money transacted by the cardholder, and this feature is used for the purpose of example-dependent cost-sensitive learning. 'Class' is the response variable; it takes the value 1 for fraud and the value 0 for valid transactions.

3.3 Classifiers

Local Outlier Factor (LOF) The role of this classifier is to figure out the fraudulent transactions (outliers) by calculating the 'local deviation density' of a sample with respect to the neighboring samples. Local deviation density refers to the identification of data points which show abnormal deviation. The fraudulent data is dislocated from the pattern of the data.

Isolation Forest Algorithm (IF) The role of this classifier is to find out fraudulent transactions by isolating the samples: a feature is chosen at random, and then a split value is generated between the maximum and minimum values of the selected feature. Thus, based on how far a data point is isolated, it is identified as fraudulent.

Support Vector Machine (SVM) This classifier formally separates the fraudulent and valid data by means of a hyperplane which creates a binary classification, namely fraudulent and valid samples in the dataset. Thus, based on the categorization of data, we can identify fraudulent transactions.

Logistic Regression (LR) This classifier uses the concept of probability to yield the output. It is also called a predictive analysis algorithm. It makes use of logistic functions to classify binary samples (fraudulent and valid samples).
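A minimal sketch of how the two unsupervised detectors can be fitted with scikit-learn is shown below; the contamination value is derived from the class counts as in Eq. (1), the neighbor count and random seed are assumptions, and the file name is a placeholder.

import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest

df = pd.read_csv("creditcard.csv").sample(frac=0.1, random_state=1)
X, y = df.drop(columns=["Class"]), df["Class"]
contamination = float((y == 1).sum()) / (y == 0).sum()   # expected outlier fraction

lof_pred = LocalOutlierFactor(n_neighbors=20, contamination=contamination).fit_predict(X)
iso_pred = IsolationForest(contamination=contamination, random_state=1).fit_predict(X)

# Both detectors return -1 for outliers (fraud) and 1 for inliers (valid);
# map that to the 1/0 labels used by the dataset for comparison with y
lof_labels = (lof_pred == -1).astype(int)
iso_labels = (iso_pred == -1).astype(int)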

4 Performance Metrics

Precision It is a score that defines the ability of the classifier to exactly label the valid data points as 0 and the fraudulent data points as 1. It is defined as the ratio of the total number of true positives (fraud samples) to the sum of true and false positives (fraud and valid samples):

Precision = tp / (tp + fp)     (4)

where tp is the number of true positives (fraud samples) and fp is the number of false positives (valid samples).

Recall It is a score that measures the capability of the classifier to identify all the true positives (fraud samples). The valid data is labeled as 0, and the fraud data is labeled as 1. It is defined as the ratio of the total number of true positives (fraud samples) to the sum of true positives (fraud) and false negatives (fraud samples indicated as valid samples):

Recall = tp / (tp + fn)     (5)

where tp is the number of true positives (fraud samples) and fn is the number of false negatives (fraud samples misrepresented as valid samples).

F1-Score It provides the weighted average of the precision and recall metric scores. It could be said that the classifier exhibits good performance when precision, recall, and F1-score lie in the same range. The valid data is labeled as 0, and the fraud data is labeled as 1. The formula for the F1-score is as follows:

F1-score = 2 * (precision * recall) / (precision + recall)     (6)

Support It is a score that defines the ability of the classifier to detect the total number of occurrences of each fraud and valid class in the dataset.

Micro Average This score calculates the metrics globally by summing up all the true positives (fraud samples) and false positives (valid samples indicated as fraud samples).

Macro Average This score calculates the metrics for each class (fraud and valid samples), and their respective unweighted mean is found.

Weighted Average This score calculates the metrics for each class and subsequently finds the average weighted by the support of each class (the number of true instances for both the fraud and valid classes).

Confusion Matrix It is one of the metrics used to measure the effectiveness of a classification algorithm. There are four values obtained in a confusion matrix, namely true positives, true negatives, false positives, and false negatives, through which the result is studied.

Accuracy It is defined as the ratio of correct predictions of fraud made by the classifier to the total number of input data samples, in order to figure out the effectiveness of the algorithm implemented in identifying fraudulent transactions:

Accuracy = (tp + tn) / (tp + fp + tn + fn)     (7)

where tp is the number of true positives (fraud samples), tn is the number of true negatives (non-fraud samples), fp is the number of false positives (non-fraud samples misrepresented as fraud samples) and fn is the number of false negatives (fraud sample misrepresented as non-fraud samples).
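The formulas in Eqs. (4)-(7) can be checked directly from a confusion matrix; the short sketch below is only illustrative and assumes binary 0/1 labels such as those produced by any of the classifiers above.

from sklearn.metrics import confusion_matrix, classification_report

def manual_metrics(y_true, y_pred):
    # ravel() on a binary confusion matrix yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)                                # Eq. (4)
    recall = tp / (tp + fn)                                   # Eq. (5)
    f1 = 2 * precision * recall / (precision + recall)        # Eq. (6)
    accuracy = (tp + tn) / (tp + fp + tn + fn)                # Eq. (7)
    return precision, recall, f1, accuracy

# classification_report(y_true, y_pred) additionally reports support and the
# micro, macro, and weighted averages discussed above.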

5 Results of Unsupervised Machine Learning Histograms are plotted for the 31 features of the dataset to understand the essential and non-essential features by observing the distribution of outliers (Fig. 3). X-axis refers to the attributes of the features, and the Y-axis refers to the number of frauds.


Fig. 3 Histograms of features

There are three different results obtained from the histograms, namely the outlier fraction, the number of fraud cases, and the number of valid cases: the outlier fraction is 0.0017234, the number of fraud cases is 49, and the number of valid transactions is 28,432. A correlation matrix is plotted to identify the relationships present between the features (Fig. 4). The intensity of the color determines the positive and negative correlation. Positive correlations are displayed in lighter colors, and negative correlations are displayed in darker colors. The reduced data used contains 28,481 records across the 30 columns of the dataset along with the target (Class feature). A classification report for the local outlier factor and isolation forest algorithms is generated. Here it is observed that both algorithms have obtained balanced results, since the values generated for precision, recall, and F1-score lie in almost the same or a nearby range for valid, fraud, and all the three averages. The number of valid and fraud occurrences is returned correctly in the support metric (Tables 1 and 2).

6 Results of Supervised Machine Learning A classification report for support vector machine and logistic regression is generated (Tables 3 and 4). Here it is observed that the support vector machine algorithm has obtained highly imbalanced results since there is a large difference observed in the precision, recall, and F 1 -score metrics. The support metric has returned abnormally big and small values for number of fraud and valid occurrences. The support vector machine has attained an accuracy of 97.297291. The overall recall score obtained is 0.93089430.


Fig. 4 Correlation matrix

Table 1 Metrics score for local outlier factor

Fraud/Valid     Precision   Recall   F1-score   Support
0               1.00        1.00     1.00       28,432
1               0.02        0.02     0.02       49
Micro avg.      1.00        1.00     1.00       28,481
Macro avg.      0.51        0.51     0.51       28,481
Weighted avg.   1.00        1.00     1.00       28,481

Table 2 Metrics score for isolation forest

Fraud/Valid     Precision   Recall   F1-score   Support
0               1.00        1.00     1.00       28,432
1               0.28        0.29     0.28       49
Micro avg.      1.00        1.00     1.00       28,481
Macro avg.      0.64        0.64     0.64       28,481
Weighted avg.   1.00        1.00     1.00       28,481

Table 3 Metrics score for support vector machine

Fraud/Valid     Precision   Recall   F1-score   Support
0               1.00        0.85     0.92       284,069
1               0.01        0.94     0.01       246
Micro avg.      0.85        0.85     0.85       284,315
Macro avg.      0.50        0.90     0.46       284,315
Weighted avg.   1.00        0.85     0.92       284,315

Table 4 Metrics for logistic regression

Fraud/Valid     Precision   Recall   F1-score   Support
0               1.00        1.00     1.00       85,321
1               0.54        0.67     0.60       122
Micro avg.      1.00        1.00     1.00       85,443
Macro avg.      0.77        0.84     0.80       85,443
Weighted avg.   1.00        1.00     1.00       85,443

The logistic regression has attained an accuracy of 99.8712592. The confusion matrix has been obtained for the support vector machine and logistic regression algorithms. It has generated highly unstable values with respect to the true values. Thus, it can be concluded that the performance of these supervised classifiers is not very good for this credit card fraud detection dataset. The confusion matrix generated for the support vector machine is as follows:

[ 241234   42835 ]
[     17     229 ]

The confusion matrix generated for logistic regression is as follows:

[ 85251   70 ]
[    40   82 ]

7 Discussion

This whole process has been carried out in the Python environment. The codes were launched and executed in Jupyter notebook from the Anaconda Navigator application (graphical user interface), which provides a web-based interactive computing notebook IDE. All the required libraries and packages, such as NumPy and SciPy along with the sklearn package, are imported. There are a total of 492 frauds in a total of 2,84,807 transactions, which results in 0.172% of fraud cases. This creates an imbalance in the data due to the smaller number of fraud transactions compared to the valid transactions.

8 Conclusion and Future Scope

The goal of this paper is to identify the best-suited algorithm which accurately finds out fraudulent transactions using supervised and unsupervised machine learning algorithms. The effectiveness of four different algorithms, namely local outlier factor, isolation forest, support vector machine, and logistic regression, is measured by obtaining scores of evaluation metrics such as accuracy, precision, recall score, F1-score, support, and confusion matrix. The accuracy obtained for the local outlier factor is 99.7, isolation forest is 99.6, support vector machine is 97.2, and logistic regression is 99.8. In the final analysis of this work, it has been established that the implementation of unsupervised machine learning algorithms is more suitable for this fraud detection, since the accuracy of both algorithms is high and there is a perfect balance observed among the results of the evaluation metrics. On the other hand, the supervised machine learning algorithms applied yield a lower accuracy when compared to the unsupervised machine learning algorithms. There also exists a high imbalance in their evaluation metric results. Thus, it can be concluded that the unsupervised machine learning technique provides an overall excellent efficiency in detecting credit card fraud with the highest accuracy for this real-time dataset. These algorithms can be deployed in an ATM, and the image of the person can be captured. If the machine senses abnormal patterns in the transaction, then a warning call can be sent to the nearest police station, or an intimation could be sent to the particular bank which had issued the credit card to the actual cardholder.

References

1. U. Porwal, S. Mukund, Credit card fraud detection in e-commerce: an outlier detection approach (2018). arXiv preprint arXiv:1811.02196
2. L. Delamaire, H.A.H. Abdou, J. Pointon, Credit card fraud and detection techniques: a review. Banks Bank Syst. 4(2), 57–68 (2009)
3. Y. Kou, C.-T. Lu, S. Sirwongwattana, Y.-P. Huang, Survey of fraud detection techniques, in IEEE International Conference on Networking, Sensing and Control, 2004, vol. 2 (IEEE, 2004), pp. 749–754
4. K. Chaudhary, J. Yadav, B. Mallick, A review of fraud detection techniques: credit card. Int. J. Comput. Appl. 45(1), 39–44 (2012)
5. M. Zareapoor, P. Shamsolmoali, Application of credit card fraud detection: based on bagging ensemble classifier. Procedia Comput. Sci. 48(2015), 679–685 (2015)
6. T.G. Dietterich, Ensemble methods in machine learning, in International Workshop on Multiple Classifier Systems (Springer, Berlin, Heidelberg, 2000), pp. 1–15

Unmasking File-Based Cryptojacking T. P. Khiruparaj, V. Abishek Madhu, and Ponsy R. K. Sathia Bhama

Abstract In the ever-shifting world of future technologies, there is a prolific increase in the computing resources of end devices. Simultaneously, advancements in technology have resulted in attackers exploiting various vulnerabilities in these devices. One such attack which has commenced on a large scale is surreptitiously carrying out the process of cryptomining by targeting users without their consent. This process is called cryptojacking. To overcome this problem, the proposed system deals with providing security features for detecting and controlling the act of file-based cryptojacking malware. The proposed model analyses the presence of a cryptojacking process in the victim's system in two ways: by analyzing the network traffic for mining-related header information, and by detecting anomalies in the CPU usage. The model detects cryptojacking using CPU-based outlier detection, implemented using a mathematical model with an error rate of 2%, and analysis of network packets, which proves to be faster and more meticulous than existing machine learning algorithms.

Keywords Cryptojacking · Mining · Malware · Outlier detection

1 Introduction

Cryptojacking is the process of utilizing someone's computing resources in an unauthorized manner for the purpose of mining cryptocurrencies. This activity of surreptitious cryptomining has been on the rise over the last decade. With the popularity of Ransomware attacks skyrocketing in the past years and many high-level

1 Introduction Cryptojacking is the process of utilizing someone’s computing resources in an unauthorized manner for the purpose of mining cryptocurrencies. This activity of surreptitious cryptomining process has been on the rise from the last decade. With the popularity of Ransomware attacks skyrocketing in the past years and many high-level T. P. Khiruparaj · V. Abishek Madhu · P. R. K. Sathia Bhama (B) Madras Institute of Technology—Anna University, Chennai, Tamil Nadu 600044, India e-mail: [email protected] T. P. Khiruparaj e-mail: [email protected] V. Abishek Madhu e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_13


organizations getting hit by the attack, hackers who had explored several ways of eliciting ransom from targeted victims finally appeared to have come to a conclusion. However, Ransomware attacks involved some form of interaction from the victim side in order to successfully achieve the goal. This is when the cryptojacking model of attack started to spread among the community of such attackers. A simple attack model with little or no interaction required from the user side made cryptojacking a household name among the attacker community, as it had the same motive as that of Ransomware attacks. According to Kaspersky, around 13 million incidents of cryptojacking were recorded in the year 2018 alone. Also, Ransomware attacks were found to be on the decline in the year 2018, with cases of cryptojacking taking over them. The two main variants which have been given significance are the in-browser and file-based cryptojacking activities. However, due to the consistent rise in in-browser cryptojacking compared to file-based, utmost importance has been given to the former rather than the latter. Consequently, the security methods imposed to curtail file-based cryptojacking malware are very limited and unreliable. Since file-based cryptojacking is more persistent compared to in-browser, attackers are now shifting toward file-based cryptojacking. Thus, it is undeniable that security features must be provided to detect and control cryptojacking attacks at all levels in an organization. A successful implementation of file-based cryptojacking and its detection are proposed in this paper. The primary motives of this work are as follows:
• Building an attack structure for intrusion and identifying the tools and modules for carrying out the implementation
• Incorporating the detection of file-based cryptojacking
• Proposing detection methods for file-based cryptojacking.

2 Related Work Durrani et al. in [7] have proposed a machine learning solution based on hardware-assisted profiling of browser code in real time. The detection was made possible by comparing the value of HPCs when the system is being mined and when it is not and then classifying accordingly. This method provides an instant detection of illegal mining activities in a system based on the HPC values. A disadvantage is that if mining takes place in a controlled manner, then the HPC values do not get raised to the point of detection and so the detection fails. Also, when the Web noise includes new high load Web apps, the proposed method might fail to classify correctly. CapJack uses the latest CapsNet technology to detect in-browser malicious cryptocurrency mining activities [5]. CapJack provided a higher accuracy rate of detection for single device as well as cross devices and works effectively under multitasking environments. This model has the capability of recognizing mixed/overlapped samples which is essential to solve this problem. CapJack provides a very low instant


detection rate. For cross models, which involve training and testing the data on different devices in different models, the rate of accuracy produced by this method deteriorates. The authors in [1] proposed a detection mechanism by dynamically analyzing the opcodes of cryptomining using a non-executable subject file. Four datasets (cryptomining, deactivated cryptomining, canonical and canonical injected sites) were taken and the classification tasks were performed using random forest (RF) classifier. The model classified the four different classes producing a high rate of accuracy. The dynamic analysis gives a high true positive rate during classification but since it is a process-based technique, the detection method can only be used for a single host and not for a network in its entirety. It is a reactive security model and hence the executable must be compiled or run at least once in order to create the opcode for the analysis and the number of opcodes generated depends heavily on the architecture being used by the machine. In paper [6], the authors propose existing counter measures and their limitations of in-browser cryptojacking and conclude with long-term countermeasures using insights from the analysis. Methods which follow blacklisting of IPs fail to detect Cryptojacking when an attacker creates new links that are not present in the public list of blacklisted URLs or use a relay server. The proposed approach overcomes this limitation by capturing and then inspecting the network traffic flowing between WebSockets. Through code-based analysis, results in malicious cryptojacking were detected with an accuracy of 96.4%. However, the method would fail to detect cryptominers using other popular blockchain-based mining protocols, such as Stratum, as the data in the frames used for detection would differ. Paper [2] introduces the different types of cryptojacking, giving an insight into its working and defining the various associated terms. They delve into the various stages involved and put forward the impact it has produced in the recent years. Finally, the authors discuss the ethics of cryptojacking and make their claim for various scenarios also providing some basic security measures. Hong et al. in [3] propose a novel detection system called CMTracker to detect the presence of mining scripts using behavioral analysis. One possible limitation of applying this technique is the performance overhead associated with the profiling of Web pages. Also, the authors acknowledge that there might be Web pages that evade the two profilers of CMTracker. Zareh et al. in [8] proposed a host-based approach called BotcoinTrap to detect bitcoin-mining botnets called botcoins at the lowest execution level by dynamically analyzing executable binary files. The detector monitors the current bitcoin block header which is the most important information in detecting bitcoin miners. As the block header is large, it cannot reside in the CPU registers and has to be read from the memory constantly by the malware during the mining process for calculating the hash value. This helps in detecting the malware executing the mining process. The study was conducted only for bitcoin miners and other type of miners were not analyzed.


In paper [4], a machine learning-based detection method which utilizes the information produced by flow-level traffic measurements of NetFlow/IPFIX has been proposed. The information gathered from NetFlow/IPFIX is valuable and can be used for detection. The CART and C4.5 algorithms had produced meticulous results of nearly 99% accuracy. However, there is a notable drawback in machine learning-based detection. There are numerous traffic patterns that are generated by various cryptocurrencies. This makes feature selection a complicated process. Training different features for different currencies can be a protracted process and is not preferable for real-time detection. Though there has been substantial research work in the area of browser-based cryptojacking attacks and their detection, there has been only a skimpy analysis of detecting and producing admissible protective measures for cryptojacking activities carried out through malicious files. The proposed system thus provides a reliable and feasible technique for the detection and control of cryptojacking activities implemented by utilizing a malicious file as an attack vector.

3 Materials and Methods 3.1 File-Based Cryptojacking Traditional file-based cryptojacking involves downloading and executing a malicious piece of code on the victim's system without his/her knowledge. These types of malicious codes can be easily detected by advanced security checks. Thus, another workaround involves infecting the whole network to which the victim is connected and trying to plant a backdoor on all the devices. After gaining access, cryptomining software is made to execute on the victims' systems through the backdoor, and it continues with the resource-intensive process of cryptomining for prolonged periods of time.

3.1.1

Embedding Cryptominer with the Malicious Payload

Programs can be structured to execute both pieces of software (backdoor and cryptominer) together or to execute them one after another (payload and then cryptominer). Execution of the payload by the victim creates a backdoor and simultaneously installs the necessary cryptomining software.

3.1.2

Executing Resource-Intensive Cryptojacking

Many popular cryptominers available in the market provide hackers numerous ways to carry out mining activities. Resource-intensive cryptojacking uses the victim's


computer resources for mining purposes. The more powerful the resource in the victim’s system, the more profitable the hacker becomes. As mentioned, the hackers generally target on a large scale, hence, the outcome of such activity can be lucrative to hackers while disastrous to many organizations and individuals.

3.1.3

Provision of Stealth Mode

The final step in the attack phase is to make the cryptominer undetectable and bypass the security features implemented by the system. In order to solve this challenge, the attacker can run specific commands once a backdoor is planted. In this way, the command will be considered a legitimate one, and there are various contemporary techniques to implement such a backdoor. The stealth nature is accomplished by making the backdoor a rootkit. The rootkit can be implemented at any level, such as the kernel or even the bootloader level. Rootkits are more prevalent among hackers than complicated exploits which correspond to specific versions of the distros. Figure 1 provides the interaction between the various components of the system.

Fig. 1 System architecture


3.2 Techniques for Detecting File-Based Cryptojacking Only an exiguous number of techniques are available for detecting file-based cryptojacking. Also, there is no single significant parameter to detect file-based cryptojacking. Hence, the proposed detection system subsumes multiple parameters to arrive at a reliable conclusion. One of these parameters is the result of network analysis, obtained by analyzing the packet headers for a pattern which matches the existing headers of various mining protocols. Analyzing the malware itself might not be of much use because, in the case of cryptojacking, it is used only as a carrier and not as the core part which runs the miner. The malware executes the cryptominer in the shell, after which it is no longer a significant part of the cryptojacking process. Another promising way to detect cryptominers is to monitor the CPU usage continuously. This technique can reveal any anomaly in the continuously monitored value against a specified threshold.

3.2.1

Network Analysis

Network analysis, nowadays, is used extensively for intrusion detection and provides more meticulous results compared to host analysis. In network analysis, the packets that are sent out and received in the network are captured, decrypted, and converted for further evaluation. Cryptojacking requires internetworking to operate. If a model can analyze one end computing device successfully, then it can also be applied to the network. The model primarily captures and dissects the packet headers and converts the information, such as packet metadata, into a readable format which it feeds as input to the subsequent step. Then, by using pattern matching, the model checks for relevant information supporting the presence of cryptomining in the packet headers captured using a network analyzer tool.

3.2.2

Anomaly Detection in CPU Usage Using Mathematical Model

The anomaly detection phase subsumes identifying values that do not fit within the provided range. This can be achieved through several techniques such as machine learning, mathematical model analysis, probabilistic analysis, and opcode analysis. For faster and more reliable results, the proposed model uses mathematical models for the analysis, as they require no training phase. Thus, the approach is quick to implement and more convenient for real-time detection. These models can be adjusted by providing a threshold and by setting various parameters. To detect cryptominers, the CPU usage is continuously monitored, and this model is applied over time for detecting anomalies in the usage. The required condition is that all cores are used 95% of the time even when the computer is idle. By doing so, it can be confirmed that some process is running in the background. Hence, by combining this method with the previous method, it is plausible to obtain a much more precise result.
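As a rough illustration of this idea, the sketch below samples the per-core CPU load with the psutil library and flags sustained near-full utilization; the 95% threshold and one-second sampling follow the description above, while the window length is an assumption.

import psutil

THRESHOLD = 95.0   # percent utilization considered suspicious when the machine should be idle
WINDOW = 60        # number of one-second samples to examine (assumed)

busy = 0
for _ in range(WINDOW):
    per_core = psutil.cpu_percent(interval=1, percpu=True)   # one-second sample per core
    if all(core >= THRESHOLD for core in per_core):
        busy += 1

if busy == WINDOW:
    print("All cores above 95% for the whole window: possible cryptomining activity")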


4 Experimental Analysis This section consists of analysis of the two phases proposed by the system— installation of the backdoor followed by the execution of the cryptominer software in the victim’s device. These phases are depicted below as two different cases: • Case 1 Utilizing metasploit framework for payload creation and planting of backdoor for remote and persistent access. • Case 2 Installing a client cryptominer software called MinerGate for performing the mining activity in the end system.

4.1 Metasploit Metasploit is a combination of the Msfvenom and Msfconsole frameworks, which agglomerates copious modules and exploits.
1. Msfvenom provides standardized command line options for the creation of payloads. These payloads are malicious codes embodied into a file.
2. Once a payload is created and encoded, it must be executed on the victim's computer to gain access into the system. Thus, to deceive the victim, social engineering attacks are utilized. The metasploit meterpreter is used for establishing a connection to the victim's system remotely. For this purpose, the reverse_tcp within the meterpreter is utilized, which requires the attacker to set up a listener to which the target machines can connect.
3. Once the victim executes the payload on his/her system, a session is commenced on the meterpreter shell in the attacker's system. This completes the act of intrusion into the system, where the attacker has now gained shell access to the victim's system through a backdoor.
4. After planting a backdoor, the connection can be made persistent by executing certain commands in the shell, thereby providing anonymity for the backdoor.

4.2 Minergate Minergate is a popular online cryptomining vendor. It provides a miner client with an effortless command line interface (CLI) for mining purposes. After the execution of the previously mentioned attack steps, the attacker enters into the victim’s system and has access to the user’s shell. Cryptomining can be performed now by running a shell script which in turn loads the cryptominer/malware into the system anonymously. Once the miner has been installed, the mining process can be initiated. Instead of using cryptominer software, the attacker can also utilize miner malwares which provide an ease of intrusion into the system bypassing security checks. Some of the popular miner malwares include Adylkuzz and Beapy.


Algorithm 1: Anomaly Detection

UsageFile = getUsageLog()
decompose(UsageFile)
    separate into observed, trend, seasonal, remainder and return
time_decompose(data, target, method = "twitter", trend)
    data = decompose(UsageFile)
    for each target t in data
        for each row r in t
            time_series_decompose(t, r, method = "twitter")
Anomaly detection
    Initialize data, target, method = "iqr", alpha, max_anoms
    splitRemainder(remainder)
        splits into remainder_l1, remainder_l2, anomaly and return
    anomalize(data, alpha, max_anoms)
        for each remainder r in data
            IQR(r)
    Adjust alpha to control the band
    Adjust max_anoms to control the % of outliers

Fig. 2 Detection of cryptojacking using IQR

4.2.1

Detection of Mining Process by Network Packet Analysis

The captured header information is compared with a predefined header format. As the mining process makes use of the Stratum protocol for communication between the miner client and the server, the packets are analyzed for Stratum protocol string formats. The captured packet’s data is not used for analysis as it changes from coin to coin. However, certain header information always remains constant throughout the


transactions. The pattern matching provides more meticulous results. The matching process was able to accurately detect the transactions that were made to the cryptominer's server in considerably less time.
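A simplified sketch of this pattern matching is given below; it assumes packets have already been captured to a pcap file (for example with a capture tool such as tcpdump) and uses Scapy to read them, and the set of Stratum method strings is an illustrative subset rather than the authors' full signature list.

from scapy.all import rdpcap, Raw

# JSON-RPC method names commonly seen in Stratum mining traffic (illustrative subset)
STRATUM_MARKERS = (b'"mining.subscribe"', b'"mining.authorize"', b'"mining.submit"')

packets = rdpcap("capture.pcap")                 # hypothetical capture file
for pkt in packets:
    if pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if any(marker in payload for marker in STRATUM_MARKERS):
            print("Possible Stratum mining traffic:", pkt.summary())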

4.2.2

Anomaly Detection in CPU Usage Using Mathematical Model

In this method, the mean CPU usage of all cores is recorded. It is adequate to record the usage every 1 s. For the detection of outliers, the interquartile range (IQR) method is utilized (Fig. 2). It is used to detect anomalies in a univariate dataset that follows an approximately normal distribution. The recorded CPU usage is passed as an input to the IQR test on the remainders, which is implemented using R programming. The results are reliable for detecting high usage over prolonged periods. By analyzing the cumulative output of both methods, the detection of a cryptojacking process in the network is achieved with an error rate of about 2%.
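The IQR test itself is straightforward; the small NumPy sketch below illustrates the check applied to a recorded usage series (the original work applies it to the remainder component after time-series decomposition in R, which is not reproduced here, and the multiplier k is an assumption).

import numpy as np

def iqr_outliers(samples, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(samples, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(samples) if v < lower or v > upper]

# Example: mean CPU usage sampled once per second (synthetic values)
usage = [12, 14, 11, 13, 15, 12, 98, 99, 97, 13, 12]
print(iqr_outliers(usage))      # flags the sustained high-usage samples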

5 Conclusion

The results procured from the above methods have proven to be reliable and implementable in real time. Even though there are some limitations that curtail the meticulous output, such as the limitation posed by encrypted network traffic, those are applicable only to small-scale detections. Carrying out the detection at a large scale requires an efficient technique that can be modified quickly as needed. Thus, the above-mentioned algorithm provides a practical approach for detecting cryptojacking. Future work includes the control of cryptojacking in IoT networks, providing a low-power controlling algorithm implemented as a security feature for IoT networks.

References

1. D. Carlin, P. O'kane, S. Sezer, J. Burgess, Detecting cryptomining using dynamic analysis, in 2018 16th Annual Conference on Privacy, Security and Trust (PST) (IEEE, 2018), pp. 1–6
2. S. Eskandari, A. Leoutsarakos, T. Mursch, J. Clark, A first look at browser-based cryptojacking, in 2018 IEEE European Symposium on Security and Privacy Workshops (EuroS&PW) (IEEE, 2018), pp. 58–66
3. G. Hong, Z. Yang, S. Yang, L. Zhang, Y. Nan, Z. Zhang, M. Yang, Y. Zhang, Z. Qian, H. Duan, How you get shot in the back: a systematical study about cryptojacking in the real world, in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (ACM, 2018), pp. 1701–1713
4. J.Z. i Muñoz, J. Suárez-Varela, P. Barlet-Ros, Detecting cryptocurrency miners with netflow/ipfix network measurements, in 2019 IEEE International Symposium on Measurements & Networking (M&N) (IEEE, 2019), pp. 1–6
5. R. Ning, C. Wang, C. Xin, J. Li, L. Zhu, H. Wu, Capjack: capture in-browser crypto-jacking by deep capsule network through behavioral analysis, in IEEE INFOCOM 2019-IEEE Conference on Computer Communications (IEEE, 2019), pp. 1873–1881
6. M. Saad, A. Khormali, A. Mohaisen, End-to-end analysis of in-browser cryptojacking. arXiv preprint arXiv:1809.02152 (2018)
7. R. Tahir, S. Durrani, F. Ahmed, H. Saeed, F. Zaffar, S. Ilyas, The browsers strike back: countering cryptojacking and parasitic miners on the web, in IEEE INFOCOM 2019-IEEE Conference on Computer Communications (IEEE, 2019), pp. 703–711
8. A. Zareh, Botcointrap: detection of bitcoin miner botnet using host based approach, in 2018 15th International ISC (Iranian Society of Cryptology) Conference on Information Security and Cryptology (ISCISC) (IEEE, 2018), pp. 1–6

Selection of a Virtual Machine Within a Scheduler (Dispatcher) Using Enhanced Join Idle Queue (EJIQ) in Cloud Data Center G. Thejesvi and T. Anuradha

Abstract Cloud technology is a fast-growing technology which supports a wide range of services at less cost. The majority of costly services like ERP software tools, hardware equipment, and packages can be availed here with the basic concept of 'what and how much we used, pay for that only.' There are different concerns to be considered in cloud computing for researchers; they are load balancing, security issues, and availability of resources. Join idle queue (JIQ) is a dynamic algorithm used for virtual machine allocation for incoming jobs within a data center. Some issues were found in the JIQ algorithm. In this paper, we propose enhancements to the JIQ algorithm and use them for optimized load balancing across the virtual machines in the data center.

Keywords Load balancing · Join idle queue · Scheduler · Virtual machines · Job allocation

1 Introduction

After doing a deep and detailed examination of existing load balancing algorithms, it was concluded that no algorithm yields the best performance in the cloud environment, and the limitations of the existing algorithms are given as under.
• Most of the algorithms are not suitable for a distributed environment because they follow a centralized approach;
• Most of the algorithms are static in nature. While scheduling a task to a VM, they do not consider the VM processing capacity, the number of tasks in the queue, each job size, etc.; and they do not check for prior overloading of the system.

G. Thejesvi (B) · T. Anuradha Department of Computer Science, Dravidian University, Kuppam, Andhra Pradesh 517425, India e-mail: [email protected] T. Anuradha e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_14


The limitations of existing algorithms have become the motivation for proposing 'Hybrid Model Load Balancing' (HMLB), designed by combining characteristics of the existing 'Join Idle Queue' (JIQ) algorithm, which is discussed below.

2 Join Idle Queue Algorithm

Lu et al. in [1] have explained the join idle queue (JIQ) scheduling method for proper job load adjusting. JIQ uses a two-level scheduling architecture [2]. To realize the idea of two-level scheduling, this architecture is designed with distributed schedulers. Numerous schedulers are utilized, and the number of schedulers is small with respect to the number of virtual machines. Each scheduler will keep and maintain a set of virtual machines, and the total status of those virtual machines will be followed up by that scheduler only [3]. The two levels of the architecture are as follows: at the first level, when a job arrives, the scheduler maps it to the idle queue and assigns one idle virtual machine from the queue. At the second level, when the assigned virtual machine finishes its job, it reports to any one of the schedulers. When a scheduler receives a job, it will check the idle queue and assign the new job to one of the idle VMs in the idle queue. If no idle virtual machine is found in the idle queue, then it will allocate the job to any one of the virtual machines. Here, the allocation of the job is not properly done, which affects the performance of the system [4]. Once a task is finished by a virtual machine, it updates its idle status and reports to any one of the schedulers. In this way, maintaining all idle virtual machines in the idle queue reduces the effort of identifying a VM for assigning a new task.

Figure join idle queue


The pros and cons of this JIQ algorithm are discussed as follows: Advantages • • • •

The job allocation is distributed. Supports and apt for large system. Reduces the communication overhead for finding the idle Vm’s. Divides the additional task for identifying idle virtual machine to assign new job.

Disadvantages • • • •

Without considering the job requirements, virtual machine is assigned. Selection of scheduler is not defined. Selection of Vm if no idle Vm is found. If no idle vm’s are found in idle queue, then does not consider processing power of vm. • It does not support scalability. The main drawback of the existing JIQ algorithm is if no idle virtual machines are available, then the selection is not properly assigned which may leads to improper load balancing. The selection of user request to the particular scheduler among the different schedulers is not specified with proper mechanism. These drawbacks can reduce the overall performance of the system [5]. The main drawbacks of join idle queue were studied and analyzed. The drawbacks can be overcome by adding small features to the existing JIQ algorithm. The proposed enhanced JIQ algorithm is explained as follows. Enhanced Join Idle Queue (EJIQ) Algorithm Algorithm EJIQ ( ) { All available idle Vm’s are maintained in idle queue of a scheduler; When a new job arrives to a data center; { On checking the all schedulers, finds one scheduler based on its scheduler state and its Ɵme. The selected scheduler first checks its idle queue for vm’s. If (idle vm is found) then { New job is assigned to that idenƟfied virtual machine. Removes that vm from the idle queue; }


    Else
    {
      The new job is directed to the waiting queue of the scheduler;
      Refresh frequently;
    }
    If (any VM finishes its job) then
    {
      The VM updates its status to its scheduler and joins its idle queue;
    }
  }
}

2.1 Methodology for Selecting VM in a Scheduler

1. The datacenter controller (DCC) selects one scheduler based on the scheduler state and its state time. Each scheduler maintains information on all the available idle virtual machines, the jobs waiting queue, and the job allocation in that scheduler, in the form of three tables as follows [2]:

(a) Idle queue: Vm id | Idle time
(b) Waiting queue: Job id | Waiting time
(c) Job allocation: Vm id | Job id

Idle Queue Table: The available idle virtual machines in that scheduler are maintained in this idle queue along with their idle state times.

Idle queue length = Σ (no. of virtual machines in the idle queue)


Waiting Queue Table: Jobs are allocated to this particular scheduler, and when no idle virtual machine (VM) is available in the idle queue, the job is directed to the waiting queue along with its waiting time.

Waiting queue length = Σ (no. of jobs waiting in the waiting queue)

Job Allocation Table: When a job is allocated to a particular virtual machine taken from the idle queue, the job along with its allotted virtual machine is recorded in the job allocation table. This job allocation table is updated at every refresh time. If any job completes, it is removed from the job allocation table and the VM is updated back into the idle queue.

2. When a new job arrives at a particular datacenter, the datacenter controller (DCC) checks the status of all schedulers and selects one scheduler.
2.1: if (idle queue is not empty)
  Assign the new job to the first virtual machine in the idle queue;
  Remove that VM from the idle queue;
  Update the virtual machine (VM) and the corresponding job id in the job allocation table;
2.2: if (idle queue is empty)
  Direct the new job to the waiting queue;
  Store the job id and its waiting time;
  Keep checking the idle queue at every refresh time;
  Else if (any idle VM is found) then
    Remove the allocated VM's old entry from the allocation table;
    Sort all the jobs in the waiting queue based on time;
    Assign the first job in the waiting queue to the new VM in the idle queue;
    The scheduler removes this job from the waiting queue;
    The idle VM is then removed from the idle queue by the scheduler;
    Update Vm_id and Job_id in the allocation table;
3. The jobs in the waiting queue and the idle VMs in the idle queue are sorted and stored, and the allocation is also done in the same order.


Algorithm

Algorithm Hybrid ( )
{
  The available idle VMs are maintained by each scheduler;
  While (the data center receives a job request)
  {
    The data center controller (DCC) checks all the schedulers and sends the task to one of the moderators or schedulers based on its state and time;
    When a new job arrives, the scheduler checks its idle queue;
    If (a virtual machine is found) then
    {
      Sort all available VMs in the idle queue based on time;
      Allocate the first VM in the idle queue to the newly arrived job;
      The allocated VM is then removed from the idle queue by the scheduler;
      Update the selected virtual machine Vm_id and Job_id in the allocation table;
    }
    Else
    {
      The scheduler assigns the task to the waiting queue;
      Refresh at every frequent time interval;
      Idle VMs are checked in the queue;
      If (any idle VM is found) then
      {
        Remove the allocated VM's old entry from the allocation table;
        Sort all the jobs in the waiting queue based on time;
        Assign the first job in the waiting queue to the new VM in the idle queue;
        The scheduler removes this job from the waiting queue;
        The idle VM is then removed from the idle queue by the scheduler;
        Update Vm_id and Job_id in the allocation table;
      }
    }
  }
}
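To make the bookkeeping of the hybrid scheme concrete, the following is a minimal, self-contained Python sketch of one scheduler's three tables (idle queue, waiting queue, allocation table) together with its dispatch and refresh steps. It is our own illustration of the pseudocode above, not the authors' CloudAnalyst implementation; the names Scheduler, dispatch, refresh and select_scheduler are ours.

import time
from typing import Dict, Optional

class Scheduler:
    """Bookkeeping for one scheduler: idle queue, waiting queue, allocation table."""
    def __init__(self) -> None:
        self.idle_queue: Dict[str, float] = {}     # vm_id -> time it became idle
        self.waiting_queue: Dict[str, float] = {}  # job_id -> time it started waiting
        self.allocation: Dict[str, str] = {}       # vm_id -> job_id

    def vm_finished(self, vm_id: str) -> None:
        """A VM reports completion: clear its allocation entry and rejoin the idle queue."""
        self.allocation.pop(vm_id, None)
        self.idle_queue[vm_id] = time.time()

    def dispatch(self, job_id: str) -> Optional[str]:
        """Assign the job to the longest-idle VM, or park it in the waiting queue."""
        if self.idle_queue:
            vm_id = min(self.idle_queue, key=self.idle_queue.get)  # sorted by idle time
            del self.idle_queue[vm_id]
            self.allocation[vm_id] = job_id
            return vm_id
        self.waiting_queue[job_id] = time.time()
        return None

    def refresh(self) -> None:
        """Periodic refresh: move the oldest waiting jobs onto newly idle VMs."""
        while self.idle_queue and self.waiting_queue:
            job_id = min(self.waiting_queue, key=self.waiting_queue.get)
            vm_id = min(self.idle_queue, key=self.idle_queue.get)
            del self.waiting_queue[job_id]
            del self.idle_queue[vm_id]
            self.allocation[vm_id] = job_id

def select_scheduler(schedulers: list) -> "Scheduler":
    """One possible DCC rule (an assumption, not specified in the paper): pick the scheduler with the most idle capacity."""
    return max(schedulers, key=lambda s: len(s.idle_queue))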

3 Conclusion

The enhanced join idle queue algorithm was implemented in the CloudAnalyst simulator. The proposed load-balancing algorithm was compared with the remaining load-balancing algorithms such as round robin and throttled, and it showed better performance in finding the best, optimized virtual machine within a scheduler.


References 1. Y. Lu, Q. Xie, G. Kliot, A. Geller, J.R. Larus, A. Greenberg, in Join-Idle-Queue: A Novel Load Balancing Algorithm for Dynamically Scalable Web Services. Extreme Computing Group. Microsoft Research ICES 2. G. Thejesvi, T. Anuradha, Distribution of work load at balancer level using enhanced round robin scheduling algorithm in a public cloud. Int. J. Adv. Electron. Comput. Sci. (IJAECS) 3(2) (2016) 3. G. Thejesvi, T. Anuradha, Distribution of work load at main controller level using enhanced round robin scheduling algorithm in a public cloud. Int. J. Comput. Sci. Eng. (IJCSE) 3(12) (2015) 4. G. Thejesvi, T. Anuradha, Proposed model for proper balancing of load in the public cloud. Int. J. Adv. Res. Comput. Sci. Softw. Eng. (IJARCSSE) 6(8) (2016) 5. G. Thejesvi, T. Anuradha, Selection of a scheduler (dispatcher) within a datacenter using enhanced equally spread current. Int. J. Eng. Sci. Invent. (IJESI). ISSN (Online): 2319-6734, ISSN (Print): 2319-6726 6. https://www.znetlive.com/blog/what-is-load-balancing-in-cloud-computing-and-its-advant ages/ 7. A. Khiyaita, M. Zbakh, H. El Bakkali, D. El Kettani, in Load Balancing Cloud Computing: State of Art (IEEE, 2012). 9778-1-4673-1053-6/12/$31.00 8. A.M. Alakeel, A guide to dynamic load balancing in distributed computer systems. Int. J. Comput. Sci. Netw. Secur. (IJCSNS) 10(6), 153–160 (2010) 9. M.A. Vouk, Cloud computing issues, research and implementations, in Proceedings of the ITI 2008 30th International Conference on Information Technology Interfaces, 23–26 June 2008 10. Z. Xu, H. Rong, Performance study of load balancing algorithms in distributed web server systems, in CS213 Parallel and Distributed Processing Project Report (2009) 11. A. Bhadani, S. Chaudhary, Performance evaluation of web servers using central load balancing policy over virtual machines on cloud, in Proceedings of the Third Annual ACM Bangalore Conference (COMPUTE), Jan 2010

An Analysis of Remotely Triggered Malware Exploits in Content Management System-Based Web Applications C. Kavithamani, R. S. Sankara Subramanian, Srinevasan Krishnamurthy, Jayakrishnan Chathu, and Gayatri Iyer

Abstract A web content management system, referred to here as a CMS, is a customizable software platform on which websites can be easily built. According to Web Technology Surveys, more than 55% of websites on the Internet use one of the many popular CMS platforms [1]. Their popularity arises from ease of use, the capability to customize, and the ability to abstract complex programming layers into a very simple user interface, especially for a novice user. However, while the user may have very little or no knowledge of the backend database system containing sensitive information, this platform offers itself as an easy target for ever-expanding malware threats. In this paper, we explore and analyze one such malware in depth, focusing on the origin of the attack, its deployment, its regeneration methods, and the cost of combating it and retrieving the compromised data.

Keywords Malware exploits · Content management systems (CMS) · Malware analysis · Remote-triggered viruses · Zero-day attacks · Malicious redirects · Malware regeneration

C. Kavithamani (B) Bharathiar University, Coimbatore, Tamilnadu, India e-mail: [email protected] R. S. Sankara Subramanian Department of Mathematics, PSG Institute of Technology and Applied Research, Coimbatore, India e-mail: [email protected] S. Krishnamurthy · G. Iyer Santa Clara, CA, USA J. Chathu Sunnyvale, CA, USA © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_15


1 Introduction Content management systems (CMS) were originally developed for newspapers and blogs to easily publish informational content on the Internet [2]. One of the first CMS products, Interwoven, was an automation to produce static html pages more efficiently with versioning and tools. Over the years, they have evolved from their simple implementation to become the backbone of broad class of websites like ecommerce, corporate, small and medium business websites. In the early days, the actual content and the layout both were coupled together. In the modern CMS, they are separated out as content and layout. A layout also called as a template is stored in a database. These two entities are glued together by a server-side implementation using programming languages such as PHP, JavaScript, JSP, or Java. A very popular CMS technology stack, commonly referred to as LAMP, is a combination of an open-source operating system (Linux), a web server (Apache), a backend database (MySQL), and an easy-to-learn HTML generator (PHP). Over 50% of all CMS-based websites today use PHP and MySQL [1]. The most popular CMS today are WordPress, Django, Drupal, and Joomla. Web site development on a CMS platform has many advantages as it comes with many in-built features like ease of development, deployment, and management of a web application. On the flip side, they are an easy target for attackers. Multiple factors contribute to this, including • Availability of the platform source code and open protocols. • A flexible language like PHP through which any operating system commands can be executed. For example, eval(), which is a naïve function that executes the string passed to it and processes it. An attacker can inject malicious code in a well-hidden eval function. • Novice users who do not monitor or apply security patches for known vulnerabilities. • Failure to keep the authentication credentials secure. • Lack of zero-day attack prevention mechanisms. Among all the malware attacks, the common web application vulnerabilities heavily targeted by hackers are SQL Injection, cross-site scripting (XSS), remote file inclusion, PHP injection, and command injection. Path traversal, local file inclusion, OS command execution, denial of service, and cross-site request forgery (CSRF) are other vulnerabilities targeted. Data collated from the National Vulnerability Database [3], as depicted in Fig. 1, shows the web attack distribution in the last three years. These are known vulnerabilities that are discovered and documented in CVE vulnerability database. As reported by Akamai research [4], there are over 3.9 billion attacks that were detected by web application firewalls in a 17-month period, targeting these known vulnerabilities. Apart from these vulnerabilities in the web application stack, around 8% of attacks happen through broken authentication, security misconfiguration, and phishing. Our study finds that the recent attacks are sophisticated, where attackers obfuscate large part of their attack. The obfuscation techniques are simple


Fig. 1 Web attack distribution by type [3]

like URL-encoding or base64 encoding. But the attacks escape security mechanisms, taking benefit of the ignorance of the user about the underlying technology stack of the web application. Broadly, the attacks either exploit the hosting website or the users visiting the website. Attacks targeting unsuspecting users try to exploit the user’s browser or operating system vulnerability. Server-side attacks mainly aim to obtain sensitive and valuable information. They are also used for denial of service, defacing the website, sensitive data exposure or usage of resources to host malign content. Of all the attacks, remote-triggered attacks are harder to eradicate. These attacks are implanted in such a way that even if virus scanners or the end user removes the malicious content, a hidden attack vector replicates the attack. The attackers also create multiple user accounts to make sure that they can remotely invoke the replication in cases where the self-replication fails. Thus, it is extremely hard to clean a website from an attack exposure, and the only solution is to recreate the website from a known good backup. E-commerce websites that get exposed to such attacks lose revenue because of their downtime. If they lose any user identifiable information like credit card or personal information, they become liable for that and incur heavy losses. In this paper, we discuss the attacks we encountered during our research in the last two years. The attacks follow a similar pattern, right from exploiting a known vulnerability in the platform, auto replication mechanism, and hidden code that gets triggered from remote.
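To illustrate how shallow these encodings are, the short Python sketch below peels one layer of URL-encoding and one layer of Base64 from a payload string. The example payload is a hypothetical one of our own construction, not code taken from a real attack.

import base64
import urllib.parse

def peel(payload: str) -> str:
    """Undo one layer of URL-encoding followed by one layer of Base64."""
    url_decoded = urllib.parse.unquote(payload)
    return base64.b64decode(url_decoded).decode("utf-8", errors="replace")

# Hypothetical obfuscated payload: Base64 of a one-line PHP snippet, then URL-encoded.
obfuscated = urllib.parse.quote(base64.b64encode(b"eval($_POST['cmd']);").decode())
print(peel(obfuscated))  # prints: eval($_POST['cmd']);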

158

C. Kavithamani et al.

2 Related Work Meike et al. [5] discuss prevalent security issues in open-source web content management systems. In [6], David Hauzar and Jan Kofron have provided detailed security analysis of PHP-based web applications. Contu et al. [7] provide an overview of few common CMSs and the security issues faced by these systems. Their study also lists different security add-ons present in the common CMS but does not get into the details of any particular type of attack. Trunde and Weippl [8] describe SQL Injection attack in detail in their paper. Their work also shows how the malware scanning tools available in the market are incapable of finding attack vectors in the detection phase. Kaur and Kaur [9] use web hacking incident database, a consortium project that has a list of web application security threats as data for their research. Their paper analyzes web attacks on a yearly basis from January 2012 to June 2015. It also gathers data on sectorwise types of malware attack. However, it steers clear from pointing to the root cause of any of the attacks. Garg and Singh [10] review web application vulnerability and how attacks become possible in an application. In our previous paper titled “IDS alert classifier based on Big Data and Machine Learning” [11], we presented a novel method to positively identify a broad class of network attacks combining Intrusion Detection Systems (IDS) alarms and system metrics. We also detailed such malwares that are not detected by traditional IDS and the requirement for malware scanners and web application firewalls. The work presented in the current paper targets such web application malware, especially in the context of CMS. It details various attacks, their signature, cause, and the exploit. It also describes a novel solution to detect and prevent them.

3 Current Website Security Measures and Their Limitations Research on identifying vulnerabilities in the deployed web application is making tremendous progress. Various vulnerability scanners that can be used even by a novice user are discussed in [12–14]. Web application firewalls and virus scanners could be used to prevent known attacks. Periodic test against site checkers, like [15] Google safe browsing API, is utilized to make sure that the website is free of known unsafe content. But all these tools rely on prior knowledge attack pattern or an attack signature. Hence, zero-day attacks like the ones that are discussed below are not effectively stopped by these tools.


4 Analysis of Malware Attacks 4.1 Malware Injection Malware is defined as a piece of malicious code written to steal information and disrupt systems. In website security, malicious code is injected into a website by exploiting an already existing vulnerability in the website backend software or through a backdoor.

4.1.1 Injection by Exploiting a Vulnerability

In the case of CMS-based websites, the software stack contains a myriad of components:
• CMS and its extensions
• Language interpreter like PHP
• Web server like Apache or Nginx
• Database like MySQL or MariaDB
• Operating systems like Linux, Windows, etc.

All CMS have had vulnerabilities discovered in them. These vulnerabilities leave around 40 million websites built using CMS open to attacks. Though the level of security has improved in the recent time, the flexibility offered by the plug-ins and extensions in CMS has become an easy target for the hackers. For example, a WordPress plug-in vulnerability was used in the controversial Panama Papers Leak of 2016. The underlying core components like web server, database, and operating system also have known vulnerabilities. Data collated from the National Vulnerability Database of known vulnerabilities as of September 2019 [16] is plotted in Fig. 2.

4.1.2 Injection Through Backdoor

In general, CMS-based websites hosted on popular hosting site provide multiple access options like SSH, SCP, FTP or web-based user interface called cpanel. Apart from the admin user account, each of these access methods might have a different set of credentials. Any one of these methods that forgo the traditional authentication could be used as a backdoor to gain access to the website and its contents. The access credentials are obtained by the hackers either through brute-force attack or methods like phishing emails. These attacks are very prevalent as seen by the 15 million brute-force attacks alone reported in the month of October 2017 [17]. But still the percentage of this injection method is low compared to the other vulnerabilities in the CMS platform and its extension as shown in Fig. 3.


Fig. 2 Known vulnerabilities from the National Vulnerability Database until Sept 2019 [16]

Fig. 3 CMS platform malware injection statistics [17]


4.2 Malware Infection

Once the attacker gains entry, a malicious code block is either inserted into the platform files or added as an individual executable script file. Typically, the main routing code of the web application is hacked to create a redirect that invokes the attacker's script. For example, in Apache, the htaccess file routes the incoming requests by rewriting the request URI to the actual URI based on predefined regular expressions. Attackers rewrite these rules in such a way that their scripts are executed under certain conditions while the website continues to run normally in all other cases. As a result of this tactical selective execution, regular users or website maintainers do not notice the compromise early on. Less than 20% of these sites are listed as dangerous by search engines [18]. Also, the attacker hides the infected code by various methods such as:
• Obfuscation
• Injecting into CMS platform software files
• Following the naming convention of existing software on the website
• Storing it as a binary object in the CMS database.

The injected malicious code could further bring more malicious software onto the website. This malicious code could either target the website, server, and hosting platform, or it could target the users visiting the website. Spam pages, drive-by downloads, and cross-site scripting are the most prevalent attacks targeting the users visiting the website. Figure 4 shows the average statistics of infected content types from 250 CMS websites that we have monitored since April 2017.

Fig. 4 Mean statistics of infected content by type


4.3 Malware Code Execution

Execution of a malware depends on the type of malware. Spam pages malware usually creates a URI redirect for a subset of URIs to point to the content downloaded during the infection phase. Since the infection creates a permanent backdoor, the hacker now has control over the system and keeps refreshing the content. The attackers also create keywords that pollute the pages or increase their relevance so as to get recognized by the search engine robots. Usually, when the search engines find mismatched or known spam-related keywords, they recognize that the site is compromised and flag it for users accordingly. Unfortunately, most of the time that is when the site maintainer finds that the site contains malicious code or data. XSS malware injects custom JavaScript and redirects on the pages. When an unsuspecting user visits the site, it tricks the user into downloading content that exploits the user's system, or it attacks a known vulnerability in the browser. Obfuscated files, on the other hand, cannot be classified as belonging to one category of attack; these are basically data/content-oriented attacks. The backdoor created by the hacker provides a way to execute these obfuscated scripts remotely and pass the payload as standard HTTP protocol data such as the POST, GET or COOKIE arrays. With PHP commands like eval, they can simply execute any command in the operating system with data from the payload. From one of the compromised system logs, we noticed that the attacker was trying to load various dynamic libraries like sqldb.so and zlib.so. The logs were generated only when the commands were not successful. Had the commands returned success messages, these suspicious operations would have gone unnoticed.
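One simple way to surface this execution pattern is to scan the document root for PHP files that feed request data into dangerous functions. The Python sketch below is our own illustration of such a heuristic, not part of the paper's tooling; the document root path is a placeholder.

import re
from pathlib import Path

# Heuristic patterns: dangerous sinks combined with request-data sources or base64_decode.
SINKS = re.compile(r"\b(eval|assert|system|shell_exec|passthru)\s*\(", re.IGNORECASE)
SOURCES = re.compile(r"\$_(POST|GET|COOKIE|REQUEST)\b|base64_decode\s*\(", re.IGNORECASE)

def suspicious_php_files(docroot: str):
    """Yield PHP files that contain both a dangerous sink and a request-data source."""
    for path in Path(docroot).rglob("*.php"):
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue
        if SINKS.search(text) and SOURCES.search(text):
            yield path

if __name__ == "__main__":
    for hit in suspicious_php_files("/var/www/html"):  # hypothetical document root
        print("suspicious:", hit)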

4.4 Malware Regeneration from Remote Triggers When the hacker initially gets access to a website, he is well aware of the fact that vulnerability which he has exploited might get fixed, or the backdoor he has created would get identified and closed. Also, many installations change passwords frequently, thus not letting attacker use the same compromised password for a long time. In most of the recent real-world attacks that were identified and analyzed, we found a hidden component of the attack which is stored in a variety of places out of plain sight. For example, we found a copy of the attack in zip file, with the contents encoded in base64 format and a PHP library code interspersed with the host website. This copy was used to unzip the file and deploy the attack again if the malware gets cleaned up. The unzipped code first looks for the presence of malware, only when it does not exist, it tries to redeploy it. In some cases, the hackers modified the standard HTTP error handling files like 400.php or 500.php to invoke these redeploy scripts. Because of this nature, once a website is compromised, the best advice is to start from scratch and to restore the contents from a known good backup. The downtime and loss of data caused by such rewrite are very expensive to small businesses who are the main users of CMS based websites. This stored copy of attacks is not detected


by any virus scanners or site scanners. On one of the infected sites, we tried scanning with multiple scanners and site checkers, but none of them classified these files as malware. That is because the attackers keep changing the malware format and content. Like zero-day attacks, they are not discovered until someone identifies them, the attack becomes known, and rules are entered in the firewall.
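Because the regeneration step rewrites files that scanners have already certified as clean, a content-hash baseline is one pragmatic way to notice it. The sketch below is a minimal illustration of that idea under our own assumptions (it is not the authors' tooling): record SHA-256 digests of the site's files after a verified-clean restore, then report anything added or modified since.

import hashlib
import json
from pathlib import Path

def snapshot(docroot: str) -> dict:
    """Map every file under docroot to its SHA-256 digest."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in Path(docroot).rglob("*") if p.is_file()
    }

def diff(baseline: dict, current: dict) -> dict:
    """Return files that appeared or changed since the baseline was taken."""
    return {
        path: digest
        for path, digest in current.items()
        if baseline.get(path) != digest
    }

# Usage sketch: take a baseline once, then re-check periodically (paths are placeholders).
# json.dump(snapshot("/var/www/html"), open("baseline.json", "w"))
# changed = diff(json.load(open("baseline.json")), snapshot("/var/www/html"))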

5 Analysis of an Attack

During our research, we came across various attacks while monitoring multiple CMS-based websites using WordPress and Drupal. In this section, we explain one such malware attack and its components. The website unboxedwriters.com was hacked in July 2019. The publisher of the website noticed a significant lag in the website response. Running a virus scanner showed that there were multiple malicious files. Even after an initial cleanup, the virus scanner periodically reported suspicious files. This proved that there was a mechanism that was able to recreate the malware. Upgrading the CMS software and running the site against various site checkers did not fix the issue. Through log analysis and site access analysis, we found the entry point of the attack.

5.1 Entry Point of Attack In the July 2019, the admin user of the website unboxedwriters.com received a phishing email purporting to come from the hosting services stating that their site has reached the load limit and they need to login to make changes. Admin user unknowingly shared the access credentials and exposed the site to the hacker. The email read like this, Dear Bluehost.com client, Domain account unboxedwriters.com has exceeded the limit load available for the existing pay rate plan. Methods of load analysis and elimination: http://my.bluehost.com.15c437027c133cb72c4d75923fdced07.institutoagrope cuario.com/account/42801/limit.html.

5.2 Attack Execution

Following this, multiple FTP and database access accounts were created without the knowledge of the admin. Hidden directories were present in the CMS folder as well as in the main public_html/ directory. Randomly named PHP files were created. These


files contained obfuscated code. The code in each of these files followed a similar template: a short obfuscated PHP wrapper whose decoded form handed incoming request data to an eval() call.

By decoding one of these script files, we found that the core code execution happened in the eval function, which received the _COOKIES array and _POST array as arguments and executed them. The core of the attack was in the data, and the rest of the code simply executed the payload. Since the backend stack was configured to log all command failures, log analysis showed the commands that failed to execute. The commands tried to load various dynamic libraries that support database operations and compression/decompression. They also downloaded more than 1200 spam pages and hosted them on the site. A redirect for these files was added to the htaccess router code. After a complete cleanup of these files and removal of the malicious accounts, the virus scanners and site checkers certified that the website was clean and up to date. But within a few days, the malicious software code resurfaced, leading us to further investigate hidden code and the possibility of remote triggers which did not use any of the previously discussed known malware attack methods. Figure 5 shows the attack sequence and how the attack is replicated from a stored BLOB in the database.

Fig. 5 Lifecycle of the attack


5.3 Attack Regeneration and Remote Triggers Despite removing all malicious content and passing vulnerability scanner check, malicious code manifested again. Upon further research, we identified the script that regenerated the attack. A standard 500.php was modified to include a .ico extension file. The included command was in Unicode, so it was not human-readable at first sight.

Analyzing the file content with a standard file tool showed us that the file is not a standard icon file as the extension indicated; instead, it is a PHP script whose content was obfuscated, with the first few bytes of the file already containing encoded PHP code rather than icon data.

Another PHP script file had the tools required to generate the actual attack code from this file. Further looking into the origin of this .ico file, we found that the attacker had hidden this code as multiple blobs in the WordPress database, with a table named wp_sett, which followed similar naming convention of the other WordPress database tables, thus evading our analysis and all the site checkers and scanners. Once these remnants of the attack were cleaned, regeneration of attack and break-ins ceased.

6 Automated Analysis and Detection

Our approach to automating this analysis resulted in a tool we named Websense. The goal of Websense is to detect any malicious content infecting the website and prevent it from getting deployed. Based on our study across various WordPress installations, we identified key file system and database parameters and their nominal value ranges, and from this we developed a rule base. The implementation hooks into the file system changes of the web application, and any change to the web application is verified against this rule base as well as a content analysis engine based on bigrams [19] with a Markov chain [20]. The resulting verification data is weighted to distinguish between malicious and benign content. Upon detection of malicious content, the corresponding resource is quarantined. Also, an alert with complete violation details is generated and sent to the administrator. The data from running the tests against four infected sites is shown in Table 1.
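The content-analysis step can be sketched as a character-bigram Markov chain: transition probabilities are estimated from known-benign files, and a file whose average log-probability per transition falls well below that of the training data is flagged. The Python code below is a minimal illustration of this idea under our own simplifying assumptions (character-level model, add-one smoothing); it is not the actual Websense implementation.

import math
from collections import defaultdict

class BigramModel:
    """Character-bigram Markov chain with add-one smoothing."""
    def __init__(self) -> None:
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def train(self, text: str) -> None:
        for a, b in zip(text, text[1:]):
            self.counts[a][b] += 1
            self.vocab.update((a, b))

    def log_prob(self, a: str, b: str) -> float:
        row = self.counts[a]
        total = sum(row.values()) + len(self.vocab)  # add-one smoothing
        return math.log((row[b] + 1) / total)

    def score(self, text: str) -> float:
        """Average log-probability per bigram; lower means more anomalous."""
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return 0.0
        return sum(self.log_prob(a, b) for a, b in pairs) / len(pairs)

# Usage sketch with inline strings; real use would train on a corpus of known-benign source files
# and compare each candidate file's score against a threshold tuned on that corpus.
model = BigramModel()
model.train("<?php echo htmlspecialchars($_GET['q'] ?? ''); ?>")
print(model.score("<?php echo date('Y'); ?>"))
print(model.score("eval(base64_decode($_POST['x']));"))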


Table 1 Results from Websense analysis (violations detected across the four infected sites)
• Malicious identifiers: 233 (Site 1); 1736 (Site 2)
• Malicious includes: 5 (Site 1); 1 (Site 2)
• Htaccess bypass: modified to list directory contents; redirect to third party
• Content type mismatch: icon files contained PHP code; zip files named as PHP source code
• Hidden files: icon file made hidden by prefixing “.”; malicious content in hidden system folders
• Irrelevant content: 1300 html files
• Blob content in database: PHP content in database
• Excessive statements per line: 36 assignments in a single line; 135 array operations in a single line
• Ratio of bytes versus new lines in source code: 45 files contained code that exceeded the normal size value
• Injected code in platform files: cryptic PHP code interspersed; index.php contained injected code
• Encoding/decoding of program code: URL-encoded program code; base64/gzinflate of source code; backdoor code added to modify .htaccess as well as to generate code from the _POST and _COOKIE arrays; modified version of the base64_encode function found; gzinflate called from source to inflate the third-party html files
• Arbitrary strings in vulnerable functions: eval functions with arbitrary string arguments; eval function used arbitrary strings and _POST array contents; file_put_contents called with arbitrary strings
• File type/size mismatch: morphed PHP size exceeded the 64 K normal range


7 Conclusion We have discussed the current malware trends in the content management systembased web applications. Over the course of our research, we were able to identify various types of malwares and their attack patterns. In this work, we have explained all the phases of an attack from injection, infection, replication-to-remote triggers. We have dissected one such remote-triggered attack in detail, how the attack evades security measures and its recurring infection pattern. These malware attacks cause maximum damage to unsuspecting users, who are not experts in the security domain. Even to a fairly experienced user, the attacker’s smartness of hiding malicious activity as genuine content and redeploying with the help of hidden attack vectors is evasive. We have also presented our solution named Websense, and its ability to detect and prevent these attacks.

8 Future Work

This study, undertaken to analyze vulnerabilities in popular CMS systems, showed us the sophistication of the attacks and how they evolve to outpace the detection tools. At the same time, we see that the CMS platforms are becoming more and more flexible and user friendly and provide extensibility with plug-in offerings. This increasing attack surface and the constantly evolving attack patterns pose a serious challenge to a rule-based prevention solution. In our current research, we were able to detect a subset of attacks using a Markov chain-based machine learning algorithm with minimal false positives. As a next step, we will further research machine learning algorithms that are better suited to web application data and traffic patterns to bring down the false positives. To aid this solution and to train the algorithms, we will deploy multiple honeypots and also work with other open-source projects to gather attack patterns.

References 1. https://w3techs.com/technologies/overview/content_management/all 2. https://www.contentstack.com/blog/all-about-headless/content-management-systems-historyand-headless-cms 3. https://nvd.nist.gov/vuln/search 4. https://www.akamai.com/us/en/multimedia/documents/state-of-the-internet/soti-securityweb-attacks-and-gaming-abuse-report-2019.pdf 5. M. Meike, J. Sametinger, A. Wiesauer, Security in open source web content management systems. IEEE Secur. Priv. 7(4), 44–51 (2009). https://doi.org/10.1109/MSP.2009.104 6. D. Hauzar, J. Kofron, On security analysis of PHP web applications, in Computer Software and Applications Conference Workshops (COMPSACW ) 2012 IEEE 36th Annual (2012), pp. 577– 582


7. C.A. Contu, E.C. Popovici, O. Fratu, M.G. Berceanu, Security issues in most popular content management systems, in 2016 International Conference on Communications (COMM). Bucharest, Romania. https://doi.org/10.1109/iccomm.2016.7528327 8. H. Trunde, E. Weippl, WordPress security: an analysis based on publicly available exploits, in iiWAS’15: Proceedings of the 17th International Conference on Information Integration and Web-based Applications and Services. Article no. 81. https://doi.org/10.1145/2837185. 2837195 9. D. Kaur, P. Kaur, Empirical analysis of web attacks, in ICISP2015, Nagpur, India, 11–12 Dec 2015, pp. 298–306 10. A. Garg, S. Singh, A review on web application security vulnerability. IJARCSSE 3(1) (2013) 11. C. Kavithamani, R.S. Sankara Subramanian, S. Krishnamurthy, IDS alert classifier based on big data and machine learning, in National Conference on Advanced Computing NCAC 2019. Coimbatore, India 12. R. Akrout, E. Alata, M. Kaâniche, V. Nicomette, An automated black box approach for web vulnerability identification and attack scenario generation. J. Braz. Comput. Soc. 20(1), 1–16 (2014). https://doi.org/10.1186/1678-4804-20-4.hal-00985670 (Springer) 13. J. Fonseca, M. Vieira, H. Madeira, Testing and comparing web vulnerability scanning tools for SQL injection and XSS attacks, in PRDC’07. IEEE CS (2007) 14. M. Vieira, N. Antunes, H. Madeira, Using web security scanners to detect vulnerabilities in web services, in DSN’09 (2009) 15. https://transparencyreport.google.com/safe-browsing 16. https://nvd.nist.gov/vuln/categories 17. https://www.wordfence.com/blog/2017/11/october-2017-wordpress-attack-report/ 18. https://www.sitelock.com/download/SiteLock%20Website%20Security%20Insider%20Q1% 202018.pdf 19. https://en.wikipedia.org/wiki/N-gram 20. https://en.wikipedia.org/wiki/Markov_chain

GSM-Based Design and Implementation of Women Safety Device Using Internet of Things N. Prakash, E. Udayakumar, N. Kumareshan, and R. Gowrishankar

Abstract These days, the safety of women and children is a prime concern of our society, and the count of victims is increasing day by day. The system presented in this paper aims to ensure the security of women and children worldwide. We have used various sensors such as a flex sensor, force sensor and vibration sensor to identify abrupt changes in the movement of the user. We have also used GPS, which identifies the location of the device, and the GSM module is used to send alert communication to guardians, family members and the police station. The proposed Wi-Fi technology-based device helps to continuously monitor the readings of the various sensors as well as the GPS used in the device. There are numerous Android applications for women's security; however, they are less than efficient. A hidden-camera detector is used to capture the image of the accused, and a chloroform sprayer motor is used to incapacitate the assailant. This saves time, and the victim gets help without loss of time. Likewise, for children's security, the framework proposes speed monitoring and location tracking facilities using GPS, GPRS and GSM.

Keywords GPS · GSM · MEMS sensor · Vibration sensor · SMS

N. Prakash (B) · E. Udayakumar · R. Gowrishankar Department of ECE, KIT-Kalaignarkarunanidhi Institute of Technology, Coimbatore, Tamilnadu, India e-mail: [email protected] E. Udayakumar e-mail: [email protected] R. Gowrishankar e-mail: [email protected] N. Kumareshan Department of ECE, Sri Shakthi Institute of Engineering and Technology, Coimbatore, Tamilnadu, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_16


1 Introduction

At present, women compete with men in every sphere of society and contribute half of the progress of our country. Nevertheless, women fear being harassed and harmed, and such harassment cases are growing steadily, so it is essential to guarantee the security of women. With the proposed method, women are able to do their work without any fear, in a secure manner [1]. The proposed model contains different sensors that measure various parameters continuously. GSM is a relatively new and smart enabling concept: by using GSM-based technology, guardians, relatives and the police can monitor and track the values of the various sensors and the position of the device. The device is wearable and easy to carry [2]. In the past, women faced a variety of problems when going outside, which created security issues. The significant concern haunting every young woman is whether she will be able to move freely on the streets, even at odd hours, without worrying about her safety. We propose an idea that changes the way everybody thinks about women's safety. A day when the media reports a greater proportion of women's accomplishments rather than harassment will be an accomplishment in itself. Since we (people) cannot react suitably in critical conditions, the need for a device that automatically assists and rescues the victim is the aim of our idea in this work. We propose a microcontroller device using GSM, which continuously communicates with a smartphone that accesses the Web [3].

2 Related Work

The model proposed in [1] acts as a security system for women and can be carried all over the world. It uses various sensors such as a heartbeat sensor, temperature sensor and accelerometer to perceive the heartbeat, temperature and sudden changes in the movement of the user. It also uses GPS, which identifies the location of the device, and the GSM module in the model is used to send alert messages to guardians, relatives and the police station. The work proposes an IoT (Internet of Things)-based device which helps to continuously monitor the readings of the various sensors and the GPS used in the device. The purpose of the effort is to offer security to women: if any problem occurs, the woman can press the button carried on the body, the GPS is enabled, and an SMS is automatically sent to the police and her relatives instantly [1]. Reference [4] proposes an idea that changes the way everybody thinks about women's safety; a day when the media reports a greater proportion of women's accomplishments rather than harassment will be an accomplishment in itself. Since we (people) cannot react appropriately in critical conditions,


the need for a device that automatically assists and rescues the victim is the aim of that work. It proposes a device that is a combination of various gadgets; the hardware includes a wearable 'smart band' which continuously communicates with a smartphone that accesses the Web. The application is customized and loaded with all the fundamental information, which combines human behaviour and responses to various conditions like anger, fear and stress. The approach enables help to be obtained rapidly from the police as well as from the public in the near range, who can reach the victim with great precision [4]. Another smart security solution for women using IoT likewise aims to change the way everybody thinks about women's safety; since people cannot react properly in critical conditions, it again targets a device that automatically assists and rescues the victim. That framework plans a remote embedded device, namely an Arduino-based unit for women, that serves the purpose of raising alarms, communicating over secure channels and capturing the picture using an electronic camera. This saves time so that the victim gets help without loss of time. Additionally, for children's security, the framework proposes speed monitoring, and the location tracking facilities use GPS, GPRS and GSM. The framework comprises a transport unit, which is used to recognize the path of the bus by using GPS. For tracking the vehicle, the haversine and trilateration calculations are used; accordingly, by using GSM, alert messages are sent to the parents and the vehicle owner. The framework has been developed as a web-based, data-driven application that provides the useful information [5].

3 System Design

The flex and force sensors are used to detect abnormal conditions. Bluetooth covers only a short distance; a microcontroller is used in the framework, and it is rather difficult to implement the Wi-Fi module. A woman may be harmed in such a situation: women are subject to abuse inside and outside the home, whether on streets, trains, taxis, in schools and so on. Women's empowerment in the nation can be brought about once their safety and security are guaranteed, whether at home, in public places or while travelling. An LED is used to implement the alert unit. There is no monitoring framework for young women, which creates many problems for them, and there is no security mechanism to shield them from misconduct. A free mobile application is used by women to alert the people who can help when they are in trouble. An Arduino UNO is used as the main control unit. The application is customized and loaded with all the necessary data, which includes temperature, heartbeat and also the victim's movement [6]. This generates a signal which is


transmitted to the moved phone. From this contraption, we can cause the brief to continue forward the circumstance [7] (Fig. 1). The proposed outline is to structure a convenient device which looks like a smart gadget. The power sensor used to discover body’s unusual movement, and MEMS used to discover the body position used to discover whether ladies are under strange conditions [8]. On the off chance that the individual is in emergency implies she can press the emergency switch implies the gadget will get enacted naturally and through Wi-Fi technology, and the alert unit will be actuated. MEMS sensor is utilized to avoid and protect the ladies from assaults, and flex sensor and vibration sensor are utilized to imply the police whether she is in perilous through Wi-Fi module. In emergency situation, it will send the message including minute region to the police,

Fig. 1 Proposed system unit


through the transmitter module and to the enrolled numbers via the Wi-Fi module. A sprayer motor is used to spray chloroform at the person who harasses the child or woman, and the concealed camera is used to capture the image of the harasser [9]. The user does not require a smartphone, unlike various applications that have been developed previously, because GSM is used here. At present, work is under way to embed the device in accessories, a mobile phone or another carrier such as a belt. It can play a critical [10] role in proposed initiatives where all the police headquarters are connected and share criminal records, crime investigation cases and so on; it is intended as an all-in-one system. GPS is used to trace the person when they move anywhere during an emergency: the GPS tracking feature tracks the user live while on the move once the emergency button has been triggered, and the device works without Internet accessibility. The concealed camera is used to capture the image of the person who harasses women, and Bluetooth is an open wireless technology standard for transmitting data between fixed and mobile electronic devices over short distances; a variety of digital devices use Bluetooth, including MP3 players, mobile and peripheral devices and PCs (Fig. 2).

(A) GPS
GPS stands for global positioning system and was developed by the US Department of Defense as a worldwide navigation and positioning facility for both military and civilian use. It is a space-based radio-navigation system comprising 24 satellites and ground support. GPS provides users with precise information about their position and speed, as well as the time, anywhere on the planet and in all weather conditions [11]. Navigation in three dimensions is the primary function of GPS. Navigation receivers are made for aircraft, ships, ground vehicles and for hand carrying by individuals. GPS provides specially coded satellite signals that can be processed in a GPS receiver, enabling the receiver to compute position, velocity and time. Capable GPS receivers can compute their position anywhere on earth to within one hundred meters and can update their position more than once per second. Of course, other factors such as terrain and atmospherics


can affect the GPS signals. In spite of this, however, an accuracy of one hundred meters for GPS is routinely exceeded [12].

Fig. 2 MATLAB unit

(B) Buzzer
A buzzer takes some kind of input and emits a sound in response to it [13]. It uses different means to produce the sound. A buzzer or beeper is a signalling device; the word 'buzzer' comes from the rasping noise that buzzers made when they were electromechanical devices, operated from stepped-down AC line voltage at 50 or 60 cycles. Other sounds commonly used to indicate that a button has been pressed are a ring or a beep.

(C) Vibration Sensor
Vibration sensors are used in various tasks, machines and applications. Whether you are trying to measure the speed of a vehicle or to check the intensity of an approaching earthquake, the device you are likely using can be viewed as a vibration sensor. Some of them work on their own, and others require their own power source. Different machine operating conditions with respect to temperature limits, magnetic fields, vibration range, frequency range, electromagnetic compatibility (EMC) and electrostatic discharge (ESD) conditions, and the required signal quality, create the requirement for a variety of sensors.

4 Results and Discussion

The program is written using Embedded C and compiled using the Arduino IDE software, and the different sensor circuits were designed and interfaced to the Arduino controller board. Furthermore, the GPS unit is used to obtain the location of the accused and send it to the registered phone number and also to the registered email id (Fig. 3). From the outcome, it is clearly seen that this framework recognizes the accurate location of the woman with the help of GPS and sends the information to the respective authority without any delay. In addition, this is useful for improving women's security.

5 Conclusion

In this work, we have proposed a framework for the security of women and children. The project demonstrates a wireless technique that raises alarms and communicates over a secure medium. The data are sent to the registered telephone number along with the image link. Speed monitoring for children's and women's security can also be carried out by using the GPS tracking system. This unit will identify the travel routes.


Fig. 3 Overview of hardware module

The framework uses the haversine and trilateration calculations for tracking the transport. Alert messaging is done on the registered telephone numbers, and the captured picture of the criminal is sent to the registered email id. At present, work is under way to embed the device in jewellery, a mobile phone or another carrier such as a belt.
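For reference, the haversine great-circle distance mentioned above can be computed from two GPS fixes as in the short Python sketch below. This is a generic illustration of the formula, not code from the proposed device, and the example coordinates are arbitrary.

import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two GPS fixes, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: straight-line distance between two fixes (Coimbatore to Chennai, roughly 430 km).
print(round(haversine_km(11.0168, 76.9558, 13.0827, 80.2707), 1))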

References 1. G. C. Harikiran, K. Menasinkai, S. Shirol, Smart security solution for women based on Internet Of Things(IOT), in 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT) (Chennai, 2016), pp. 3551–3554 2. R. Velayutham, M. Sabari, M. Sorna Rajeswari, An innovative approach for women and children’s security based location tracking system, in International Conference on Circuit, Power and Computing Technologies (2016) 3. P. Dhole, Mobile tracking application for locating friends using LBS. Int. J. Innov. Res. Comput. Commun. Eng. 1(2) (2013) 4. P. Vetrivelan, et al., A NN based automatic crop monitoring based robot for agriculture, in The IoT and the Next Revolutions Automating the World (IGI Global, 2019), pp. 203–212 5. K. Valarmathi, Android based women tracking system using GPS, GSM. Int. J. Res. Appl. Sci. Eng. Technol. 4(4) (2016)


6. E. Udayakumar, P. Vetrivelan: PAPR reduction for OQAM/OFDM signals using optimized iterative clipping and filtering technique, in Proceedings of IEEE International Conference on Soft-Computing and Network Security (ICSNS’15) (SNS College of Technology, Coimbatore, 2015), p. 72 7. P. Sunil, Smart intelligent security system for women. IJECET 7(2) (2016) 8. E. Udayakumar, V. Krishnaveni, Analysis of various interference in millimeter-wave communication systems: a survey, in Proceedings of 10th IEEE International Conference on Computing, Communication and Networking Technologies (ICCCNT 2019) (Indian Institute of Technology Kanpur, Uttar Pradesh, 2019) 9. S. Santhi, E. Udayakumar, SoS emergency ad-hoc wireless network, in Computational Intelligence and Sustainable Systems (CISS). EAI/Springer Innovations in Communications and Computing (2019), pp. 227–234 10. A. Maharajan, A survey on women’s security system using GSM and GPS. Int. J. Res. Comput. Commun. Eng. 5(2) (2017) 11. P. Vetrivelan et al., Design of surveillance security system based on sensor network. Int. J. Res. Stud. Sci. Eng. Tech. 4, 23–26 (2017) 12. B. Gowri Predeba, et al., Women security system using GSM and GPS. Int. J. Adv. Res. Trends Eng. Technol. 3(19) (2016) 13. B. Chougula, Smart girls security system. Int. J. Appl. Innov. Eng. Manag. 3(4) (2014)

A Novel Approach on Various Routing Protocols for WSN E. Udayakumar, Arram Sriram, Bandlamudi Ravi Raju, K. Srihari, and S. Chandragandhi

Abstract Reducing the vitality utilization of accessible assets is as yet an issue to be comprehended in wireless sensor networks (WSNs). Numerous sorts of existing directing conventions are created to spare power utilization. In these conventions, bunch-based steering conventions are observed to be more vitality effective. A bunch head is chosen to total the information got from root hubs and advances these information to the base station in group-based steering. The choice of group heads ought to be productive to spare vitality. In Clustering LEACH convention, dynamic grouping for the productive determination of bunch heads has been utilized. The steering convention works effectively in enormous just as little regions. It performed NS2 recreations to watch the system throughput, vitality utilization, arrange lifetime and the quantity of group heads. It has been done in Multihop Leach protocol (M-LEACH), which remaining vitality and separation of hub from BS are utilized as parameters for CH choice. It has been done that the P-LEACH as far as system lifetime and took less vitality devoured when the measure of information to move to BS. On the off chance that most extreme number of hubs is bursting at the seams with time, demonstrates the system lifetime. E. Udayakumar (B) Department of ECE, KIT-Kalaignarkarunanidhi Institute of Technology, Coimbatore, Tamilnadu, India e-mail: [email protected] A. Sriram · B. Ravi Raju Department of IT, Anurag Group of Institutions, Hyderabad, India e-mail: [email protected] B. Ravi Raju e-mail: [email protected] K. Srihari Department of CSE, SNS College of Technology, Coimbatore, Tamilnadu, India e-mail: [email protected] S. Chandragandhi Department of CSE, JCT College of Engineering and Technology, Coimbatore, Tamilnadu, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_17


Keywords LEACH-C protocol · WSN · P-LEACH, M-LEACH protocols · NS2

1 Introduction

Routing in a wireless sensor network is of central importance, since the routing layer is in charge of maintaining the routes in the network, forwarding data and guaranteeing dependable multi-hop communication. The primary requirement of a wireless sensor network is to prolong network energy efficiency and lifetime. A wireless sensor network consists of spatially distributed autonomous sensors that monitor physical or environmental conditions, for example temperature, sound, vibration, pressure, motion or pollutants, and cooperatively pass their data through the network to a main location or sink where the data can be observed and analyzed [1]. A sink or base station acts as an interface between the users and the network; one can retrieve the required information from the network by injecting queries and gathering results from the sink. Normally, a wireless sensor network contains a large number of sensor nodes, and the sensor nodes can communicate among themselves using radio signals [2]. A wireless sensor node is equipped with sensing and computing devices, radio transceivers and power components. The individual nodes in a wireless sensor network (WSN) are inherently resource constrained: they have limited processing speed, storage capacity and communication bandwidth. After the sensor nodes are deployed, they are responsible for self-organizing an appropriate network infrastructure, normally with multi-hop communication between them. Then the on-board sensors start gathering the information of interest. Wireless sensor devices also respond to requests sent from a 'control site' to perform specific instructions or provide sensing samples [3]. In the M-LEACH protocol [4], dynamic clustering is used for the efficient selection of cluster heads, and the routing protocol works efficiently in large as well as small areas. For the selection of an optimal number of cluster heads, it partitions a large sensor field into rectangular clusters; these rectangular clusters are then further grouped into zones for efficient communication between the cluster heads and a base station. NS2 simulations were performed to observe the network throughput, energy consumption, network lifetime and the number of cluster heads [5]. The goal is to implement an optimal cluster-based routing protocol for wireless sensor networks. The proposed algorithm aims to provide a higher throughput, an increased network lifetime and, overall, lower energy consumption compared with other protocols. Reducing the energy consumption of the available resources is still a problem to be solved in wireless sensor networks (WSNs). Many kinds of existing routing protocols have been developed to save power consumption, and among them, cluster-based routing protocols are found to be more energy efficient. In cluster-based routing, a cluster head is chosen to aggregate the data received from the root nodes and forward these data to the base station. The selection of cluster heads should be efficient in order to save energy [6].


2 System Design

In the M-LEACH protocol, dynamic clustering is used for the efficient selection of cluster heads, and the routing protocol works efficiently in large as well as small areas. For the selection of an optimal number of cluster heads, a large sensor field is divided into rectangular clusters; these rectangular clusters are then further grouped into zones for efficient communication between the cluster heads and a base station. NS2 simulations were performed to observe the network stability, throughput, energy consumption, network lifetime and the number of cluster heads [5]. The P-LEACH routing protocol outperforms LEACH-C and I-LEACH in large areas. In P-LEACH, the residual energy and the distance of a node from the BS are used as parameters for CH selection. To save energy, the steady-state operation of a node is started only if the value sensed by the node is greater than a set threshold value; the threshold value is set by the end user at the application layer. LEACH-C [7] is then analyzed qualitatively and quantitatively. It has been found that M-LEACH performs well in terms of network lifetime and consumes less energy for the amount of data to be moved to the BS. The network lifetime is indicated by the maximum number of nodes that remain alive over time [8]. In M-LEACH, 90 nodes are alive at 50 s; if the maximum number of nodes stays alive for a long time, the network lifetime is increased. It has been found that the first node dies at round 35, half of the nodes are alive at round 250, the last node dies at round 1000, the network settling time is 1.9 s and the protocol overhead is 16 bytes. The result is that the network settling time is increased because of the maximum number of nodes being alive and the low energy consumption [9]. The objective of this paper is to find an energy-efficient routing protocol for wireless sensor networks. The proposed protocol intends to provide a higher throughput, a higher network lifetime and, overall, lower energy consumption compared with other protocols. Figure 1 shows the block diagram of the proposed P-LEACH protocol [10], in which the base station is connected between the clusters. Clustering is a procedure of grouping nodes using an algorithm to perform specific tasks efficiently as per the requirements. Clustering can also be used to divide the topology into sub-areas based on certain criteria, for example, that the whole area should be covered, minimum energy consumption, maximum lifetime and so forth [11]. In P-LEACH, all nodes are spread randomly in the network, and these nodes are divided into clusters. In this protocol, the cluster head is selected randomly so as to distribute the energy load over the whole network. The cluster head near the sink node transmits the data frame; the cluster head that transmits the data from all nodes to the sink consumes more energy than the other nodes [12]. This makes the energy consumption of the whole network uneven, and it affects the lifetime of the network: the energy drain is borne by the cluster head, and it can reduce the lifetime of the network. The operation of LEACH is divided into two phases, and the chosen cluster head is the manager of the cluster [13].


Fig. 1 Block diagram of proposed system P-LEACH protocol

assign a time slot to each cluster member for periodic data transmission to the CH. The CH then aggregates the data to reduce redundancy among correlated values and transmits the data directly to the BS. The main operation of LEACH is divided into two phases [14]: the set-up phase, which consists of CH selection and cluster formation, followed by the steady-state phase, in which the selected CH performs data gathering, aggregation and delivery to the BS. During the set-up phase, a sensor node chooses a random number between 0 and 1; if this random number is not greater than the threshold value T(n), the sensor node is selected as a cluster head [15]. In Fig. 1, the empty circles represent the nodes and the dark spots represent the cluster heads [10]. In the pseudocode of P-LEACH, all the nodes N send their location (x, y) and energy E information to the base station BS [10]. Step 1: The base station receives the location L and energy E from nodes N1, N2, N3, N4, N5, N6 and so on of each cluster. Step 2: For selecting the cluster head, the BS chooses the node Ni with the maximum remaining energy Emax as the cluster head CH for every cluster. Step 3: Once the BS chooses the cluster head, it informs each node of the selection of the CH. Each node sends an acknowledgment message back to the BS, informing the BS that the information has been received successfully [16]. Step 4: A CH with maximum energy and minimum distance from the base station is chosen as the leader, so that it can contact the BS directly. Step 5: The data is moved along the path drawn. First, the data of every node is sent to the CH of its cluster. The CH then gathers the data from all of its nodes and forwards it to the next CH in the chain [10]. Step 6: An IF loop is invoked if the energy of the cluster goes below the energy level. If the CH fails to maintain the same maximum energy, at


that point, the node with the second highest maximum energy is selected and declared as the CH as per Step 2.
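The threshold T(n) itself is not written out above; the sketch below assumes the standard LEACH form T(n) = p / (1 − p · (r mod 1/p)) for nodes that have not served as cluster head in the last 1/p rounds, with p the desired cluster-head fraction and r the current round. It is an illustrative Python sketch, not the authors' simulation code.

# Hypothetical sketch of the standard LEACH cluster-head election rule.
# p is the desired cluster-head fraction, r the current round; a node that
# has not been CH in the last 1/p rounds becomes CH when its random draw
# falls below the threshold T(n).
import random

def leach_threshold(p, r):
    return p / (1 - p * (r % int(1 / p)))

def elect_cluster_heads(node_ids, p=0.1, r=0, recent_chs=frozenset()):
    heads = []
    for n in node_ids:
        if n in recent_chs:          # nodes that were CH recently sit this round out
            continue
        if random.random() <= leach_threshold(p, r):
            heads.append(n)
    return heads

print(elect_cluster_heads(range(20), p=0.1, r=3))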

3 Results and Discussion The simulation results of M-LEACH for a wireless sensor network are discussed here. The network simulator NS2 is used for the simulation and layout process, with the input parameters listed in Table 1. Figure 2 shows the assignment and checking of 20 nodes, where each node is checked by comparing its time delay with the maximum threshold value; every node may have a different time delay to transmit a packet. In this work, node 0 is assigned as the base station and the remaining nodes are sink nodes. The sinks are located at some place in the network and are aware of the network topology. Every node generates sensing data periodically, and all nodes work together to forward the packets that contain the data hop by hop toward a sink [15]. Figure 3 shows the data transmission. While data transmission is taking place, the base station collects the information from all sinks, and the dropped packets fall within the sensor area. On receiving a packet, the base station decodes it and thereby discovers the original sender and the packet sequence number. Figure 4 shows the base station collecting data from all nodes. Using the LEACH-C algorithm during data transmission, the base station identifies the sink with the minimum distance and maximum energy and informs all the nodes. The objective of this work is to find an energy-efficient routing protocol for wireless sensor networks. The proposed algorithm aims to provide a higher throughput, an increased network lifetime and an overall lower energy consumption compared with other protocols.

Table 1 Input parameters

Parameters           Value
Network area (m)     100 × 100
Number of nodes      20
Location of sink     50, 50
Cluster radius       30 m
Sensing radius       10 m
Initial energy       0.5 J
ETX                  50 nJ
ERX                  50 nJ
Number of rounds     6000
Routing protocol     M-LEACH


Fig. 2 Assigning and checking 20 nodes

3.1 Energy Consumption Figure 5 shows the energy consumption of M-LEACH. The percentage energy consumed by a node is calculated as the ratio of the energy consumed to the initial energy. Finally, the percentage energy consumed by all the nodes in a scenario is calculated as the average of the individual energy consumptions of the nodes.

3.2 Throughput Figure 6 shows the graph of time versus the number of dead nodes. Throughput is defined as the average rate of successful packet delivery and is the most significant parameter for examining the performance of the network. To improve throughput, errors should be corrected instead of retransmitting the packet: if the error is corrected, there is no need to retransmit the packet; if the retransmission traffic is reduced, congestion does not occur; and if there is no congestion, there is no packet loss due to errors. If an increasing number of packets enter the network, the performance of the network degrades, which leads to congestion, which in turn leads to packet


Fig. 3 Data transmission

loss. If there is an error-correction procedure that corrects the error rather than resorting to retransmission, it improves the throughput [3]. Figure 6 shows the comparison of P-LEACH and LEACH-C for 20 nodes, i.e., the relation between the number of dead nodes in P-LEACH and LEACH-C. The results demonstrate that for 180 rounds, P-LEACH has the minimum number of dead nodes, around 82 [10], whereas for LEACH-C the numbers of dead nodes are 132 and 175, respectively. P-LEACH is therefore more sustainable and conserves energy. The reason for the smaller number of dead nodes in our approach is that we integrated the best features of both P-LEACH and LEACH-C [17].

3.3 Network Lifetime One of the most commonly used definitions of network lifetime is the time at which the first network node runs out of energy to send a packet, because losing a node could mean that the network loses some of its functionality [15]. Equivalently, it is the time for the first node, or a specific percentage of sensor nodes, to run out of power, or the time interval from the start of operation of the sensor network until the death of the first node.


Fig. 4 Base station to collect data

Figure 7 shows time versus throughput. Throughput refers to how much data can be moved from one location to another in a given amount of time. It is used to measure the performance of hard drives and RAM, as well as Internet and network connections. For example, a hard drive that has a maximum transfer rate of 100 Mbps has double the throughput of a drive that can only transfer data at 50 Mbps. Likewise, a 54 Mbps wireless connection has about five times as much throughput as an 11 Mbps connection. However, the actual data transfer rate may be limited by other factors, such as the Internet connection speed and other network traffic. It is therefore good to remember that the maximum throughput of a device or network may be significantly higher than the actual throughput achieved in everyday use [18].

4 Conclusion In this paper, an optimal cluster-based routing protocol for wireless sensor networks is studied using the LEACH-C and M-LEACH routing protocols for improving energy efficiency in wireless sensor networks. The performance of P-LEACH is compared with the M-LEACH and LEACH-C protocols. Through simulation we observed that M-LEACH performs much better than LEACH-C and P-LEACH in terms of network lifetime,


Fig. 5 Energy consumption of P-LEACH

the number of dead nodes and energy consumption. MATLAB is used for evaluating the performance of the protocol. Based on the simulation results, we verified that M-LEACH performs better than LEACH-C and P-LEACH in terms of energy and network lifetime. The simulation results validate that the proposed approach could extend the network lifetime for WSN applications.


Fig. 6 Comparison of P-LEACH and LEACH-C throughput for 20 nodes

Fig. 7 Comparison of M-LEACH and LEACH-C network life time for 20 nodes


References 1. M. Al-Otaibi, H. Soliman, Efficient geographic based routing protocols with enhanced update mechanism. Sens. Netw. 8, 160–171 (2011) 2. M. Chen et al., Spatial-temporal based energy reliable routing protocol in WSN. Int. J. Sens. Net 5, 129–141 (2012) 3. Y. Chen et al., A multipath QoS protocol in WSN. Sens. Netw. 7(4), 207–2216 (2010) 4. C.T. Sony, C.P. Sangeetha, C.D. Suriyakala, Multi-hop LEACH protocol with modified cluster head selection and TDMA schedule for wireless sensor networks. 2015 Global Conference on Communication Technologies, 2015, pp. 539–543 5. R. Biradar et al., Classification of routing based protocols in WSN. Spec. Comput. Secur. Syst. 4(2), 704–711 (2009) 6. S. Ehsan, B. Hamdaoui, A survey on energy with QoS assurances for sensor net. IEEE Comm. Surv. Tuts 14(2), 265–278 (2011) 7. M. Tripathi, M.S. Gaur, V. Laxmi, R.B. Battula Energy efficient LEACH-C protocol for Wireless Sensor Network. Third International Conference on Computational Intelligence and Information Technology, Mumbai, 402–405 (2013) 8. M.B. Haider et al., Success guaranteed in planar nets in wireless sensor communication. Sens. Net 9, 69–75 (2013) 9. R. Yadav, S. Varma, N. Malaviya, A survey of MAC protocols for WSN. UbiCC J. 4, 827–833 (2009) 10. A. Razaque, M. Abdulgader, C. Joshi, F. Amsaad, M. Chauhan, P-LEACH: Energy efficient routing protocol for wireless sensor networks. 2016 IEEE Long Island Systems, Applications and Technology Conference, Farmingdale, NY, 2016, pp. 1–5 11. J. Liu, X. Hong, An online energy routing protocol of traffic load prospects in WSN. Sens. Netw. 5, 185–197 (2009) 12. H. Zhou, D. Luo et al., Modeling of node energy consumption for WSN. WSN 3, 18–23 (2011) 13. G. Anastasi et al., Energy conservation in wireless sensor networks: a survey. Ad Hoc Net 7(3), 537–568 (2009) 14. D. Ye, F.G.S. Zhong, L. Zhang, Gradient broadcast: a robust data delivery protocol for scale sensor net. Wirel. Net 11, 285–298 (2005) 15. S. Santhi, et al., SoS emergency ad-hoc wireless network, in Computational Intelligence and Sustainable Systems (CISS). EAI/Springer Innovations in Communication and Computing (2019), pp. 227–234 16. L. Almazaydeh et al., Performance evaluation of routing protocols in WSN. Comput. Inf. Tech. 2, 64–73 (2010) 17. E. Udayakumar, P. Vetrivelan, A neural network based automatic crop monitoring robot for agriculture, in The IoT and the Next Revolutions Automating the World. IGI Global, pp. 203–212 (Chapter 13) 18. E. Udayakumar, P. Vetrivelan, Design of smart surveillance system based on WSN. Int. J. Res. Stud. Sci. Eng. Tech. 4, 23–26 (2017)

Fraud Detection for Credit Card Transactions Using Random Forest Algorithm T. Jemima Jebaseeli, R. Venkatesan, and K. Ramalakshmi

Abstract In these days, credit card fraud detection is a major concern in the society. The use of credit cards in e-commerce sites and various banking sites has been increased rapidly in recent times. As modernization will have both positive and negative impacts, the use of credit cards in online transactions has made it simple; likewise, it also led to the increase of the number of fraud transactions. As part of the activities happening, it is always advised for the e-commerce sites and the banks to have automatic fraud detection system. Credit card fraud might result in huge financial losses. While look for the solutions for credit card frauds that are happening, machine learning techniques provide favorable solutions. The proposed system uses a random forest application in solving the problem and to attain more accuracy when compared to the other algorithms used till now. All the basic classifiers have the same weight but random forest algorithm has relatively high and others have relatively low weights because of the randomization of bootstrap sampling of a making decision and selection of attributes cannot guarantee that all of them have the same stability in decision making. Keywords Decision tree · Fraud detection · Random forest

1 Introduction In our everyday life, various transactions are done through credit card payments, cardless transactions like Google Pay, PhonePe, Samsung Pay, and PayPal. There is an ongoing concern in recent days which is fraud detection, and it leading to the great loss of money every year. If the fraud continues this way, it is said that by the year 2020, it will reach double digits. Nowadays, the presence of the card isn’t physically required to finish the exchange which is prompting increasingly more extortion exchanges [1]. Fraud detection has an emotional impact on the economy. T. Jemima Jebaseeli (B) · R. Venkatesan · K. Ramalakshmi Department of Computer Science and Engineering, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu 641114, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_18


In this way, fraud detection is fundamental and vital. Financial institutions have to employ various fraud detection techniques for tackling this problem [2, 3]. But when given time the fraudsters find ways to overcome the techniques established by the company holders. Despite all the preventive methods taken by the financial institutions and strengthening of law and government putting their best efforts to eradicate fraud detection, fraud detection continues to rise and it remains as a major concern in the society [4, 5]. Credit cards are generally utilized in the improvement of the Internet business and furthermore portable applications and primarily in the online-based exchanges. With the help of the credit card, the online transactions and online payment are easier and convenient for usage [6, 7]. Fraud transactions have a great influence on enterprises [8]. Machine learning techniques have been widely used, and it has become very important in many areas where spam classifiers protect our mail id. The fraud detection systems learn the features of extraction and helps in controlling the fraud detection.

2 Related Works Transactional fraud detection based on artificial intelligence techniques for card transactions has attracted considerable attention in the research world. Data is available in bulk and has to be preprocessed and analyzed in order to derive suitable conclusions from it. It is often said that the best results must be derived from the available data using different algorithms. A few of the methods that have been proposed so far are discussed here.

2.1 Kernel-Based Hash Functions In general, the hashing technique is introduced to construct a group of hash functions. These assist in mapping the high-dimensional data to lower-dimensional representations, and this procedure is carried out within the hash code space [9]. The main advantage of this process is that it speeds up nearest-neighbor scanning in a huge data set when searching for the nearest fraud records, which contain many fraudulent examples. The kernel plays an important role in constructing all the basic hash functions, and it has been proved to tackle the linearly inseparable problem:

h(x) = \mathrm{sgn}\Bigg( \sum_{j=0}^{m} k\big(x_{(j)}, x\big)\, a_j - b \Bigg)    (1)

Here, h denotes the hash function, k denotes the kernel, and x( j) . . . x(m) denotes the sample training set. After obtaining these values, KSH model is trained with labeled records. For every tested dataset, first map the hash function with KSH


model and it helps in finding out the most similar training samples that are labeled as fraudulent. In most of the cases, it uses k-nearest neighbor in fraud detection.
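As an illustration of Eq. (1), the sketch below evaluates one hash bit with an RBF kernel assumed for k; the anchor samples x_(j), the weights a_j and the bias b would normally come from KSH training and are random placeholders here.

# Hypothetical evaluation of the kernel-based hash function in Eq. (1):
# h(x) = sgn( sum_j a_j * k(x_(j), x) - b ), with an RBF kernel assumed for k.
import numpy as np

def rbf_kernel(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def ksh_hash_bit(x, anchors, a, b):
    s = sum(a_j * rbf_kernel(x_j, x) for x_j, a_j in zip(anchors, a)) - b
    return 1 if s >= 0 else -1

anchors = np.random.randn(8, 4)   # m anchor (training) samples, placeholder values
a = np.random.randn(8)            # learned weights, placeholder values
b = 0.0
print(ksh_hash_bit(np.random.randn(4), anchors, a, b))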

2.2 Bayesian Network A Bayesian network is used to build a directed acyclic graph, together with a conditional probability distribution defined over that graph [10]. Consider four random variables A, B, C and D. Given the marginal joint probabilities for each pair of variables, that is the probabilities of the form P(A, B), P(A, C) and so on, and the conditional probability P(A, B | C, D), the probability P(A, C | B, D) is calculated as

P(A, C \mid B, D) = \frac{P(A, B, C, D)}{P(B, D)} = \frac{P(A, B \mid C, D)\, P(C, D)}{P(B, D)}    (2)

2.3 Clustering Model Clustering models identify records and group them according to the cluster to which they belong. Clustering models are closely related to data that is centered on distribution models. While performing cluster analysis, the set of data is first partitioned into clusters. Clustering is also used in recognition applications. For detecting credit card fraud, the clustering algorithm is a simple and highly effective algorithm compared with other machine learning algorithms [8]. The k-means clustering technique is an unsupervised clustering model that is mainly used for grouping particular classes in a data set. Its objective function is

J = \sum_{j=1}^{k} \sum_{n \in S_j} \left\| x_n - \mu_j \right\|^2    (3)

Here, x_n is the vector representing the nth data point, and μ_j is the centroid of the data points in S_j.
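A minimal sketch of applying k-means to transaction records with scikit-learn's KMeans is shown below; the feature matrix and the number of clusters are placeholders, and the fitted model's inertia_ corresponds to the objective J in Eq. (3).

# Minimal k-means sketch for grouping transaction records (Eq. 3).
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(1000, 5)              # stand-in for transaction features
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_                      # cluster assignment per transaction
inertia = km.inertia_                    # value of the objective J in Eq. (3)
print(labels[:10], inertia)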


Fig. 1 Architecture diagram of the proposed system (start → credit card data set → data exploration → classifier selection → random forest algorithm → performance analysis → stop)

3 The Proposed Random Forest Technique Random forest algorithm first delivers a forest, and it makes them randomly; the forest is built as a collaborative decision tree which is otherwise known as the bagging method. The algorithm is dependent upon the quality of the discrete trees and furthermore the correlation between various trees. It shows the distinction of various information factors that simultaneously permit a high number of considerations to contribute to prediction. The algorithm works well with an insignificant amount of data. Aggregating un-correlated trees are the main concept that makes the random forest algorithm better than the decision trees. The main idea is to create several model trees and make an average of these trees to create a better random forest.

3.1 Implementation of Random Forest Algorithm The procedure of the random forest algorithm execution is done in the below diagram. As shown in Fig. 1, there are several steps involved; first, there is a requirement to gather the information and to store the information. The gathered information are in the form of data set in an excel sheet. In data exploration, the entire data set is checked and removed the unnecessary data that is present. However, the data which is preprocessed is further treated using a random forest algorithm in two ways by using the train data set and then using the test data set. The attained results are verified as legal and fraudulent transaction process.

3.1.1

Collection of Data Sets

Here, the first step is to collect the data sets. The data sets can be collected from various methods like crawling or application program interface. The data sets must contain attributes like name of the customer, customer email address, card number, payment method, customer mobile number, bank account number, and pin number. After collecting the data sets from the above attributes, the data set is used for


performing analysis. The primary distinction between the training data set and the test data set is that the training data set is labeled; however, the test set is unlabeled. So at first, the data set is trained by the regression analysis and then it is been tested by the random forest algorithm.

3.1.2

Analysis of Data

After preparing the data, analysis is performed using various algorithms. This data has a set of functions for training the data and creating classification predictive models. Random forest algorithm is used for grouping the data sets, and it is divided into training set and rest as test sets. It is composed of various tools such as splitting of data and data preprocessing; which is done by using the resampling method.

3.1.3

Reporting Results

After the above stages, the complete analysis of data is done and the results are produced by using random forest algorithm. Hence, the random forest algorithm is performed by the classification to obtain the results. The result is to gain more accuracy in fraud detection. Random forest algorithm has nearly the same hyperparameters as a decision tree or a bagging classifier. Fortunately, it is not required to combine a decision tree with a bagging classifier. Random forest algorithm also deals with regression tasks. The algorithm adds additional randomness to the model while growing the trees. Therefore, only a random subset of the features is taken into consideration by the algorithm for splitting a node. Instead of searching for the most important feature node, it searches for the best feature among the random subset of features by using random thresholds. This results in a wide diversity and yield a better model. The structure of the random forest tree is portrayed in Fig. 2. The acquired information is classified from the data set and represented as in the form of attributes such as card number, card limit, and personal information. The information is presented in the form of a matrix to decide the sample belongs to which of the decision tree. This process requires a large number of trees to create similar trees so as to provide various decision trees. By analyzing all the trees present in the graph, which tree gets the most number of relevant solutions, are identified. Based on that conclusion, the decision tree is chosen.
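A minimal sketch of this training and classification flow is given below, assuming a scikit-learn RandomForestClassifier; the file name and the 'Class' label column are illustrative assumptions rather than details taken from the paper. Setting max_features='sqrt' realizes the idea of considering only a random subset of features at each split.

# Hypothetical random forest pipeline: load data, split, train, evaluate.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("creditcard.csv")              # assumed data set file
X = data.drop(columns=["Class"])                  # 'Class' = fraud label (assumption)
y = data["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# max_features='sqrt' considers a random subset of features at each split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))                   # fraction of correctly classified transactions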

4 Data Preprocessing The biggest step in data analytics is data preprocessing; it plays a vital role in reducing several problems. Through preprocessing partially filled columns are eliminated. As shown in Fig. 3, the data exploration summarizing the main characteristics of a


Fig. 2 Random forest tree construction

Fig. 3 Data exploration

data set, including its size, accuracy, initial patterns in the data, and other attributes. According to Fig. 4, the pattern 0.0 shows the legal transactions that is 55,701.000 and the number of the fraud transactions. There are 21.000 fraud transactions present in the data set. According to Fig. 5, the combinations of 1.1 provide the information that 70.000 are the legal transactions and there are 8.000 possible fraud transactions. Further, the training operation is performed. Random forest is performed now along with the other training data set. According to Fig. 5, the processing has been done to the training data set. The number of legal transactions is denoted by 0.0 which get the count as 185,697.000 and then the numbers of fraud transactions present in the data set are 21.000. There are other transactions such as 1.1 and 1.0. All of the transactions that are 1.1 are


Fig. 4 Prediction on Y-axis (2 × 2 prediction matrix over classes 0 and 1 with entries 55,701.000, 8.000, 21.000 and 70.000)

Fig. 5 Prediction on X-axis (2 × 2 prediction matrix over classes 0 and 1 with entries 185,697.000, 8.000, 21.000 and 274.000)

labeled as 274.000. This shows a gradual rise in the number of legal transactions while the number of fraud transactions remains the same, so the system can observe the improvement in the legal transactions.
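As a minimal sketch of this preprocessing step, the snippet below loads the data set, drops partially filled records and counts the legal and fraud transactions; the file name and the 'Class' label column are assumptions, not details given in the paper.

# Hypothetical data-exploration step: size, missing-value removal, class counts.
import pandas as pd

data = pd.read_csv("creditcard.csv")     # assumed file name
print(data.shape)                        # size of the data set
data = data.dropna()                     # remove partially filled records
print(data["Class"].value_counts())      # 0 = legal, 1 = fraud (assumed label column)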

5 Model Developments In this paper based on prior studies, the application of random forest algorithm is introduced and analyzed in credit card fraud detection. The models are constructed and compared with other existing competitive techniques. The proposed system uses the library called MATPLOT to represent the information that is attained in the form of a graph. The values given in Figs. 4 and 5 are x-predict and y-predict which represents the fraud and legal transactions based on x-axis and y-axis. In Table 1, the sample used by the proposed system is given.

Table 1 Data exploration (five sample transactions from the data set, showing the Time attribute and the features V1–V10)


6 Conclusion The proposed random forest algorithm reduces the number of fraud transactions. Several experiments were performed using the random forest algorithm, and the obtained results ensured that the number of fraud transactions is greatly reduced. This enables more secure online transactions and makes the system more accurate.

References 1. U. Fiore, A. De Santis, F. Perla, P. Zanetti, F. Palmieri, Using generative adversarial networks for improving classification effectiveness in credit card fraud detection. Inf. Sci. 479, 448–455 (2019) 2. N. Carneiro, G. Figueira, M. Costa, A data mining based system for credit-card fraud detection in e-tail. Decis. Supp. Syst. 95, 91–101 (2017) 3. A.C. Bahnsen, D. Aouada, A. Stojanovic, B. Ottersten, Feature engineering strategies for credit card fraud detection. Exp. Syst. Appl. 51, 134–142 (2016) 4. M. Zareapoor, P. Shamsolmoali, Application of credit card fraud detection: based on bagging and ensemble classifier. Proc. Comput. Sci. 48, 679–685 (2015) 5. K. Randhawa, C.K. Loo, M. Seera, C.P. Lim, A.K. Nandi, Credit card fraud detection using AdaBoost and majority voting. IEEE Access 6, 14277–14284 (2017) 6. P. Save, P. Tiwarekar, K.N. Jain, N. Mahyavanshi, A novel idea for credit card fraud detection using decision tree. Int. J. Comput. Appl. 161(13), 0975–8887 (2017) 7. S. Sorournejad, Z. Zojaji, R.E. Atani, A.H. Monadjemi, A survey of credit card fraud detection techniques: data and technique oriented perspective. ArXiv (2016) 8. A. Singh, A. Jain, Adaptive credit card fraud detection techniques based on feature selection method. Adv. Comput. Commun. Comput. Sci., 167–178 (2019) 9. Z. Li, G. Liu, S. Wang, S. Xuan, C. Jiang, Credit card fraud detection via kernel-based supervised hashing, in 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (2018) 10. S. Xuan, G. Liu, Z. Li, L. Zheng, S. Wang, C. Jiang, Random forest for credit card fraud detection, in 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC) (2018)

Deep Learning Application in IoT Health Care: A Survey Jinsa Mary Philip, S. Durga, and Daniel Esther

Abstract Recent advances in healthcare domain integrate IoT technology to effectively monitor living conditions in day-to-day life. Monitoring real-time data obtained from IoT sensors enables to predict the risk factors of any chronic diseases. Machine and deep learning algorithms make the job of physicians easier in predicting the seriousness of the diseases. This paper presents an exhaustive overview on the need for intelligent prediction models in IoT health care. It also reviews in detail the merits and demerits of the classification and prediction techniques. The generic framework for IoT healthcare prediction is proposed. This paper also outlines the healthcare sensors and its purposes for intelligent healthcare monitoring. The areas of further research have also been presented. Keywords Internet of things · Deep learning · Health care · Machine learning · Intelligent health care · Health sensors

1 Introduction The Internet of things (IoT) is considered as a rapidly growing segment of Internet and has captured attention in various fields. It plays a vital role in today’s generation. It becomes more powerful when IoT is combined with machine learning technique to categorize the real-time sensor data. With the advancement in the field of AI, all the things beyond human understanding during the earlier days came into practical level. Deep learning is utilized in many applications which is a subset of machine learning, and it stands at the forefront of AI revolution [1]. It has shown its growth in various areas like health care, entertainment, robotics, coloring images and videos. Deep learning is considered as the main area under machine learning. It helps in developing a model that can predict various aspects of real time. The data that needs to be analyzed can be collected from various medical sensors [2]. This data can be given to various machine learning algorithms for the purpose J. M. Philip (B) · S. Durga · D. Esther Karunya Institute of Technology and Sciences, Kottayam, Kerala, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_19


of prediction. The real-time data is stored in the cloud, and as a result we can benefit from both IoT and cloud computing technologies in the same manner [3]. The combination of IoT and cloud computing technologies can lead to high efficiency for the system. The frequent changes in health parameters can be collected and analyzed to measure their severity at regular intervals of time. Smart healthcare sensors and wearables are the important entities that allow an IoT environment to monitor patients' health conditions in a non-intrusive way. Table 1 shows the purposes of the IoT healthcare sensors. The large volumes of data can be handled by machine learning algorithms to make decisions. The deep learning concept is derived from the conventional artificial neural network, with a large number of perceptron layers that can identify hidden patterns. When the term deep learning was introduced, it was not very promising in the earlier stages because of the lack of powerful computing systems, but over time the computational power of systems increased. The rise of cloud computing for the storage of huge volumes of data caused a boom in the deep learning concept. Deep learning is considered as the process of making the system learn a concept, which can be done through either supervised, unsupervised or semi-supervised learning [10]; the various types of learning are depicted in Fig. 1. Nowadays, the concept of deep learning is growing as it develops hybrid systems that help in predicting various diseases that affect mankind. A deep learning technique can study humongous datasets and build new features rather than relying on the existing ones.

Table 1 Sensors used in healthcare domain

Sensors used in health care     Purpose
Air bubble detector [4]         To identify the properties of fluids
Force sensor [4]                Measure force with high reliability
ECG sensor [5]                  For measuring heart rate
Photo optic sensor [4]          Suitable for medical applications in which selection of a peak wavelength is important
Position sensor [4]             Used for measuring the changes in the magnetic field
Temperature sensor [6]          Determine the body temperature
Blood glucose sensor [5]        For finding out the blood glucose level
Blood pressure [6]              Systolic and diastolic pressure is identified
Oximetry sensor [5]             Used for monitoring oxygen saturation
Potentiometric sensor [7]       Sweat ion monitoring
Biosensors [6]                  Scans during pregnancy or ultrasound, testing blood sugar
Image sensors [8]               Cardiology
Gyroscope [9]                   Used for measuring the angular velocity
Accelerometer [9]               For measuring acceleration


Fig. 1 Types of machine learning

2 Classification of Deep Learning Networks in IoT Health Care 2.1 Fully Connected Neural Networks A group of neurons is considered as the basic constituents of the neural network (NN). The idea of NN comes from the working principle of biological neural networks. It can be considered as a tool used in data classification. The main layers present in a neural network are the input layer, hidden layer and the output layer. It is basically a group of neurons connected in the form of an acyclic graph. In a fully connected NN, all the neurons will be connected to every other neuron in the network. For training the neural network, backpropagation can also be incorporated [11].

2.2 Convolutional Neural Networks Convolutional neural network (CNN) is an artificial neural network that uses perceptron for supervised learning. It generally processes spatial data and is used in the areas of image recognition and object classification. The preprocessing needed for CNN is relatively small when compared to other image classification algorithms. CNN consists of convolutional layer, max-pooling layer and classification layer [12]. The convolutional layer present is considered as the main building block in CNN. The hidden layer has a series of convolutional layers. CNN is considered as more powerful than RNN.
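For illustration, a minimal CNN with the three layer types named above (convolutional, max-pooling and classification) can be written as follows; Keras is assumed here, since the survey does not prescribe a framework, and the input shape and class count are placeholders.

# Minimal illustrative CNN: convolutional, max-pooling and classification layers.
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),   # classification layer
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
cnn.summary()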


2.3 Recurrent Neural Networks The recurrent neural network (RNN) is an artificial neural network which processes a sequence of inputs with the help of memory. It is a powerful and robust type of neural network. RNN is considered a promising algorithm as it is having its own internal memory [13]. Every node present in the layer gets connected with every other node in a single direction. Each connection is associated with a set of weight values. The input layer accepts data to the network, and the output layer generates the output of the process. The hidden layers perform all the processing of the input. RNN generally processes sequential data and is used in speech recognition and natural language processing.

2.4 Recursive Neural Networks Recursive neural network is a type of RNN that can process variable-length input. The individual nodes are grouped together using weight matrix into parents [14]. It can use a training dataset to model the hierarchical structures. The various parts of an image are learned by a shared-weight matrix and also a binary tree structure. The various applications include image decomposition and natural language processing. The training is faster in recurrent neural network when compared with recursive neural network.

3 Study of Deep Learning Techniques in Health Care Deep learning comes as a subcategory of machine learning, and it is the fastestgrowing field in machine learning. Deep learning refers to artificial neural network which consists of various layers. It mimics the neurons of neural network in human brain. In deep learning, we are training the system to respond to situations that come naturally. This enables accurate decisions in case of emergencies. It helps in achieving high performance, and also sometimes it can exceed the performance level of humans. The deep learning has gained a huge rise in various fields due to the accuracy that it provides in various areas. Labeled data is very much essential in deep learning, and it also requires large computing power. Large set of labeled data is taken into account for training deep learning models, and features are extracted directly without manual feature extraction. Deep learning techniques can improve the healthcare domain in a much positive way. In medical imaging, it helps in the detection and recognition of melanoma. It learns the important features from a group of medical images. By using images from MRI and other sources, 3D brain construction is possible. Other applications include brain tissue classification, tumor detection, Alzheimer’s detection,


etc. In the area of bioinformatics, it helps in the diagnosis and treatment of terminal diseases like cancer. Deep learning also improves the predictive analysis area. Deep learning in health care helps doctors to analyze the disease and take quick decisions, thus enabling proper treatment. Deep learning has gained much importance in various spheres of health care like genomics, cell scope, medical imaging, discovery of drugs [15] and Alzheimer’s disease. In the area of genomics, if a patient is undergoing any treatment it can determine the genomic structure to predict what diseases can affect the person in the future. It helps in making the doctors fast and more accurate. Cell scoping also uses deep learning to monitor the health condition of patients, and thereby we can reduce the visit to hospitals. Harmful diseases like brain tumors and cancer can be detected using the medical imaging technique. Alzheimer’s can be detected at an early stage using the deep learning technique. Deep learning can make the unstructured data into the simple representations which can be easily determined by individuals. It learns the important relationship in the data and creates a model. Improved outcomes can be provided at low cost. As mentioned in Fig. 2, deep learning concept is showing a massive increase in the healthcare domain.

Fig. 2 Application of deep learning techniques in health care during the year 2010–2019


Table 2 Comparative study on decision and prediction algorithms (disease, algorithm used, merits and demerits)

Human activity recognition [16] — PCA and linear discriminant analysis. Merits: it outperformed SVM and ANN. Demerits: poor recognition rates.
Fall detection [9] — KNN and decision tree. Merits: the accuracy of KNN and decision tree is 98.75 and 90.59%. Demerits: implementing in real time is difficult.
Health monitoring system for war soldiers [17] — K-means. Merits: provides security and safety to soldiers. Demerits: real-time soldier data is unavailable.
Fetal movement monitoring [18] — PCA. Merits: some of the physiological features extracted can be used to identify the fetal health state. Demerits: use of maternal perception as the reference is not fully trustworthy.
Stroke detection [19] — CNN. Merits: human errors during detection can be rectified. Demerits: it cannot automatically identify nodes when a new node is connected to its IP; nodes need to be added manually.

3.1 A Comparative Study on Decision and Prediction Algorithms This section discusses the development of decision and prediction algorithms in five different domains, viz., (1) stroke detection, (2) human activity recognition system, (3) fall detection system, (4) health monitoring system for war soldiers and (5) fetal movement monitoring. Table 2 shows the comparative study on the various decision and prediction algorithms.

3.2 Stroke Detection Stroke is a situation that arises due to blood clot, and it leads to the non-functioning of various parts of the brain. Deep learning can be used in the area of stroke detection. The common method used in hospitals is computerized tomography (CT) scan and


MRI scanning. Instead of using these methods, we can use convolutional neural network technology to detect a healthy brain and also a hemorrhagic stroke [19]. The detection is carried out by developing a framework by examining the CT images. Parameters like accuracy, precision, F1-score, recall and processing time are analyzed to determine the accuracy, and CNN provides 100% accuracy. The new classifier is trained using the new dataset. Various steps include image acquisition, feature extraction, classification and IoT framework. The acquired images are in DICOM format. During feature extraction, the CNN returns many attributes. The Gaussian probability density function is used by Bayesian classifier. An online platform called LINDA (Lipisco Image Interface for Development of Application) [16] was created to perform image processing. The DICOM images can achieve 100% accuracy using CNN architecture.

3.3 Human Activity Recognition System The human activity recognition gained a lot of attention for both pattern recognition and human–computer interactions. Incorporating wearable sensors to an individual body can help in identifying their lifestyle. The main parts of the system are sensing feature extraction and recognition. Initially, feature extraction is done from raw data. Smartphone sensors like triaxial accelerometers and gyroscope are used for the collection of data. To remove noise, a low-pass Butterworth filter is used. The robustness is achieved by processing the extracted features using kernel principal component analysis and linear discriminant analysis [16]. A deep belief network (DBN) is used for human activity recognition. The DBN has two main parts: the pre-training phase which is based on the restricted Boltzmann machine; after the pre-training, fine-tuning algorithm is used for adjusting the network weights.

3.4 Fall Detection System Fall detection system is mainly implemented for elderly people at home. As age increases, the tendency to fall also increases and the main reasons are heart attack and low blood pressure. Therefore, it is very much essential to develop a fall detection system that can precisely identify the fall and can send a message immediately. The system can save a life when fall occurs suddenly. The system is implemented in an Arduino platform with sensors and Python programming. For the purpose of prediction and analysis, classification algorithms can be used. The sensor used here is MPU6050, and data is collected for the training purpose. The accelerometer sensor is capable of sensing acceleration, and the gyroscope sensor deals with sensing angular velocity. The dataset is developed by considering different activities like sleeping, sitting, walking and falling [9]. The sensor readings are stored in MongoDB, and from there it is taken for preprocessing and stored in a CSV file. Training of the dataset is


done using KNN and decision tree algorithm. New input is given for prediction, and an SMS is given in case of emergency.
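A minimal sketch of this training step is given below, assuming the MPU6050 readings have been exported to a CSV file with a label column; the file and column names are illustrative, not taken from the cited system.

# Hypothetical fall-detection training: KNN and decision tree on sensor features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("mpu6050_activity.csv")          # assumed file: ax, ay, az, gx, gy, gz, label
X, y = df.drop(columns=["label"]), df["label"]    # label in {sleeping, sitting, walking, falling}
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
print("KNN:", knn.score(X_te, y_te), "Decision tree:", tree.score(X_te, y_te))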

3.5 Health Monitoring System for War Soldiers Since soldiers play a vital role in safeguarding the nation, it is very much important to track their location and the real-time monitoring of their health. It is also important to identify the location of people who lost life in battlefield or became dead in between the war. GPS module is used to track the location, and sensors like temperature, humidity, heartbeat and accelerometer are used for determining the body conditions. Initially, real-time data is collected from the war zone using sensors [17]. Data is transmitted to the leader using Zigbee, and it is given to control unit with the help of a LoRaWAN. The data prediction and analysis are done using K-means algorithm. Healthy, ill, abnormal and dead are the clusters to which the real-time data will be classified. ThingSpeak cloud platform is used for data storage purposes.

3.6 Fetal Movement Monitoring Automatic monitoring of fetal movement is done using accelerometers and machine learning. There are a local monitoring system and a remote evaluation unit [18]. Signal acquisition is done using four acceleration sensors, and a microcontroller is used for signal processing. An Android platform is used for fetal movement detection and for communication. IIR band-pass filter is used for signal filtering, and features are extracted which is given to a machine learning algorithm for classification.

4 A Generic Framework for Intelligent Predictions in IoT Health Care Cardiac arrhythmia is a collection of conditions in which the heartbeat is too slow or too fast. Heartbeats are generally classified into five, viz., (1) normal sinus beats, (2) premature ventricular contraction (PVC), (3) atrial premature beats (APB), (4) left bundle branch block (LBBB) and (5) right bundle branch block (RBBB). Except the first one, all the other characterize and constitute complex arrhythmia. For atrial premature beat, an abnormal P wave occurs before expected in the cardiac cycle. When the T wave is large and appears opposite to the QRS complex, premature ventricular contraction takes place. In LBBB, the QRS complex widens and is downwardly deflected, and if the R wave is short and if QRS complex widens RBBB happens. The MIT-BIH arrhythmia dataset is taken for analysis. In the


Fig. 3 Collecting ECG sensor data and giving it to the trained model (train the system using the ECG dataset; collect data from the ECG sensor via Arduino; feed the live ECG data to the trained model; predict the risk level)

proposed method, the system is trained using ECG dataset and we come up with a model. Initially, data preprocessing is done to remove missing values and irrelevant values. The morphological features are extracted during principal component analysis (PCA). PCA is an unsupervised learning technique that can extract features or patterns from the data. A long short-term memory (LSTM) can record each class of ECG beat types. LSTM is a recurrent neural network having feedback connections which is suitable for classification. Figure 3 represents the live data collection using sensors which is taken using Arduino Uno. This data is given to the trained model for prediction. This is given to doctors or physicians to monitor the patient’s condition.
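A minimal sketch of the PCA-plus-LSTM pipeline outlined above is shown below for the five beat classes; Keras is assumed, and the beat segments, labels and shapes are random placeholders rather than the MIT-BIH data.

# Hypothetical PCA + LSTM sketch for five-class ECG beat classification.
import numpy as np
from sklearn.decomposition import PCA
from tensorflow.keras import layers, models

beats = np.random.rand(1000, 180)                 # stand-in for segmented ECG beats
labels = np.random.randint(0, 5, size=1000)       # N, PVC, APB, LBBB, RBBB

features = PCA(n_components=32).fit_transform(beats)
features = features.reshape((-1, 32, 1))          # (samples, timesteps, channels) for the LSTM

model = models.Sequential([
    layers.LSTM(64, input_shape=(32, 1)),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(features, labels, epochs=2, batch_size=64, verbose=0)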

5 Conclusion and Future Research Directions Earlier methods have got a wide range of limitations in terms of accuracy, precision and computational time, whereas deep learning technology has shown a better result for the same task. This survey analyzes the possibilities of machine learning, deep learning and IoT in the health domain. Deep learning is a subcategory of AI which helps in developing model from available datasets, and new data can be given to predict the output. This can train the machines to reach human capabilities. It can make decisions from images and also from the data values obtained from IoT devices. The deep learning technology is gaining more and more important in day-to-day life as it is applied in almost all areas of human life. The wide range of practical applications is enabling it to grow more rapidly. Many of the data generated by IoT devices is processed in remote servers. By applying the concept of edge computing,


the data can be analyzed in a much closer environment. Edge computing allows data traffic from the IoT devices to be analyzed at a local server before giving it to cloud.

References 1. A. Das, P. Rad, K.K.R. Choo, B. Nouhi, J. Lish, J. Martel, Distributed machine learning cloud teleophthalmology IoT for predicting AMD disease progression. Futur. Gener. Comput. Syst. 93, 486–498 (2019) 2. M.M. Hassan, M.Z. Uddin, A. Mohamed, A. Almogren, A robust human activity recognition system using smartphone sensors and deep learning. Futur. Gener. Comput. Syst. 81, 307–313 (2018) 3. P.M. Kumar, S. Lokesh, R. Varatharajan, G. Chandra Babu, P. Parthasarathy, Cloud and IoT based disease prediction and diagnosis system for healthcare using fuzzy neural classifier. Futur. Gener. Comput. Syst. 86, 527–534 (2018) 4. https://www.te.com/usa-en/industries/sensor-solutions/applications/sensor-solutions-formed ical-applications.html 5. F.-Y. Leu, C.-Y. Ko, I. You, K. Kwang, A smartphone-based wearable sensors for monitoring real-time physiological data. Comput. Electr. Eng. 65, 376–392 (2018) 6. https://slideplayer.com/slide/5683843/ 7. Q. An, S. Gan, X. Jianan, Y. Bao, A multichannel electrochemical all-solidstate wearable potentiometric sensor for real-time sweat ion monitoring. Electrochem. Commun. 107, 106553 (2019) 8. S. Veena, A. Monisha, V. Manjula, A survey on sensor tools for healthcare applications. Int. J. Res. Appl. Sci. Eng. Technol. 6 (2018) 9. S.K. Bhoi, S.K. Panda, B. Patra, B. Pradhan, FallDS-IoT: a fall detection system for elderly healthcare based on IoT data analytics, in International Conference on Information Technology (2018) 10. M. Nilashi, O. bin Ibrahim, H. Ahmadi, L. Shahmoradi, An analytical method for diseases prediction using machine learning techniques. Comput. Chem. Eng. 106, 212–223 (2017) 11. P. Naraei, A. Abhari, A. Sadeghian, Application of multilayer perceptron neural networks and support vector machines in classification of healthcare data, in Future Technologies Conference (2016) 12. M.Z. Alom, M. Alam, T.M. Taha, in Object Recognition Using Cellular Simultaneous Recurrent Networks and Convolutional Neural Network (2017) 13. https://en.wikipedia.org/wiki/Recurrent_neural_network 14. https://en.wikipedia.org/wiki/Recursive_neural_network 15. https://www.flatworldsolutions.com/healthcare/articles/top-10-applications-of-machinelearn ing-in-healthcare.php 16. M.M. Hassan, M.Z. Uddin, A. Mohamed, A. Almogren, A robust human activity recognition system using smartphone sensors and deep learning. Futur. Gener. Comput. Syst. 81, 307–313 (2017) 17. Aashay Gondalia, Dhruv Dixit, Shubham Parashar, IoT-based healthcare monitoring system for war solders using machine learning. Proc. Comput. Sci. 133, 1005–1013 (2018) 18. X. Zhao, X. Zeng, L. Koehl, in An IoT-Based Wearable System Using Accelerometer and Machine Learning for Fetal Movement Monitoring (IEEE, 2019) 19. C.M.J.M. Dourado, S.P.P. da Silva, R.V.M. da Nóbrega, A.C. Antonio, P.P.R. Filho, V.H.C. de Albuquerque, Deep learning IoT system for online stroke detection in skull computed tomography images. Comput. Netw. 152, 25–39 (2019)

Context Aware Text Classification and Recommendation Model for Toxic Comments Using Logistic Regression S. Udhayakumar, J. Silviya Nancy, D. UmaNandhini, P. Ashwin, and R. Ganesh

Abstract In recent days, the online conversations have become a vibrant platform in expressing one’s opinion about the issues prevailing in the society. Increase of threats, abuses and harassment in social websites has stopped many from expressing themselves, and they give upon seeking different opinions due to the fear of being offended and cornered. This can be advocated with the advancements of machine learning and artificial intelligence, which offers a way to possibly classify the comments based on its level of toxicity and its nature. In this following work, we propose an application that classifies the comments based on the nature of toxicity. Here, the learning model is trained with datasets containing toxic comments, and when such comments are used, the prototype identifies and highlights it. This is carried out with the most popular data classifier called logistic regression in which the prototype and feature extraction is done using sklearn’s TfidfVectorizer module. Once the toxic comments are identified by the trained application, warning flag will be raised to the user and analogous assertive statements are provided as suggestion by the recommendation engine, in which item-to-item-based collaborative filtering algorithm is used. The obtained results justifies that logistic regression affords adequate evidence in identifying the toxic comments. S. Udhayakumar (B) Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Chennai, India e-mail: [email protected] J. Silviya Nancy PES University, Bangalore, India e-mail: [email protected] D. UmaNandhini Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology, Chennai, India e-mail: [email protected] P. Ashwin · R. Ganesh Rajalakshmi Engineering College, Chennai, India e-mail: [email protected] R. Ganesh e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_20


Keywords Machine learning · Toxic comments · Logistic regression · Recommendation systems · Collaborative filtering

1 Introduction With the revolution in Internet technologies and the rise in data generation, conversational practices have changed, and this has had a great impact on society. At the time of the evolution of the Internet, communication started with the excitement of sending pictures through e-mails, and slowly huge quantities of data were generated as the days passed; we now call this era the "information era." On the other side, the classification and mining of these data have become a tedious task, and with the heap of conversations online it is noted that these texts have a high probability of toxicity [1]. Recently, emerging technical areas like machine learning and natural language processing (NLP) have become a huge asset in classifying and removing such content online. Hence, the automatic recognition of words which are abusive, hateful, threatening or insulting can be beneficial. To facilitate this, Google and Jigsaw are working to improve online conversation so that toxicity is reduced. Many machine learning techniques enable efficient classification of toxic comments, including convolutional neural networks (CNNs) [2], bidirectional RNN and LSTM, logistic regression and SVM. CNNs are neural networks used for object recognition and similarity classification, and they also support optical character recognition. As an alternative, a CNN can be given text or a document as input, characterized as a matrix in which every row is considered as one word. The objective of this research work is to categorize the toxic comments among the provided comments, with the facility of rephrasing. Here, the contexts of previous statements are analyzed using a recommendation architecture [3], and the user is offered suggestions to reframe the sentences for healthy communication. The system is able to detect the toxic words/comments in an online conversation and report them through logistic regression, as depicted in Fig. 1.

Fig. 1 Perspective API architecture (the classified comments are displayed as toxic or non-toxic)


2 Literature Review This chapter enlightens the great throwbacks on text classification for finding of unethical and toxic comments online and the recommender systems for rephrasing. Alexander Genkin et al. [4] have presented a Bayesian logistic regression that applies Laplace. Hao Peng1 et al. proposed the hierarchical text classification problem with deep learning algorithm [5]. Here, the algorithms are based on the graph-based convolutional networks where the text is transformed to words of graph and a word graph model is optimized. This helps in understanding the higher layer semantics which are also non-continuous. This work experiments the classification of short text [6]. The authors have proposed a Gaussian model for finding whether the words and their associations are syntactically and semantically linguistic. The researchers, Revati Sharma and Meetkumar Patel [7], have analyzed the meticulous importance of toxic comment classification. They have proved with the working principles of convolutional neural networks and recurrent CNN and demonstrated that LSTM works better in text classification of online comments in the aspects of accurateness and point in time. Spiros V. Georgakopoulos et al. [2] recommended online toxic comment classification using convolutional neural networks. They have attempted to prove with different layers of convolutional neural networks and various text classification methods like SVM, NB, kNN and LDA. Chikashi Nobata et al. [8] described the research work on detection of abusive online comments with natural language processing methodologies. The effort was made on detecting the toxic nature of the contents online with its lexical analysis.

3 Toxic Comments Classification Model The entire application contains a sequence of primary processes. Each of them is considered primary because it is interdependent. The following is the flow of our application which classifies the comments based on the toxicity of the comments. The following are the highlights of the proposed classification model. To classify the toxic comments, initially a dataset is created or fetched from the source which consists of lot of comments and their corresponding nature of the toxicity (Here the dataset source is from kaggle.com). The fetched dataset is used to create a learning model by extracting the features from the datasets and combining it with one of the data classification algorithms. In our case, logistic regression is used as the data classification algorithm. Once the learning process ends, the result model which is obtained is the trained model, which is capable of identifying the toxic comments and its nature of toxicity or classifying the comments based on its toxic nature. In order to implement this feature, it is necessary to create real-world application software which could be capable of executing parallel the model in the backend. In our case, we are creating a messaging application using simple socket programming.


Fig. 2 Toxic comment classification using logistic regression

3.1 Learning Model

The learning model is the primary component of the proposed architecture. The data is gathered and a probability measure is formulated to gauge success, after which the data is prepared and split according to the required classification. In this scenario, the dataset consists of a mixture of abusive and non-abusive comments, which has to be handled by splitting it based on the nature of toxicity. To classify this, logistic regression is employed, and the overall view is depicted in Fig. 2. A minimal sketch of this preparation step follows.
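
The sketch below loads the labelled comments and holds out a test split, assuming the Kaggle toxic-comment CSV [9] exposes a comment_text column and one binary column per toxicity label (the file and column names are assumptions about that public dataset, not a prescription):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the labelled comments; every label column holds 1 (toxic in that sense) or 0.
data = pd.read_csv("train.csv")
label_columns = ["obscene", "threat", "insult", "identity_hate", "none"]

X = data["comment_text"]
Y = data[label_columns]

# Hold back a share of the comments so the trained model can be evaluated later.
X_train_text, X_test_text, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)
```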

3.2 Trained Model

Once the model has learned the association between the labels and the characteristics of the input, it can be regarded as the trained model. Here, scikit-learn's TfidfVectorizer module is used to extract the features on which the model is trained (see the sketch after this paragraph). Once the toxic nature of a comment is identified, the recommender system provides equivalent phrases for rewriting the comment. This is demonstrated with a real-time messaging application on which the trained model works; the messaging application is created using Node JS and HTML.
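
Continuing from the split above, a short sketch of the TF-IDF step; writing the fitted vectorizer to fit_model.sav mirrors the file name used in Sect. 4, and the vectorizer settings shown are assumptions:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the TF-IDF vocabulary on the training comments only, then transform both splits.
vectorizer = TfidfVectorizer(stop_words="english", max_features=50000)
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

# Persist the fitted vectorizer so incoming chat messages can be transformed the same way.
pickle.dump(vectorizer, open("fit_model.sav", "wb"))
```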


4 Experimental Analysis

This section presents the evaluation and results of toxic comment classification.

Selecting an appropriate dataset: Based on the chosen dataset, the toxicity of the comments can be differentiated [9]. A comment is marked as '1' in the columns matching its nature of toxicity, and the remaining columns are marked as '0'.

Selecting the classifier: In this work, the logistic regression classifier is used. Using this classifier, a model is created by training on the training dataset; the model is then capable of differentiating comments according to their toxicity.

Creating a classifier script: In order to make use of the trained model, it is necessary to develop an automated script that fetches the comments from the application and feeds them to the model.

Transforming a comment into extracted features: The fitted feature extractor was saved in a .sav file (fit_model.sav). This file is loaded, and the comment/message is transformed.

Determining the nature of the comment using the trained model: The trained models are saved into six .sav files. The toxic label with the highest probability value determines the toxicity nature of the comment.

predictions = {}
for feature in Y_train.columns:
    loaded_model = pickle.load(open('model_{}.sav'.format(feature), 'rb'))
    # keep the probability of the positive (toxic) class for every test comment
    predictions[feature] = pd.Series(loaded_model.predict_proba(X_test)[:, 1])

Creating a data frame and converting it into a .csv file: The predicted probability values are assembled into a pandas data frame, which is then written to a .csv file (result.csv) [12].

submission = pd.DataFrame({
    'obscene': predictions['obscene'],
    'threat': predictions['threat'],
    'insult': predictions['insult'],
    'identity_hate': predictions['identity_hate'],
    'none': predictions['none'],
})
submission.to_csv("result.csv", index=False)

Creating models using the extracted features: A learning and trained model is developed from the extracted features as described above.

Logistic regression: The classifier predicts the probability that a label is "true" or "false."

Module import:
from sklearn.linear_model import LogisticRegression

Creating a logistic regression object, training it for each label, and saving the trained model:
for i, feature in enumerate(Y_train.columns):
    clf = LogisticRegression(C=4.0, solver='sag')
    clf.fit(X_train, Y_train.iloc[:, i])
    filename = 'model_{}.sav'.format(feature)
    pickle.dump(clf, open(filename, 'wb'))


In the case of our example, let us predict the probability of occurrence of a bad comment. The probability of a toxic comment can be calculated as

P(\text{Toxic\_Comment}) = \frac{e^{c}}{1 + e^{c}}    (1)

The probability of a non-toxic comment can be calculated as

P(\text{Non-Toxic\_Comment}) = 1 - P(\text{Toxic\_Comment})    (2)

Assume the function c to be any one category of toxic comment; for now, let it be "threat." The probability of a toxic comment may then be equated as

P(\text{Toxic\_Comment}) = \frac{e^{\beta \cdot \text{threat}}}{1 + e^{\beta \cdot \text{threat}}}, \quad \text{where } \beta \text{ is the coefficient for the comment type}    (3)

On calculating the likelihood of the problem, the following equation is obtained:

\frac{P(\text{Toxic\_Comment})}{P(\text{Non-Toxic\_Comment})} = e^{\beta \cdot \text{threat}}    (4)

Now, by classing all the comments, Eq. (3) is extended to take all the types of comments:

LR = \frac{P(\text{Toxic\_Comment})}{P(\text{Non-Toxic\_Comment})} = e^{\beta_1 C_1 + \beta_2 C_2 + \beta_3 C_3 + \beta_4 C_4 + \beta_5 C_5}    (5)

where \beta_i refers to the coefficient for a comment type and C_i denotes the class of comments. Calculating the final logistic regression with the coefficients \beta gives

LR = e^{0.9928565418\,C_1 + 0.9999828603\,C_2 + 0.9992578669\,C_3 + 0.9957899965\,C_4}    (6)

C_1, C_2, C_3, C_4 and C_5 take the value 0 or 1 based on the nature of toxicity. Here C_1 = C_2 = C_3 = C_4 = 1 and C_5 = 0, so LR = e^{3.9885623254} = 53.977393, and the final estimated value is (Fig. 3)

P(\text{Toxic\_Comment}) = \frac{53.977393}{1 + 53.977393} = 0.98181069 \approx 98.18\%    (7)
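
The arithmetic of Eqs. (5)-(7) can be checked in a few lines of Python using the coefficients quoted above (this is only a worked example of the formula; small rounding differences from the values quoted in the text are to be expected):

```python
import math

# Beta coefficients for C1..C4 taken from Table 1 / Eq. (6); C5 ("none") is 0 here.
betas = [0.9928565418, 0.9999828603, 0.9992578669, 0.9957899965]
C = [1, 1, 1, 1]

exponent = sum(b * c for b, c in zip(betas, C))
likelihood_ratio = math.exp(exponent)                 # Eq. (5)
p_toxic = likelihood_ratio / (1 + likelihood_ratio)   # Eq. (7)

print(exponent, likelihood_ratio, p_toxic)            # roughly 3.99, 54, 0.98
```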


Fig. 3 Representational plotting of toxic comments

• Analysis of the result.csv file. For every message, the same result.csv file is overwritten. Since our last message was "I will kill you," which is clearly a threat, the probability value in the threat label column is the maximum compared with the other label columns. This is how the most suitable toxicity label for every comment message is determined. Table 1 shows the toxicity scores for the comments entered in the messaging application of Fig. 4.

• Integrating the application with the trained model.

Table 1 Nature of comments based on the types of toxicity

Comments              Identity hate (IH)  Insult (IN)  None (NO)  Obscene (OB)  Threat (TH)  Classing
I will kill you       0.0254091           0.2100355    0.016198   0.1846444     0.9928565    C1
Are you an idiot      0.2896478           0.9999828    0.014569   0.1579566     0.3595478    C2
He looks like donkey  0.9992578           0.4587899    0.001558   0.1947523     0.0254456    C3
His acts are filthy   0.0158966           0.2144896    0.002648   0.9957899     0.0145259    C4
She looks beautiful   0.0000000           0.0000000    0.999999   0.0000000     0.0000000    C5

Note The class of each comment is determined by its highest value (shown in bold in the original)

Fig. 4 Message application classifying message based on the nature of its toxicity


Fig. 5 Suggestion for rephrasing the toxic comments

The message application is started on the local host, with Node JS as the server. Figure 4 shows the message application with a few comments, along with the nature of each comment labeled to the right of the application.

• Trained model with recommended rephrasing.

Figure 5 illustrates how the user interacts with the trained model while expressing their views through comments, and how the application prompts a warning and recommends rephrasing when a user's comment is toxic in nature. The suggestions are enabled by the recommendation system by calculating a similarity index. For instance, take the comment "I will kill you." It is an offensive statement that cannot be used on a common platform, so the recommendation engine maps it to the available semantics in the dataset, finds the nearest meaning and suggests that to the user [10]. Assume two given words (word1 and word2) whose semantic similarity has to be scored. There are different layers of hierarchy in finding the semantics of the words from the dataset, and the similarity index combines the minimal path length (P) and the depth (D) in that hierarchy. For example, S1 = "I will kill you" can probably be replaced by one of the options, such as S1 = "I don't like you."

\text{smw}(\text{word}_1, \text{word}_2) = e^{-\alpha P} \cdot \frac{e^{\beta D} - e^{-\beta D}}{e^{\beta D} + e^{-\beta D}}, \quad \text{where } \alpha \text{ and } \beta \ge 0
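
A minimal sketch of this similarity index, using WordNet through NLTK to supply the shortest path length P and the depth D of the words' common subsumer (the second factor in the formula is simply tanh(βD)); the α and β values and the use of WordNet are assumptions, not the paper's exact recommendation engine:

```python
import math
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

ALPHA, BETA = 0.2, 0.45   # assumed weighting constants, both >= 0

def smw(word1, word2):
    """Path- and depth-based word similarity: exp(-alpha*P) * tanh(beta*D)."""
    best = 0.0
    for s1 in wn.synsets(word1):
        for s2 in wn.synsets(word2):
            path = s1.shortest_path_distance(s2)          # minimal path length P
            subsumers = s1.lowest_common_hypernyms(s2)    # shared ancestor gives depth D
            if path is None or not subsumers:
                continue
            depth = subsumers[0].max_depth()
            best = max(best, math.exp(-ALPHA * path) * math.tanh(BETA * depth))
    return best

# Score candidate replacement words against the offensive one.
print(smw("kill", "dislike"), smw("kill", "hate"))
```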

5 Conclusion and Future Works

This research work focuses on making the Internet a medium for peaceful and meaningful conversations and discussions. People with a positive point of view should be able to converse and debate in a reasonable manner rather than engaging in verbal fights over the Internet merely to defeat the opposition. One way to achieve this is to create a monitoring unit that can detect and remove such comments or messages, and our work addresses this problem by detecting the nature of the toxic comment. The proposed model paves the way for identifying toxic comments and provides a prototype that encourages the use of non-toxic words and sentences on public platforms. As a future enhancement, the focus could shift to training the model to detect obscene images or videos, which can be more harmful than verbal messages or comments. The model could be trained to detect obscene images and blur them; in the case of videos, the content could either be labeled as obscene or be removed from the page entirely.

References

1. B. van Aken, J. Risch, R. Krestel, A. Löser, Challenges for toxic comment classification: an in-depth error analysis, in 2nd Workshop on Abusive Language Online (2018), pp. 33–42
2. S.V. Georgakopoulos, S.K. Tasoulis, A.G. Vrahatis, V.P. Plagianakos, Convolutional neural networks for toxic comment classification, in SETN'18 Conference on Artificial Intelligence (2018)
3. U. Nandhini, L. Tamilselvan, Mobile recommendation engine for offloading computations to cloud using Hadoop cluster. World Appl. Sci. J. 29, 41–47 (2014) (Data Mining and Soft Computing Techniques)
4. A. Genkin, D.D. Lewis, D. Madigan, Large-scale Bayesian logistic regression for text categorization. Technometrics 49(3), 291–304 (2007)
5. H. Peng, J. Li, Y. He, Y. Liu, M. Bao, L. Wang, Y. Song, Q. Yang, Large-scale hierarchical text classification with recursively regularized deep graph-CNN, in International World Wide Web Conference Committee, Creative Commons (2018)
6. C. Ma, W. Xu, P. Li, Y. Yan, Distributional representations of words for short text classification, in NAACL-HLT (2015), pp. 33–38
7. R. Sharma, M. Patel, Toxic comment classification using neural networks and machine learning. Int. Adv. Res. J. Sci. Eng. Technol. 5(9) (2018)
8. C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive language detection in online user content, in International World Wide Web Conference Committee (IW3C2) (2016)
9. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data
10. H. Hosseini, S. Kannan, B. Zhang, R. Poovendran, Deceiving Google's Perspective API built for detecting toxic comments, Network Security Lab (NSL) (University of Washington, 2017)

Self-supervised Representation Learning Framework for Remote Crop Monitoring Using Sparse Autoencoder J. Anitha, S. Akila Agnes, and S. Immanuel Alex Pandian

Abstract Remote crop monitoring is one of the emerging technologies required for precision agriculture. The advent of remote monitoring techniques accumulates a huge amount of real-time image data in cloud storage, and preserving this big data without suppressing its significant details is important for the further crop investigation process. Feature learning with an autoencoder helps in learning significant data without compromising the original dataset variance. This work aims to develop a sparse autoencoder model to obtain high abstract-level features from the image such that the original image can be reconstructed from the reduced features with minimum error. A sparse autoencoder is an unsupervised backpropagation neural network with sparse network connections that converts a high-dimensional image into low-dimensional features. The performance of the autoencoder with respect to image reconstruction is evaluated in terms of MSE and PSNR. Also, the effect of dimensionality reduction is qualitatively analyzed with PCA plots, which confirm that the reduced dataset maintains the variance required for further crop investigations.

Keywords Dimensionality reduction · Sparse autoencoder · Principal component analysis · Feature engineering

J. Anitha · S. Akila Agnes Department of CSE, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu 641114, India e-mail: [email protected] S. Akila Agnes e-mail: [email protected] S. Immanuel Alex Pandian (B) Department of ECE, Karunya Institute of Technology and Sciences, Coimbatore, Tamil Nadu 641114, India e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_21


1 Introduction

Remote crop monitoring is an agricultural information system that captures high-resolution photos of plants at fixed time intervals. Plant images are automatically uploaded to cloud storage without manual assistance, enabling continuous crop monitoring. The high-resolution plant images are analyzed by agricultural consultants or machine learning algorithms to understand the condition of the farm field. This remote crop monitoring process helps farmers produce quality yields through continuous monitoring of germination and crop development, supported by fertilizers and pesticides. The processing of massive amounts of data requires a system with high computational power and also increases the complexity of algorithms, so real-time applications need an efficient method to reduce the dimensions of huge data and minimize the computational complexity. Popular dimensionality reduction approaches such as backward selection, removal of highly correlated attributes, and principal component analysis have been adopted for efficient data representation. Autoencoders (AE) are widely used in a range of applications because of their ability to learn nonlinear discriminative features. An autoencoder learns to compress the input image into a reduced-dimensional feature set and then reconstruct the original image [1]; it is trained using gradient descent optimization by minimizing the mean squared error. Researchers have enforced various constraints [2] to learn good representations with AEs by preventing memorization or overfitting. Sparsity is one of the common constraints that helps to minimize the computational complexity and improves the performance of the model. The sparse representation in neural networks is inspired by the human neural network [3] and has been widely utilized in big data problems. Sparsity can be imposed on the autoencoder by penalizing the weights [4] and the average activation output of the neurons; the sparse representation helps the model to learn the general pattern by keeping useful features. Researchers have utilized sparse autoencoders in the medical field to extract features and to diagnose locomotive adhesion status [5], premature ventricular contraction [6], and glaucoma in fundus images [7]. Rehman et al. [8] analyzed the performance of stacked sparse autoencoders (SSAE) in classifying EMG-based hand motions and found that SSAE outperformed linear discriminant analysis (LDA) in recognizing the patterns. A denoising sparse autoencoder framework [9] has been developed to classify non-seizure and seizure EEGs, achieving 100% specificity, sensitivity, and recognition. The huge amount of remote sensing data has been managed with a sparse autoencoder framework [10] to retrieve desired high-resolution remote sensing images. Recently, these models have also been used for fault detection in mechanical engineering and anomaly detection in civil engineering. Pathirage et al. [11] designed a framework based on a deep learning sparse autoencoder to identify structural damages.


Fig. 1 Proposed framework for efficient crop monitoring system with autoencoder (leaf images are encoded by the trained autoencoder, the features are stored in the cloud, and the stored features serve crop investigation tasks such as disease identification, plant identification, prediction of crop yield and crop pest detection)

The rest of the paper is organized as follows: Sect. 2 describes the proposed feature learning framework with a sparse autoencoder; Sect. 3 presents the experimental results and discussion; finally, Sect. 4 summarizes the effects of dimensionality reduction in big data representation.

2 Methodology

Remote crop monitoring is a periodic monitoring process that produces high-dimensional digital data that has to be recorded in cloud storage for future investigations. The data collection process of the crop monitoring system progressively records a high volume of data in cloud space, and working on this big data requires more processing power and more storage space. In this paper, a novel framework is proposed to reduce the dimensionality of the image data into feature data without losing its information. The leaf image is captured via a drone camera, and the features are extracted from the image using the trained autoencoder, which has been trained with around 5000 sample leaf images from the leaf dataset (https://github.com/spMohanty/PlantVillage-Dataset/tree/master/raw/color). The extracted features are stored in the cloud storage space and will be used for various applications such as disease identification, plant identification, prediction of crop yield, crop pest detection, etc. Figure 1 shows the proposed framework for an efficient crop monitoring system with an autoencoder.

2.1 Autoencoder

An autoencoder is a self-supervised learning neural network in which the model automatically filters the essential representation details from the input data. The autoencoder network consists of three parts: the encoder, the decoder, and the bottleneck. The number of neurons in the consecutive layers of the encoder gradually decreases, which enforces the dimensionality reduction. The decoder section contains a sequence of hidden layers mirroring the encoder structure in reverse and ends in an output layer. The decoder weights are not initialized randomly; instead, the transposed weights of the corresponding encoder layers are assigned. The output of the autoencoder is an estimate of the input data. The autoencoder supports nonlinear data reduction, which performs better than linear data reduction methods. The model learns abstract features during the training phase using the backpropagation method, which requires adequate training data to train the encoder and decoder sections of the network. The autoencoder maps the input data to the target output data (i.e., the input data itself) through the sequence of hidden layers, and the bottleneck layer imposes data reduction in the process. During the learning phase, the model learns the sensitive patterns in the data that help to reconstruct the original image. The encoder extracts features from the input image, and the decoder network reconstructs the image with minimum error between the original data and the reconstructed details. The autoencoder learns the hidden patterns such that the reconstructed image \bar{x} is close to the input image x. In the encoding part, the input layer is a vector representation of an input image consisting of pixel data. While training, the network weights are adjusted to minimize the reconstruction error \text{loss} = \| x - \bar{x} \|. Autoencoders are data specific: an encoder model trained for one domain may not be suitable for another domain. It reconstructs an image that is almost, but not exactly, similar to the original image, because the compression learned by the encoder is lossy.

h(x) = x \cdot w + b    (1)

\bar{x} = h(x) \cdot \bar{w} + \bar{b}    (2)

where w and \bar{w} represent the weight matrices, and b and \bar{b} represent the biases of the encoder and decoder, respectively. The reconstruction error should be optimized as follows:

\min_{(w, b, \bar{w}, \bar{b})} \sum_{i=1}^{n} \| x_i - \bar{x}_i \|^2    (3)

2.2 Sparse Autoencoder

Overfitting is a major issue when mapping input vectors into high-dimensional feature vectors. Regularization is a strategy that helps prevent model overfitting by ignoring some of the features or by driving the weights connected to these features towards zero. The sparse autoencoder is an enhanced autoencoder that includes a sparsity penalty term, which limits the average activation value of the hidden-layer neurons. An autoencoder is a dense neural network; during training, the network may have a few connections with near-zero weight, and sparsity improves the performance of the model by removing these. Imposing sparsity between the input layer and the hidden layer sets the level of sparsity that determines the active connections for mapping the input into features. The sparsity level can be set between 0.1 and 0.9; a level of 0.1 states that 10% of the connections are expected to have a weight very close to zero, so they can be skipped at computation time. Sparsity is added to the cost function to control the average output activation of a neuron, formulated as

\hat{p}_i = \frac{1}{m} \sum_{j=1}^{m} z_i(x_j)    (4)

where i is the ith neuron, m is the total number of training samples, and j is the jth training sample. The sparsity regularization is defined using the Kullback–Leibler divergence [12] as

\Omega_{\text{sparsity}} = \sum_{i=1}^{d} \left[ p \log\frac{p}{\hat{p}_i} + (1 - p) \log\frac{1 - p}{1 - \hat{p}_i} \right]    (5)

where d is the total number of neurons in a layer and p is the desired activation value, named the sparsity proportion. Further, an L2 norm regularization term is included in the cost function to control the weights:

\Omega_{\text{weights}} = \frac{1}{2} \sum_{l=1}^{L} \sum_{j=1}^{N} \sum_{i=1}^{K} \left( w_{ji}^{(l)} \right)^2    (6)

where L, N, and K are the number of hidden layers, the number of observations, and the number of features, respectively. The updated cost function after adding the regularization terms in Eqs. (5) and (6) is

E = \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{K} \left( x_{ij} - \bar{x}_{ij} \right)^2 + \lambda \cdot \Omega_{\text{weights}} + \beta \cdot \Omega_{\text{sparsity}}    (7)
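
A compact sketch of Eqs. (4)-(7) as a trainable model, assuming TensorFlow/Keras (the paper does not state its framework); the layer sizes follow Sect. 3.2 (49,152 inputs, 100 hidden units), while the regularization weights shown are placeholders, not the tuned values:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

RHO, BETA, LAMBDA = 0.3, 4.0, 1e-4   # sparsity proportion p, sparsity weight, L2 weight

class KLSparsity(tf.keras.regularizers.Regularizer):
    """KL-divergence penalty on the average activation of each hidden neuron (Eqs. 4-5)."""
    def __init__(self, rho=RHO, beta=BETA):
        self.rho, self.beta = rho, beta
    def __call__(self, activations):
        rho_hat = tf.reduce_mean(activations, axis=0) + 1e-10        # Eq. (4)
        kl = (self.rho * tf.math.log(self.rho / rho_hat)
              + (1.0 - self.rho) * tf.math.log((1.0 - self.rho) / (1.0 - rho_hat + 1e-10)))
        return self.beta * tf.reduce_sum(kl)                          # Eq. (5)

inputs = tf.keras.Input(shape=(49152,))                               # 128 x 128 x 3, flattened
encoded = layers.Dense(100, activation="sigmoid",
                       kernel_regularizer=regularizers.l2(LAMBDA),    # Eq. (6)
                       activity_regularizer=KLSparsity())(inputs)
decoded = layers.Dense(49152, activation="sigmoid")(encoded)          # reconstruction x_bar

autoencoder = tf.keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")                     # Eq. (7): MSE + penalties
# autoencoder.fit(x_train, x_train, epochs=50, batch_size=32)         # x_train scaled to [0, 1]
```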

3 Results and Discussion

3.1 Dataset

Training and testing data have been acquired from the leaf dataset. Around 5000 leaf samples from the healthy and infected leaf categories are selected for training the autoencoder. The images are in JPEG format with a dimension of 128 × 128 × 3. Color leaf images are used for this research, since color plays a key role in categorizing healthy and infected leaves. A few sample images from the dataset are shown in Fig. 2.

Fig. 2 Sample images. a Healthy leaf. b Infected leaf

3.2 Experimental Setup

The proposed autoencoder maps 49,152 pixel values into 100 features using a backpropagation neural network by minimizing the reconstruction error. This dense autoencoder has about 5 million connections between the input layer neurons and the hidden layer neurons: each neuron in the hidden layer is connected to all 49,152 neurons of the input layer. Out of all these connections, only a few are strong; the others are weak or useless for activating the neuron in the hidden layer. These sparse active connections should be made stronger to highlight the latent structure of the input data. The sparse autoencoder model is enforced by fixing the regularization parameters of the model, and a slight variation in a parameter makes a significant difference in the model's performance. Selecting appropriate values for these hyperparameters is a challenging task in designing a sparse autoencoder; moreover, the hyperparameter values of an autoencoder are not general, which means a model trained for one domain cannot be fit for another domain problem. In this experiment, the optimal values of hyperparameters such as the level of sparsity, the L2R coefficient, and the sparse regularizing coefficient are determined for an efficient feature representation of the given leaf dataset.

3.3 Regularization in Autoencoder

Sparsity is one of the regularization paradigms that normalizes the activations in a fully connected neural network by endorsing the most-contributing connections and quashing the least-contributing ones. This makes the connections sparse, so the computational cost is reduced and the performance of the model is improved. In this experiment, the performance of the autoencoder model is evaluated on the leaf dataset with different levels of sparsity, using MSE and PSNR as performance metrics. The mean squared error of the model for various levels of sparsity is shown in Fig. 3.


Fig. 3 Performance of sparse AE with various levels of sparsity

The obtained results suggest that imposing sparsity on the dense autoencoder considerably improves the performance of the model. From the plot, it is apparent that the model yields the minimum mean squared error at sparsity levels of 0.3 and 0.7 in all color channels Red (R), Green (G) and Blue (B). However, the AE reconstructs the images with a maximum PSNR value of 18.2185 at a sparsity level of 0.3. This result indicates that the proposed AE model produces good results at a sparsity level of 0.3, where the model employs 70% of the previous layer's neurons for activating its output. L2 norm weight regularization (L2R) adds the sum of the squared weights to the cost; it shrinks weights with large magnitudes but does not produce exactly zero weight connections. The influence of the L2 norm coefficient on the performance of the model is analyzed with various L2R coefficient values at a fixed sparsity level of 0.3. The model performance is measured for every color channel with various L2R coefficient values, and the obtained results are plotted in Fig. 4. Since the model has already been regularized by fixing the desired level of sparsity, only minimal variation in MSE is observed across the L2R coefficient values; however, the AE model produces the minimum MSE at an L2R coefficient of 0.0001.

Fig. 4 Performance of sparse AE with various L2R coefficient values
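
Both reported metrics follow directly from the original and reconstructed images; a minimal sketch, assuming NumPy arrays scaled to [0, 1]:

```python
import numpy as np

def mse(original, reconstructed):
    return np.mean((original - reconstructed) ** 2)

def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_value]."""
    return 10.0 * np.log10(max_value ** 2 / mse(original, reconstructed))

# reconstructed = autoencoder.predict(x_test)
# print(mse(x_test, reconstructed), psnr(x_test, reconstructed))
```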


Fig. 5 Data variance analysis with PCA plots (left: original dataset; right: encoded feature set)

3.4 Exploring the Encoded Features

The autoencoder compresses the high-dimensional image data into a low-dimensional feature set without losing the essential information. In this experiment, leaf images belonging to two categories, healthy and infected, are taken for representation learning through self-supervised learning. The proposed SAE model reduces the input data from a dimension of 128 × 128 × 3 to 100 features; the data is thus compressed by a factor of roughly 491 relative to the original data size. A scatter plot of the first two principal components demonstrates the discriminative nature of the dataset, and this PCA plot also shows the amount of data variance that exists in the dataset for categorizing healthy and infected leaves. Scatter plots of the first two principal components of the original dataset and of the encoded feature set are shown in Fig. 5; from the figure, it is clear that the sparse autoencoder learns the variant features from the image without losing the essential information. The efficiency of the autoencoder is verified with the PSNR value between the original image and the reconstructed image. The number of neurons in the hidden layer determines the scale of dimensionality reduction; in this experiment, the performance of the sparse autoencoder with different hidden-layer dimensions is analyzed. The PSNR value for the SAE with 100 features is better than that of the SAE with 50 features, which confirms that a higher number of neurons in the hidden layer minimizes the reconstruction error. There is no standard rule for fixing the number of neurons in the hidden layer; based on the nature of the application, the number of features can be decided. If the application requires quality images for analysis, a greater number of neurons in the hidden layer is recommended; if the application needs only details for classification or prediction, a limited number of neurons in the hidden layer is sufficient.
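
The PCA comparison of Fig. 5 can be reproduced along these lines with scikit-learn and matplotlib, assuming encoded_features holds the 100-dimensional encoder output and labels marks healthy versus infected leaves (e.g., 0/1); the variable names are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_first_two_components(data, labels, title):
    """Scatter the first two principal components, coloured by leaf class (0/1)."""
    components = PCA(n_components=2).fit_transform(data)
    plt.figure()
    plt.scatter(components[:, 0], components[:, 1], c=labels, cmap="coolwarm", s=8)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title(title)

# plot_first_two_components(original_pixels, labels, "Original dataset")
# plot_first_two_components(encoded_features, labels, "Encoded feature set")
# plt.show()
```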


4 Conclusion Autoencoder is a preferred computational paradigm for dimensionality reduction that helps in efficient representation of big data. In this paper, a sparse autoencoder is developed to extract essential features from leaf image for remote crop monitoring. Also, the optimal value for the regularization parameters such as sparsity level and L 2 norm coefficient for the proposed sparse autoencoder is identified. The efficiency of the model is quantitatively analyzed in terms of MSE and PSNR. The proposed SAE model with a sparsity level of 0.3 and 0.0001 L2R coefficient value could reconstruct the image from the 100 encoded features with MSE of 0.0055 and PSNR value of 18.2185. The discriminant ability of features extracted by sparse autoencoder is explored with PCA scatter plot and it confirms that the proposed SAE model reduces the data dimension without affecting the data variance required for the future classification tasks.

References

1. Z. Hu, Y. Song, Dimensionality reduction and reconstruction of data based on autoencoder network. J. Electron. Inf. Technol. 31(5), 1189–1192 (2009)
2. P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 11, 3371–3408 (2010)
3. A. Ng et al., Sparse autoencoder. CS294A Lect. Notes 72, 1–19 (2011)
4. F. Li, J.M. Zuraday, W. Wu, Sparse representation learning of data by autoencoders with L1/2 regularization. Neural Netw. World 28(2), 133–147 (2018)
5. C. Zhang, X. Cheng, J. Liu, J. He, G. Liu, Deep sparse autoencoder for feature extraction and diagnosis of locomotive adhesion status. J. Control Sci. Eng. 2018 (2018)
6. J. Yang, Y. Bai, G. Li, M. Liu, X. Liu, A novel method of diagnosing premature ventricular contraction based on sparse auto-encoder and softmax regression. Biomed. Mater. Eng. 26(s1), S1549–S1558 (2015)
7. S. Pratiher, S. Chattoraj, K. Vishwakarma, Application of stacked sparse autoencoder in automated detection of glaucoma in fundus images, in Unconventional Optical Imaging, vol. 10677 (2018), p. 106772X
8. M. ur Rehman et al., Stacked sparse autoencoders for EMG-based classification of hand motions: a comparative multi day analyses between surface and intramuscular EMG. Appl. Sci. 8(7), 1126 (2018)
9. Y. Qiu, W. Zhou, N. Yu, P. Du, Denoising sparse autoencoder-based ictal EEG classification. IEEE Trans. Neural Syst. Rehabil. Eng. 26(9), 1717–1726 (2018)
10. W. Zhou, Z. Shao, C. Diao, Q. Cheng, High-resolution remote-sensing imagery retrieval using sparse features by auto-encoder. Remote Sens. Lett. 6(10), 775–783 (2015)
11. C.S.N. Pathirage, J. Li, L. Li, H. Hao, W. Liu, R. Wang, Development and application of a deep learning-based sparse autoencoder framework for structural damage identification. Struct. Health Monit. 18(1), 103–122 (2019)
12. B.A. Olshausen, D.J. Field, Sparse coding with an overcomplete basis set: a strategy employed by V1? Vis. Res. 37(23), 3311–3325 (1997)

Determination of Elements in Human Urine for Transient Biometrics N. Ambiga and A. Nagarajan

Abstract The objective of this research paper is to measure the concentration of various elements in human body fluids such as urine and to assess them for identifying a person for transient biometrics. Sixty-five urine samples were obtained from healthy persons, and the data were analysed for biometrics. The major and trace elements in the body fluid urine are profiled to identify each individual and also to indicate environmental exposure to a variety of toxins. The sixty-five urine samples from ten healthy persons were collected on one day (intra) or on two non-consecutive days (inter). The analysis covers major elements such as sodium (Na), magnesium (Mg), potassium (K) and calcium (Ca) and trace elements such as cobalt (Co), copper (Cu), arsenic (As), rubidium (Rb), molybdenum (Mo), lead (Pb), zinc (Zn) and strontium (Sr). This study was carried out using inductively coupled plasma mass spectrometry (ICP-MS). The element levels in human body fluids such as blood, urine and serum are affected by dietary intake, environmental factors and physiological factors. The major elements Na, Ca, K, Mg and the trace elements Co, Zn, As, Sr, Mo, Pb, Cu, Rb are studied here for their variation or consistency to prove that they can be used for biometrics. This paper provides the element concentrations of 12 major and trace urinary elements in order to determine whether the levels are consistent for a particular period of time.

1 Introduction The human body consists of major elements and trace elements which are available in the body fluids such as urine, blood, milk and sweat. Analysis of urine samples provides better understanding of its level and availability of elements better than other body fluids or tissues. The maintenance of proper balance of elements is important for N. Ambiga (B) · A. Nagarajan Department of Computer Applications, Alagappa University, Karaikudi, Tamil Nadu, India e-mail: [email protected] A. Nagarajan e-mail: [email protected] © Springer Nature Singapore Pte Ltd. 2021 J. D. Peter et al. (eds.), Intelligence in Big Data Technologies—Beyond the Hype, Advances in Intelligent Systems and Computing 1167, https://doi.org/10.1007/978-981-15-5285-4_22


the wellness of human beings, and element imbalance may cause diseases and illness. The study of urine is vital to investigate the concentration of elements in the body fluid urine considered for biometrics. The determination and analysis of elements in urine have been the vast and new area of interest in the biometric authentication. Urine plays a vital role to determine the concentration of elements due to the fact that the sample preparation and collection process are so simple and also the storage is easy. Analysis of biological fluid of human, urine samples depicts the total intake of certain elements in the human body is better than the other biological fluids such as human blood, saliva, serum and sweat. Thus, this study was aimed at analysing the levels of various elements (lead, cadmium, zinc and copper) in human urine for biometric authentication. Urine analysis also helps to determine the diseases compared to another body fluids such as faeces and blood. The urine analysis is easy, there is no need of trained clinicians to collect the sample, and it is non-invasive [1]. In this research paper, we analysed the variability of elemental concentration in healthy human urine on inter-, intra-days by inductively coupled plasma quadrupole mass spectrometry (ICP-QMS). We present an investigation that some element level can be stable for a particular period of time and so considered for biometric authentication. The elements are monitored using ICP-MS, for both major elements and (Na, Ca, Mg, k) and trace elements (As, Co, Zn, Rb, St, Mo, Cu and Pb). The human body consists of 98% of nine non-metallic elements. Trace elements present are of 0.012%. However, the analysis and determination of trace elements are very crucial for transient biometric authentication. They play important roles in toxicity of human and also help for the normal biological function. Several of these elements are inevitable for the body functions and also essential for the life [2]. The analysis of the status of the human body can be done using different biological fluids (blood, serum, sweat, milk) and tissues (nail and hair). Tissue may be the best specimen but cannot be obtained easily, and sufficient reference information is not available. The external contamination is a problem in hair. So, the hair cannot be widely used [3]. The analysis of the elements in biological fluids, either plasma/serum, whole blood or urine provides useful information about the metabolism of the body. The analysis of the human body fluids for biometric authentication requires a versatile and reliable method. The analytical method used must be precise, sensitive, accurate and relatively fast. So, the ICP-MS method is used to measure the elements in urine. Trace elements may be either essential or non-essential to the human body. Essential elements are required for the human body to perform its normal physiological function. The intake of the inadequate essential elements causes impairment to human health [4]. The non-essential trace elements are considered toxic when the intake is high and they are not required for the physiological function of the body. Essential elements can also cause toxicity when excess in the concentration.


Table 1 Details of healthy human participants H1–H10

Participant                                                   H1  H2  H3  H4  H5  H6  H7  H8  H9  H10
No. of samples provided                                       8   6   5   9   4   8   9   4   7   5
Sampling period (days)                                        2   2   1   2   1   2   2   1   2   1
Gender (F-female, M-male)                                     F   F   F   F   F   F   M   M   M   M
Smoking status (N-never, O-occasionally, C-current, P-past)   N   O   N   N   C   P   P   N   N   N

The normal life cycle of the human body is affected without the essential elements; thus, monitoring the excess or deficiency of essential elements in the human body is important for human health. However, the levels of these elements in biological fluids such as blood, urine, serum and sweat are affected by environmental factors, dietary intake and physiological factors, and therefore considerable variations can occur between specific population subgroups. Urine contains 95% water; the remainder consists of chemical components such as bicarbonates, uric acid, urea, organic acids, glucose and proteins. Mineral salts, or major elements, make up about 2.5% of urine, including Na, K, Ca, Fe and Mg. The contents of blood are variable and depend on many factors, including age, gender, health, medicines and diet. Elements enter the human body in different ways: they are transferred from liquids, air, food and drugs through the skin, GI tract and respiratory tract, distributed throughout the body by the blood, accumulated in different body parts such as muscle, liver, kidney and bone, and excreted as sweat, urine and faeces [5].

2 Materials and Methods

2.1 Samples

Sixty-five urine samples from ten participants were taken; among them, six participants were women and four were men, and the participants' ages were between 23 and 33 years. Of these ten participants, six healthy participants gave urine samples on two non-consecutive days and four healthy participants gave samples on only one day. The details of the ten healthy human participants are given in Table 1.

2.2 Sample Analysis

The exclusion criteria applied to participant samples are (1) those who are pregnant; (2) those who are under 18 years of age; (3) those who use any medication; (4) those who are between Days 1 and 8 of the menstrual cycle at the time of sample collection; (5) those with a history of kidney or urethral infection; and (6) those with urinary tract infections.

2.3 Methods

2.3.1 Data Sets

The data set was from Electronic Supplementary Material (ESI) for RSC Advances.

2.3.2 Statistical Analysis

The data from Table 2 are analysed by comparing each sample with the average of the same day (intra-day) and with the average over the two non-consecutive days (inter-day).
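
A minimal sketch of this intra-/inter-day comparison, assuming the measurements are arranged as a tidy table with participant, day, and one column per element (the file name and column layout are assumptions, not the ESI file's actual format):

```python
import pandas as pd

# Assumed tidy layout: one row per urine sample, one column per measured element.
df = pd.read_csv("urine_elements.csv")   # columns: participant, day, Na, Ca, K, Mg, ...
elements = [c for c in df.columns if c not in ("participant", "day")]

# Intra-day averages: mean of the samples a participant gave on a single day.
intra_day = df.groupby(["participant", "day"])[elements].mean()

# Inter-day averages: mean over the two non-consecutive days for each participant.
inter_day = df.groupby("participant")[elements].mean()

# Coefficient of variation (%) per participant as a simple consistency measure.
cv = 100 * df.groupby("participant")[elements].std() / inter_day
print(cv.round(1))
```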

3 Discussion 3.1 Variability of Elemental Concentration for Single Individuals To analyse the variability of elemental concentration in urine for different healthy individuals, the average of sample values was calculated for each day as well as for two non-consecutive days. The elemental concentrations of the individual samples taken from each healthy participant, both over a single day (intra) and over two non-consecutive days, were taken for consideration.

3.2 Within-Day Elemental Concentration Variations (Intra-day) for Healthy Participants

The one-day concentration values calculated from the healthy individuals show that the within-day concentrations of the elements Cu and Pb remained constant.

Table 2 Elemental concentrations of the individual urine samples for participants H1–H10 on Days 1 and 2: Na (mg/ml), Ca (µg/ml), K (mg/ml), Mg (µg/ml) and further elements (the multi-column sample values are not reproduced here)