Advances in Intelligent Systems and Computing 1319
Jagdish Chand Bansal Lance C. C. Fung Milan Simic Ankush Ghosh Editors
Advances in Applications of Data-Driven Computing
Advances in Intelligent Systems and Computing Volume 1319
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Nikhil R. Pal, Indian Statistical Institute, Kolkata, India Rafael Bello Perez, Faculty of Mathematics, Physics and Computing, Universidad Central de Las Villas, Santa Clara, Cuba Emilio S. Corchado, University of Salamanca, Salamanca, Spain Hani Hagras, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK László T. Kóczy, Department of Automation, Széchenyi István University, Gyor, Hungary Vladik Kreinovich, Department of Computer Science, University of Texas at El Paso, El Paso, TX, USA Chin-Teng Lin, Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan Jie Lu, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia Patricia Melin, Graduate Program of Computer Science, Tijuana Institute of Technology, Tijuana, Mexico Nadia Nedjah, Department of Electronics Engineering, University of Rio de Janeiro, Rio de Janeiro, Brazil Ngoc Thanh Nguyen , Faculty of Computer Science and Management, Wrocław University of Technology, Wrocław, Poland Jun Wang, Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
The series “Advances in Intelligent Systems and Computing” contains publications on theory, applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually all disciplines such as engineering, natural sciences, computer and information science, ICT, economics, business, e-commerce, environment, healthcare, life science are covered. The list of topics spans all the areas of modern intelligent systems and computing such as: computational intelligence, soft computing including neural networks, fuzzy systems, evolutionary computing and the fusion of these paradigms, social intelligence, ambient intelligence, computational neuroscience, artificial life, virtual worlds and society, cognitive science and systems, Perception and Vision, DNA and immune based systems, self-organizing and adaptive systems, e-Learning and teaching, human-centered and human-centric computing, recommender systems, intelligent control, robotics and mechatronics including human-machine teaming, knowledge-based paradigms, learning paradigms, machine ethics, intelligent data analysis, knowledge management, intelligent agents, intelligent decision making and support, intelligent network security, trust management, interactive entertainment, Web intelligence and multimedia. The publications within “Advances in Intelligent Systems and Computing” are primarily proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. Indexed by DBLP, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH, Japanese Science and Technology Agency (JST). All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/11156
Editors
Jagdish Chand Bansal, Department of Mathematics, South Asian University, New Delhi, Delhi, India
Lance C. C. Fung, Murdoch University, Murdoch, WA, Australia
Milan Simic, School of Engineering, RMIT University, Melbourne, VIC, Australia
Ankush Ghosh, School of Engineering and Applied Sciences, The Neotia University, Sarisha, West Bengal, India
ISSN 2194-5357 ISSN 2194-5365 (electronic) Advances in Intelligent Systems and Computing ISBN 978-981-33-6918-4 ISBN 978-981-33-6919-1 (eBook) https://doi.org/10.1007/978-981-33-6919-1 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Preface
Data-driven computing is a new field of computational analysis which uses provided data to directly produce predictive outcomes. Recent works in this developing field have established important properties of data-driven solvers, accommodated noisy data sets and demonstrated both quasi-static and dynamic solutions within mechanics. The research and development of data-driven systems we see today is only the tip of the iceberg. Sensor data from hardware-based systems is building a mammoth database that grows day by day. Recent advances in big data generation and management have created an avenue for decision-makers to utilize these huge volumes of data for different purposes and analyses. AI-based application developers have long utilized conventional machine learning techniques to design better user interfaces and vulnerability predictions. However, with the advancement of deep learning and neural network algorithms, researchers are able to explore and learn more about data and their exposed relationships or hidden features. This new trend of developing data-driven application systems seeks the adaptation of computational neural network algorithms and techniques in many application domains, including software systems, cybersecurity, human activity recognition and behavioural modelling. As such, computational neural network algorithms can be refined to address problems in data-driven applications. This book aims to foster machine and deep learning approaches to data-driven applications, in which data governs the behaviour of applications. Original research and review articles on model- and data-driven applications using computational algorithms are included as separate chapters. The reader will learn about various data-driven applications and their behaviours in order to extract key features.
The book will enable researchers from academia and industry to share innovative applications and creative solutions to common problems using data-driven computational analysis. We would like to express our thanks to Prof. Janusz Kacprzyk, the Series Editor-in-Chief, for his ongoing encouragement and support in realizing this publishing project. We are indebted to the professionals at Springer; the team has made the overall production process smooth and efficient.

Jagdish Chand Bansal, New Delhi, India
Lance C. C. Fung, Murdoch, Australia
Milan Simic, Melbourne, Australia
Ankush Ghosh, Sarisha, India
Contents
Genetic Algorithm-Based Two-Tiered Load Balancing Scheme for Cloud Data Centers
Koushik Majumder, Koyela Chakrabarti, Rabindra Nath Shaw, and Ankush Ghosh

KNN-DK: A Modified K-NN Classifier with Dynamic k Nearest Neighbors
Nazrul Hoque, Dhruba K. Bhattacharyya, and Jugal K. Kalita

Identification of Emotions from Sentences Using Natural Language Processing for Small Dataset
Saurabh Sharma, Alok Kumar Tiwari, and Dinesh Kumar

Comparison and Analysis of RNN-LSTMs and CNNs for Social Reviews Classification
Suraj Bodapati, Harika Bandarupally, Rabindra Nath Shaw, and Ankush Ghosh

Blockchain-Based Model for Expanding IoT Device Data Security
Anand Singh Rajawat, Romil Rawat, Kanishk Barhanpurkar, Rabindra Nath Shaw, and Ankush Ghosh

Linear Dynamical Model as Market Indicator of the National Stock Exchange of India
Prabhat G. Dwivedi

E-Focused Crawler and Hierarchical Agglomerative Clustering Approach for Automated Categorization of Feature-Level Healthcare Sentiments on Social Media
Saroj Kushwah and Sanjoy Das

Error Detection Algorithm for Cloud Outsourced Big Data
Mohd Tajammul, Rabindra Nath Shaw, Ankush Ghosh, and Rafat Parveen

Framing Fire Detection System of Higher Efficacy Using Supervised Machine Learning Techniques
Sudip Suklabaidya and Indrani Das

Twitter Data Sentiment Analysis Using Naive Bayes Classifier and Generation of Heat Map for Analyzing Intensity Geographically
Jyoti Gautam, Mihir Atrey, Nitima Malsa, Abhishek Balyan, Rabindra Nath Shaw, and Ankush Ghosh

Computing Mortality for ICU Patients Using Cloud Based Data
Sucheta Ningombam, Swararina Lodh, and Swanirbhar Majumder

Early Detection of Poisonous Gas Leakage in Pipelines in an Industrial Environment Using Gas Sensor, Automated with IoT
Pushan Kumar Dutta, Akshay Vinayak, Simran Kumari, and Mahdi Hussain
Editors and Contributors
About the Editors

Dr. Jagdish Chand Bansal is Associate Professor at South Asian University, New Delhi, and Visiting Faculty at Maths and Computer Science, Liverpool Hope University, UK. Dr. Bansal obtained his Ph.D. in Mathematics from IIT Roorkee. Before joining SAU New Delhi, he worked as Assistant Professor at ABV-Indian Institute of Information Technology and Management Gwalior and BITS Pilani, India. His primary area of interest is swarm intelligence and nature-inspired optimization techniques. Recently, he proposed a fission–fusion social structure-based optimization algorithm, Spider Monkey Optimization (SMO), which is being applied to various problems from the engineering domain. He has published more than 60 research papers in various international journals and conferences. He has also received the Gold Medal at UG and PG levels. He is Series Editor of Algorithms for Intelligent Systems (AIS) and Studies in Autonomic, Data-driven and Industrial Computing, published by Springer. He is Editor-in-Chief of International Journal of Swarm Intelligence (IJSI), published by Inderscience. He is also Associate Editor of IEEE ACCESS (IEEE) and ARRAY (Elsevier). He is a steering committee member and the general chair of the annual conference series SocProS. He is the general secretary of Soft Computing Research Society (SCRS).

Emeritus Professor Lance C. C. Fung was trained as Marine Radio/Electronic Officer, and he graduated with a B.Sc. degree with First Class Honours and an M.Eng. degree from the University of Wales. His Ph.D. degree from the University of Western Australia was supervised by the late Professor Kit Po Wong. Lance taught at Singapore Polytechnic, Curtin University, and Murdoch University, where he was appointed Emeritus Professor in 2015. His roles have included Associate Dean of Research and Director of the Centre for Enterprise Collaborative in Innovative Systems. He has supervised to completion over 31 doctoral students and published over 335 academic articles. His contributions can be viewed at IEEE Xplore, Google Scholar, and Scopus. Lance has been a dedicated volunteer for the IEEE in various positions for over two decades. Lance’s motto is “Learning has no Boundary”.

Dr. Milan Simic is currently with the School of Engineering, RMIT University. He is also General Editor of KES Journal; Professor at University Union Nikola Tesla, Faculty of Business and Law, Belgrade, Serbia; Adjunct Professor at Kalinga Institute of Industrial Technology (KIIT), School of Computer Engineering, Bhubaneswar, Odisha, India; and Associate Director of the Australia-India Research Centre for Automation Software Engineering (AICAUSE). He has bachelor’s, master’s, and Ph.D. degrees in Electronics Engineering from The University of Nis, Serbia, and a Graduate Diploma in Education from RMIT University, Australia. Dr. Simic has comprehensive experience from industry (Honeywell Information Systems), CISCO, research institute and academia, both overseas and in Australia. For his contributions, he has received prestigious awards and recognitions, including two for industry innovation from Honeywell and two University awards for excellence in teaching and provision of education to the community.

Dr. Ankush Ghosh is Associate Professor in the School of Engineering and Applied Sciences, The Neotia University, India, and Visiting Faculty at Jadavpur University, Kolkata, India. He has more than 15 years of experience in teaching, research and industry. He has outstanding research experience and has published more than 60 research papers in international journals and conferences. He was a research fellow of the Advanced Technology Cell-DRDO, Govt. of India. He was awarded the National Scholarship by HRD, Govt. of India. He received his Ph.D. (Engg.) degree from Jadavpur University, Kolkata, India, in 2010.
Contributors

Mihir Atrey, Department of Computer Science and Engineering, JSS Academy of Technical Education, Noida, India
Abhishek Balyan, Department of Computer Science and Engineering, JSS Academy of Technical Education, Noida, India
Harika Bandarupally, Computer Science Engineering, Chaitanya Bharathi Institute of Technology, Hyderabad, India
Kanishk Barhanpurkar, Department of Computer Science and Engineering, Sambhram Institute of Technology, Bengaluru, Karnataka, India
Dhruba K. Bhattacharyya, Department of CSE, Tezpur University, Tezpur, Assam, India
Suraj Bodapati, JPMorgan Chase and Co., New York, USA
Koyela Chakrabarti, Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, Kolkata, West Bengal, India
Indrani Das, Department of Computer Science, Assam University Silchar, Silchar, Assam, India
Sanjoy Das, Indira Gandhi National Tribal University, RCM, Imphal, India
Pushan Kumar Dutta, Amity School of Engineering and Technology, Amity University Kolkata, Kolkata, India
Prabhat G. Dwivedi, Department of Mathematics, Institute of Chemical Technology, Mumbai, India; Mithibai College of Arts, Chauhan Institute of Science, Amrutben Jivanlal College of Commerce and Economics, Mumbai, India
Jyoti Gautam, Department of Computer Science and Engineering, JSS Academy of Technical Education, Noida, India
Ankush Ghosh, The Neotia University, Sarisha, West Bengal, India; The Neotia University, Kolkata, West Bengal, India
Nazrul Hoque, Department of Computer Science, Manipur University, Imphal, India
Mahdi Hussain, Computer Engineering Department, Faculty of Engineering, University of Diyala, Baqubah, Iraq
Jugal K. Kalita, University of Colorado, Colorado Springs, USA
Dinesh Kumar, KIET Group of Institutions, Delhi-NCR, Ghaziabad, India
Simran Kumari, Amity School of Engineering and Technology, Amity University Kolkata, Kolkata, India
Saroj Kushwah, Noida International University, Greater Noida, India
Swararina Lodh, Department of Information Technology, Tripura University, Agartala, Tripura, India
Koushik Majumder, Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, Kolkata, West Bengal, India
Swanirbhar Majumder, Department of Information Technology, Tripura University, Agartala, Tripura, India
Nitima Malsa, Department of Computer Science and Engineering, JSS Academy of Technical Education, Noida, India
Sucheta Ningombam, Department of Information Technology, Tripura University, Agartala, Tripura, India
Rafat Parveen, Jamia Millia Islamia, New Delhi, India
Anand Singh Rajawat, Department of Computer Science Engineering, Shri Vaishnav Vidyapeeth Vishwavidyalaya, Indore, India
Romil Rawat, Department of Computer Science Engineering, Shri Vaishnav Vidyapeeth Vishwavidyalaya, Indore, India
Saurabh Sharma, KIET Group of Institutions, Delhi-NCR, Ghaziabad, India
Rabindra Nath Shaw, Department of Electronics and Communication Engineering, Galgotias University, Greater Noida, India
Sudip Suklabaidya, Department of Computer Science and Application, Karimganj College, Karimganj, Assam, India
Mohd Tajammul, Jamia Millia Islamia, New Delhi, India
Alok Kumar Tiwari, KIET Group of Institutions, Delhi-NCR, Ghaziabad, India
Akshay Vinayak, Amity School of Engineering and Technology, Amity University Kolkata, Kolkata, India
Genetic Algorithm-Based Two-Tiered Load Balancing Scheme for Cloud Data Centers Koushik Majumder, Koyela Chakrabarti, Rabindra Nath Shaw, and Ankush Ghosh
Abstract Demand for highly customizable cloud services has grown steadily over the years. To service user requests properly and remain profitable, cloud data centers need load balancing that ensures proper resource utilization. A robust and scalable system that can grow and shrink according to the number and size of applications saves data center power and helps the vendor meet its availability commitment. The proposed algorithm is designed in a modular fashion, where modules work with resources divided into clusters and take a hierarchical approach to delegating tasks to virtual machines. The clusters are load sensitive and scale up or down accordingly. Tasks submitted to the data center are forwarded to a suitable cluster. Inside each cluster, tasks are mapped to resources and the process is optimized using a genetic algorithm. Since each cluster receives a separate task set, this resource mapping can be done in parallel across the data center. As a result, the overall response time for a task can be reduced. The forwarding of a task set to a suitable cluster is, however, centrally controlled for faster response time. The system has a backup central node to avoid a single-point-of-failure situation, which increases the availability of the system.
Keywords Load balancing · Genetic algorithm · Data center · Virtual machine · Cluster · Scaling · SLA
K. Majumder · K. Chakrabarti, Department of Computer Science and Engineering, Maulana Abul Kalam Azad University of Technology, Kolkata, West Bengal, India
R. N. Shaw, Department of Electronics and Communication Engineering, Galgotias University, Greater Noida, India
A. Ghosh (corresponding author), The Neotia University, Kolkata, West Bengal, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_1
1 Introduction
Demand for cloud computing has escalated rapidly over the years. The extremely customizable pay-as-you-go cost model of cloud services has lured many businesses to move their applications to the cloud, a prominent one among them being the food delivery service Swiggy. These businesses have relieved themselves of maintaining servers and the associated application framework. Along with business applications, there has been a sharp rise in IoT services, which also use the cloud for processing and managing data. The main concern of the cloud vendor is therefore to provide adequate computing infrastructure so that each task request is processed within the stipulated makespan. To provide the infrastructure, the cloud vendor partitions the physical resources logically, a process termed virtualization. Different virtual machines (VMs) are employed to carry out the task requests. These VMs perform functions similar to a physical computer [1]. In public cloud systems, the data centers are quite large. Therefore, allocating a proper VM in accordance with the application demand can be time consuming if the task allocation is done centrally. Apart from that, a single-point-of-failure (SPOF) situation might arise, which can adversely affect the performance of the application. Also, if the load in the system is not properly balanced, it might give rise to excess power consumption, longer execution time or even system failure [2]. In short, a balancer should optimize resource utilization, minimize application downtime and response time, avoid overload and maximize throughput [3]. Therefore, a robust, hierarchical, distributed load balancing system is required to consistently maintain the load in the system while avoiding unnecessary power consumption. Load balancing in cloud computing has been categorized as an NP-hard optimization problem [4].
Clients have different service-level agreements (SLAs) with the vendor, and every client prefers to have its application request processed in minimum time. The vendor, however, needs to process as many application requests as possible while conforming to the defined SLA standards and without overloading resources. This conflict makes classical algorithms like first-come-first-served (FCFS) ineffective for the load balancing problem. Therefore, there has been a rise in the use of metaheuristic approaches to allocate resources or schedule tasks in cloud computing [5, 6]. The proposed work uses a genetic algorithm (GA) to provision resources in the system. The reason for choosing GA over algorithms like simulated annealing (SA), hill climbing or tabu search is its ability to explore the search space for a global optimum in a reasonable time without converging to a local optimum. The designed architecture is divided into clusters, and the load in each cluster is managed by a cluster manager. All clusters in the system have the same size, but the number of clusters can increase or decrease according to the utilization of resources in each. A central manager redirects the task load according to the load condition in each cluster. The cluster then assigns a proper VM to the task. This task-VM mapping is carried out using GA. A detailed description is given in the subsequent sections.
The rest of the chapter is arranged as follows: the motivation and contributions are summarized in Sect. 2, related work is described in Sect. 3, the proposed architecture model is detailed in Sect. 4, and Sects. 5 and 6 contain the analysis and conclusion, respectively.
2 Motivation and Contributions
Load balancing in cloud computing can be considered one of the most important ways to strike a balance between application requirements and resource utilization. This factor plays an important role in earning profit for the cloud vendor. In many of the proposed solutions, load balancing is done centrally. This chapter therefore employs both centralized and distributed architecture to minimize the single-point-of-failure (SPOF) situation in load balancing. The major contributions of the chapter are summarized as follows:
(i) The servers are separated into heterogeneous clusters so that task provisioning can be carried out in parallel in each cluster.
(ii) A central module is deployed to receive the task request from the client. This module delegates the task to an appropriate cluster for processing based on the cluster's remaining capacity of resources.
(iii) Initially, the system starts with one active cluster while the rest are powered off. As the load in the system increases, the clusters are incrementally activated. Similarly, as the load in the system decreases, the idle clusters are powered off. This saves unnecessary power consumption in the system.
(iv) Inside the cluster, for a particular task set, a genetic algorithm is applied to choose an optimum set of VMs whose configuration matches the application resource requirement.
(v) Instead of considering a fixed set of resource types such as CPU, bandwidth or RAM, a set of n resource types is considered, which is customizable according to the resources in the data center.
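To make contribution (iv) more concrete, the following is a minimal sketch of a genetic algorithm that maps tasks to VMs. It is not the chapter's algorithm: the fitness function (makespan computed from hypothetical task lengths and VM speeds), the tournament selection, the one-point crossover and all parameter values are assumptions chosen only for illustration.

```python
import random


def makespan(assignment, task_lengths, vm_speeds):
    """Finish time of the busiest VM for a given task-to-VM assignment."""
    finish = [0.0] * len(vm_speeds)
    for task, vm in enumerate(assignment):
        finish[vm] += task_lengths[task] / vm_speeds[vm]
    return max(finish)


def evolve(task_lengths, vm_speeds, pop_size=30, generations=100,
           mutation_rate=0.1, seed=0):
    """Search for a low-makespan mapping; all parameters are illustrative."""
    rng = random.Random(seed)
    n_tasks, n_vms = len(task_lengths), len(vm_speeds)

    def cost(assignment):
        return makespan(assignment, task_lengths, vm_speeds)

    # Chromosome: gene i holds the index of the VM assigned to task i.
    population = [[rng.randrange(n_vms) for _ in range(n_tasks)]
                  for _ in range(pop_size)]
    best = min(population, key=cost)
    for _ in range(generations):
        next_gen = [best[:]]  # elitism: carry the best mapping forward
        while len(next_gen) < pop_size:
            # Tournament selection of size 2 for each parent.
            p1 = min(rng.sample(population, 2), key=cost)
            p2 = min(rng.sample(population, 2), key=cost)
            # One-point crossover.
            cut = rng.randrange(1, n_tasks) if n_tasks > 1 else 0
            child = p1[:cut] + p2[cut:]
            # Mutation: reassign a task to a random VM.
            for i in range(n_tasks):
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_vms)
            next_gen.append(child)
        population = next_gen
        best = min(population, key=cost)
    return best, cost(best)
```

Calling `evolve([4, 8, 2, 6, 10, 3], [1.0, 2.0])` returns a list of VM indices, one per task, together with its makespan. A fitness function closer to the chapter's intent would presumably also account for SLA constraints and the n resource types, which this sketch leaves out.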
3 Related Work
The most common way of handling a service request in the cloud is to allocate a VM for processing each task. The responsibility lies with the vendor to provide an optimal amount of resources to the client so that the quality of service (QoS) is preserved and no loss is incurred due to over-provisioning of resources. Several algorithms have been suggested for carrying this out. Some of the prominent works are discussed in this section.
In "Load balancing in cloud computing using dynamic load management algorithm" (DLMA) [7], the authors suggest a centralized architecture. Jobs are allocated to the first free VM in a list of VMs available in the system, and the list is updated accordingly. However, the architecture does not comply with any SLA constraints and does not match task resource requirements against VM configuration.
In "Dynamic resource allocation scheme in cloud" (DRAS) [8], the authors classify a task request as one of three predefined priority types, and a lower-priority task is suspended or canceled to allocate its VM to a higher-priority task. Resource needs are not taken into consideration while preempting a running application, so this architecture also apparently works best with a homogeneous workload.
In "Adaptive management of virtualized resources in cloud computing using feedback control" (AMVR) [9], a multi-input multi-output (MIMO) feedback system is employed. A central controller adjusts the resources of each running application based on its service-level objective (SLO). Sensors are deployed to monitor the resource utilization and measure the performance of the running applications. The balancer grants or revokes resources based on the feedback data from the sensors. The performance degradation of the balancer and the network traffic from the continuous data exchange might increase with a growing number of task applications.
In "An SLA-aware load balancing scheme for cloud data centers" (SALBS) [10], a decentralized architecture is employed where the VMs are divided into clusters and each task is provisioned according to its SLA needs by a local load balancer in each cluster using an artificial neural network (ANN). The architecture avoids SPOF, but it suits only a large data center with a reasonably high workload; otherwise, it would suffer from high power consumption for maintaining a two-tier architecture in each cluster.
In "A genetic-based task scheduling algorithm in cloud computing" (GBTSA) [11], a task allocation policy is suggested. A task set is passed as input to the algorithm, and the task-VM combination that incurs the minimum execution time is chosen as the final solution. For resource allocation, only the processing speed of the VM is taken into account. In this scheme, tasks with longer execution times might suffer from starvation. Other SLA constraints, such as task priority or other resource needs, are not taken into consideration.
4 Proposed Two-Tiered Load Balancing Algorithm
The chapter proposes a three-tiered system where the physical machines (PMs) are divided into several clusters which can incrementally scale up or down according to the growing or shrinking workload [12]. Instead of considering only a few particular resource types such as CPU or I/O, the system treats resources as an n-dimensional vector where each dimension stands for a particular resource type. Each cluster has its own cluster manager to provision tasks, and the resources of the PMs are managed by a resource manager. The primary objective is to manage the system effectively in a layered architecture so that the clusters are not overloaded by task requests and the failure of one part does not bring down the system as a whole. Figure 1 gives a general overview of the architecture.

Fig. 1 Logical representation of proposed two-tiered load balancing architecture
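The n-dimensional view of resources can be illustrated with a small sketch. The class name `ResourceVector`, its methods and the example dimensions (CPU cores, RAM in GB, bandwidth in Mbps) are hypothetical and are not taken from the chapter; the only idea borrowed from the text is that each dimension stands for one resource type.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResourceVector:
    """One amount per resource type; the ordering of types is an assumption."""
    amounts: Tuple[float, ...]

    def _check(self, other: "ResourceVector") -> None:
        if len(self.amounts) != len(other.amounts):
            raise ValueError("resource vectors must have the same dimension")

    def fits_within(self, capacity: "ResourceVector") -> bool:
        """True if every requested amount is covered by the capacity."""
        self._check(capacity)
        return all(need <= have
                   for need, have in zip(self.amounts, capacity.amounts))

    def minus(self, used: "ResourceVector") -> "ResourceVector":
        """Remaining capacity after subtracting a usage vector."""
        self._check(used)
        return ResourceVector(tuple(have - take for have, take
                                    in zip(self.amounts, used.amounts)))


# Hypothetical dimensions: (CPU cores, RAM in GB, bandwidth in Mbps).
vm_capacity = ResourceVector((4, 8, 100))
task_demand = ResourceVector((2, 4, 10))
```

Here `task_demand.fits_within(vm_capacity)` holds, and `vm_capacity.minus(task_demand)` gives the capacity left on the VM. Adding a new resource type only lengthens the tuple, which is the kind of customization the n-dimensional treatment is meant to allow.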
4.1 Architectural Prerequisites
A few network arrangements are assumed or employed for the system to work.
• A gateway is employed as a global manager (GM) or the central node of the system. To increase the availability of the system in case of central node failure, a backup gateway is provided. This gateway is generally in sleep mode and will be powered on using a protocol like wake on wireless LAN.
• The global managers will be connected to all the balancers existing in the system.
• A point-to-point link is to be maintained among the local balancers for communicating information.
• Dedicated LAN connections are maintained between local balancers and resource managers, and between physical hosts and the resource manager within a cluster.
• In order to save the VM creation overhead, the VMs are initialised when a particular cluster is powered on in the system.
4.2 General System Architecture
Global Manager: The gateway accepts the client's request and forwards it to a suitable balancer. The global manager maintains a list of balancers with attributes such as balancer id, balancer status (active, powered off, busy or ready to be power cycled), task request handling capacity of the active balancers, and distance between balancers as number of hops [13]. It regularly collects information from the balancers, such as how much capacity is left to handle the workload and whether to initiate a scale-up or scale-down operation.
Local Balancer: These modules act as provisioners in the system. A local balancer accepts the request forwarded to it by the global manager. It then provisions the task according to the availability of VMs in the cluster. If it is unable to handle the task, it returns the task to the global manager, which passes the request on to another local balancer in the system. Every 5 min, it computes the task handling capacity left in the cluster based on the remaining resource capacity. If this capacity rises above or falls below a predetermined threshold, it calls the scaling operation.
Resource Manager: All the resources in the cluster are associated with a manager. The manager computes the utilization of each resource type in each VM in the cluster [14]. This information is communicated to the local balancer, which uses it to determine the load in the cluster and to provision tasks.
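The global manager's balancer list can be sketched as a simple record type plus a forwarding rule. The field names, the `Status` enum and the selection rule (nearest active balancer with enough remaining capacity, ties broken by larger capacity) are assumptions for illustration; the chapter only states which attributes are tracked and that forwarding depends on remaining capacity. The `hops` field is simplified here to a single distance per balancer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    ACTIVE = "active"
    POWERED_OFF = "powered off"
    BUSY = "busy"
    READY_TO_POWER_CYCLE = "ready to be power cycled"


@dataclass
class BalancerRecord:
    balancer_id: str
    status: Status
    remaining_capacity: int  # task requests the cluster can still accept
    hops: int                # simplified: one hop distance per balancer


def choose_balancer(records: List[BalancerRecord],
                    task_count: int) -> Optional[BalancerRecord]:
    """Pick an active balancer that can absorb the whole task set.

    Hypothetical rule: prefer fewer hops, then more remaining capacity.
    Returns None when no active balancer has enough capacity.
    """
    candidates = [r for r in records
                  if r.status is Status.ACTIVE
                  and r.remaining_capacity >= task_count]
    if not candidates:
        return None
    return min(candidates, key=lambda r: (r.hops, -r.remaining_capacity))
```

A `None` result corresponds to the situation in which the global manager would have to consider scaling up, for example by powering on another cluster.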
4.3 Load Balancing Scheme

Load balancing takes place in two steps. The first step balances load among clusters in the scalable cluster system, and the second step balances task load among VMs in individual clusters. The first tier of the load balancing is the autoscaling module and the second tier is the provisioning module.
4.3.1 Autoscaling Module

The hosts are divided into several clusters. To reduce unnecessary power consumption, the clusters are powered on according to the load condition in the system [15]. Every 5 min, the cluster balancer in each cluster checks the remaining resource capacity of the cluster. If the cluster is overloaded or underloaded, it triggers a scale-up or scale-down operation at the global balancer. The final scaling decision lies with the global balancer.
Determining Load in the System

The load in the system is calculated as a function of the utilization of each resource type of a VM. The utilization of a resource type k in VM i at a time instant T is the fraction of the total capacity of the k-type resource that is in use by the n applications running in VM i. This can be expressed as

$$Ru_{ik}(T) = \sum_{j=1}^{n} R_{jik} / R_{ik} \qquad (1)$$

where $Ru_{ik}(T)$ is the utilization of the k-type resource in VM i, $R_{ik}$ is the total capacity of the k-type resource in VM i, and $R_{jik}$ is the amount of resource k used by the jth job running in VM i.

Equation (1) gives the utilization of a resource type at a particular time instant. However, resource utilization often increases momentarily, producing a demand spike. Such spikes do not give a fair view of the utilization level of a resource (Fig. 2). Therefore, for an unbiased overview, we consider the utilization of a resource type over a time interval $\Delta t$:

$$Ru_{ik} = \frac{1}{\Delta t} \int_{t-\Delta t}^{t} Ru_{ik}(T)\, dT \qquad (2)$$

Fig. 2 Flowchart of the load condition determination by the cluster manager
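As a rough illustration of Eqs. (1) and (2), the following Python sketch computes the instantaneous utilization of one resource type in one VM and then approximates the interval average by the mean of equally spaced samples. The function names, the sample values and the discretization itself are assumptions made for illustration; the chapter does not state how the integral is evaluated.

```python
from typing import Sequence


def instantaneous_utilization(job_usage: Sequence[float], capacity: float) -> float:
    """Eq. (1): fraction of a resource's capacity used by the jobs in one VM."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return sum(job_usage) / capacity


def average_utilization(samples: Sequence[float]) -> float:
    """Eq. (2), approximated by the mean of equally spaced utilization samples
    taken over the observation interval (an assumed discretization)."""
    if not samples:
        raise ValueError("at least one sample is required")
    return sum(samples) / len(samples)


# Hypothetical numbers: three jobs using 1, 2 and 1 GB of an 8 GB VM.
u_now = instantaneous_utilization([1.0, 2.0, 1.0], capacity=8.0)

# A momentary spike (0.9) is smoothed out by the interval average.
u_avg = average_utilization([0.5, 0.5, 0.9, 0.5])
```

Averaging over the interval means a single spike moves the reported utilization only slightly, which is the behaviour the text asks for.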
Based on the above equation, each resource type of every VM in the system is analyzed, and the load condition of the cluster is classified into one of three types, as explained below.

Condition (1) Overload: The cluster manager observes the utilization level of each resource type of each VM. A predetermined threshold upper_util serves as the cutoff above which a resource type is classified as overutilized. If any resource type in a VM crosses the upper_util mark, the VM is classified as overloaded and is added to a queue. After all the overloaded VMs have been added to the queue, the fraction of overloaded VMs among all the VMs is determined. If this fraction is greater than another predetermined threshold, overload_thresh, the cluster is marked as overloaded and the cluster balancer requests a scale up. However, if there is any underloaded VM in the cluster, as described in the next condition, the cluster cannot be classified as overloaded.

Condition (2) Underload: The cluster manager classifies a VM as underutilized if the utilization levels of all its resource types fall below a predetermined threshold called lower_util. A cluster is classified as underloaded if it meets two conditions. First, the fraction of underutilized VMs must be above a predetermined threshold, underload_thresh. Second, none of the VMs in the cluster may be overloaded.

Condition (3) Neither Overloaded nor Underloaded: If a cluster is neither overloaded nor underloaded, the average resource utilization of all the VMs in the cluster is calculated. Since the resource utilization is represented as a fraction (as in Eq. 2), the remaining capacity (RRC) is expressed as:

RRC = 1 − Average_Resource_Utilization

Algorithm 1 Scale_Down_Balancer
Input: Cluster_id
1. Start
2. Initialise Flag to 0, reply to 0
3. Set Cluster_id.Status to 3 // status 3 marks the cluster as idle
4. For i = 1 to Total_number_of_Clusters in list Cluster_Info
4.1 if Cluster_Info[i].Status = 2 then
4.1.1 Flag = 1
4.1.2 break
4.1.3 end if
5. end for
6. If Flag = 1 then
6.1 reply = 0
7. else
7.1 reply = 1
7.2 set Cluster_id.Status = 0
8. end if
9. return reply
10. end
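The three load conditions and the remaining-capacity formula described in this subsection can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold values and the choice to average utilization over all resource types when computing RRC are assumptions.

```python
from typing import Dict, List

VmUtil = Dict[str, float]  # resource type -> averaged utilization fraction (Eq. 2)


def classify_cluster(vms: List[VmUtil], upper_util: float, lower_util: float,
                     overload_thresh: float, underload_thresh: float) -> str:
    """Return 'overloaded', 'underloaded' or 'normal' for one cluster."""
    if not vms:
        raise ValueError("cluster has no VMs")
    # A VM is overloaded if ANY resource type exceeds upper_util.
    overloaded = [vm for vm in vms if any(u > upper_util for u in vm.values())]
    # A VM is underloaded if ALL resource types are below lower_util.
    underloaded = [vm for vm in vms if all(u < lower_util for u in vm.values())]
    if len(overloaded) / len(vms) > overload_thresh and not underloaded:
        return "overloaded"
    if len(underloaded) / len(vms) > underload_thresh and not overloaded:
        return "underloaded"
    return "normal"


def remaining_capacity(vms: List[VmUtil]) -> float:
    """RRC = 1 - average utilization. Averaging over all VMs and all resource
    types is an assumption; the text does not say how types are combined."""
    values = [u for vm in vms for u in vm.values()]
    return 1.0 - sum(values) / len(values)


# Hypothetical thresholds and utilization figures.
busy = [{"cpu": 0.95, "ram": 0.50}, {"cpu": 0.90, "ram": 0.60}, {"cpu": 0.50, "ram": 0.50}]
state = classify_cluster(busy, upper_util=0.85, lower_util=0.20,
                         overload_thresh=0.5, underload_thresh=0.5)
```

In this example two of the three VMs exceed upper_util on CPU and no VM is underloaded, so the cluster would request a scale up.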
Algorithm 2 Scale_Up_Balancer
Input: Overloaded Cluster ID
1. Start
2. Initialise CLTOT to total number of clusters in the system, List_Reactivate_Cluster to NULL, List_Activate_Cluster to NULL, dest_id to -1, len_List_Reactivate to 0, len_List_Activate to 0, found to FALSE
3. Set Overloaded Cluster ID status to 2
4. For i = 0 to CLTOT loop
4.1 if cluster[i].status is 0 then // status 0 stands for inactive clusters
4.2 List_Activate_Cluster[index++] = cluster[i].id
4.3 end if
4.4 else if cluster[i].status is 3 then // status 3 stands for underloaded cluster
4.5 List_Reactivate_Cluster[index++] = cluster[i].id
4.6 end else if
5. End for loop
6. Calculate len_List_Reactivate = List_Reactivate_Cluster.Length
7. Calculate len_List_Activate = List_Activate_Cluster.Length
8. If len_List_Reactivate not equal to zero then
8.1 for i = 0 to len_List_Reactivate-1 loop
8.1.1 temp = List_Reactivate_Cluster[i]
8.1.2 if distance_between(Overloaded Cluster ID, temp) …

… fitness of parent[2] then insert parent”[2] in place of parent[2]
End loop
Output the global best solution that holds first rank in the parent population in last generation
Factor 1 (Feasibility): This factor governs whether a VM has sufficient resources to host an application. It is the ratio of the required capacity to the remaining capacity for a particular resource type such as CPU or RAM. For the VM to be able to run the application, this ratio can be at most 1. Mathematically, it can be expressed as

$$\mathrm{Feasibility}(r_{jik}) = req_{jk} / r_{ik} \qquad (3)$$

where $req_{jk}$ is the k-type resource requirement of the jth application and $r_{ik}$ is the remaining capacity of the k-type resource in the ith VM. If the k-type resource is unavailable in the VM, a division by zero occurs. Likewise, if $r_{ik} < req_{jk}$, we need to avoid erroneously assigning the task to that VM. To handle both situations, we replace the value of $r_{ik}$ with a very small number ($10^{-3}$ in our implementation). This makes the feasibility factor high, which in turn increases the cost of execution manifold. Since we choose the assignment with minimal cost, VM-task matchings with a high cost value will not be selected.

Factor 2 (Execution Penalty): If we consider the feasibility factor alone, tasks with low resource requirements will be placed on VMs with high resource configurations, leaving resource-intensive applications to starve. To avoid that situation, an execution penalty factor is introduced, which discourages assigning a high-configuration VM to a job with low resource requirements. Mathematically, the penalty factor is expressed as

$$\omega_{ik} = \mu \, (r_{ik}) / (req_{jk}) \qquad (4)$$
where $\omega_{ik}$ is the resource penalty factor for the kth resource of the ith VM in the system, $\mu$ is an integer constant (2 in the implementation), $r_{ik}$ is the remaining capacity of the kth resource of the ith VM and $req_{jk}$ is the amount of k-type resource needed by application j. The ratio $r_{ik}/req_{jk}$ increases as the VM configuration increases. This in turn increases the penalty factor and thereby the cost. Thus, a task is less likely to be placed on a high-configuration VM. The cost of execution of a task per resource type k on VM i can be given as:

$$C_{ik}(T) = \mathrm{Feasibility}(r_{ik}) * \omega_{ik} * (1/p) * W_E + Downtime_{ik} * W_D \qquad (5)$$

where Feasibility($r_{ik}$) gives the fraction of resource type i in VM number k utilized by the task T, and $\omega_{ik}$, the execution penalty factor, is the weighted cost relating the configuration of the ith hardware resource of the kth VM to the actual amount of resource used by the task. The factor p is an integer that stands for the priority of the task: the bigger the value, the higher the priority. Since the cost is to be minimized, we multiply by (1/p) so that the cost decreases as the priority of the task increases. $W_E$ and $W_D$ are weight factors such that $W_D + W_E = 1$. The weight factors are adjusted according to the deadline sensitivity of the application. An application with a hard deadline will have a high $W_D$, whereas an application with a soft deadline will have a lower $W_D$. The cost function computed over all the n resources needed for each task will thus be of the form

$$C(T) = \sum_{i=1}^{n} \left( \mathrm{Feasibility}(r_{ik}) * \omega_{ik} * (1/p) * W_E + Downtime_{ik} * W_D \right)$$
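The feasibility factor, the execution penalty and the per-resource cost of Eqs. (3) to (5) can be sketched in Python as below. The function names and the example numbers are hypothetical. One design choice deserves a note: if the same $r_{ik}$ and $req_{jk}$ are plugged into both Eq. (3) and Eq. (4) as written, their product simplifies algebraically to $\mu$, so the sketch takes the two factors as separate inputs to the cost rather than assuming how the authors combined them.

```python
from typing import Iterable, Tuple

EPSILON = 1e-3  # substitute for r_ik, as stated in the text
MU = 2          # value of mu used in the authors' implementation


def feasibility(required: float, remaining: float) -> float:
    """Eq. (3): req_jk / r_ik, with r_ik replaced by EPSILON when the resource
    is unavailable or smaller than the requirement."""
    if remaining <= 0 or remaining < required:
        remaining = EPSILON
    return required / remaining


def execution_penalty(required: float, remaining: float, mu: float = MU) -> float:
    """Eq. (4): mu * r_ik / req_jk."""
    if required <= 0:
        raise ValueError("required capacity must be positive")
    return mu * remaining / required


def resource_cost(feas: float, penalty: float, priority: int,
                  downtime: float, w_e: float, w_d: float) -> float:
    """Eq. (5) for one resource type on one VM."""
    if abs(w_e + w_d - 1.0) > 1e-9:
        raise ValueError("W_E and W_D must sum to 1")
    return feas * penalty * (1.0 / priority) * w_e + downtime * w_d


def task_cost(terms: Iterable[Tuple[float, float, float]], priority: int,
              w_e: float, w_d: float) -> float:
    """C(T): sum of Eq. (5) over (feasibility, penalty, downtime) triples,
    one triple per resource type."""
    return sum(resource_cost(f, w, priority, d, w_e, w_d) for f, w, d in terms)


# Hypothetical values for a single resource type.
cost = resource_cost(feas=0.5, penalty=4.0, priority=2, downtime=0.1, w_e=0.8, w_d=0.2)
```

With these made-up inputs the execution term contributes 0.8 and the downtime term 0.02, so a higher priority p or a lower downtime both reduce the cost.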
Therefore, the fitness function or objective function computed for a set of tasks will be of the form

Minimize:

$$F(T) = \sum_{j=1}^{m} \sum_{i=1}^{n} \left( \mathrm{Feasibility}(r_{ikj}) * \omega_{ikj} * (1/p) * W_E + Downtime_{ikj} * W_D \right) \qquad (6)$$
Subject to: Feasibility($r_{ik}$) ≤ 1, where j stands for any of the m tasks in the submitted task set.

Genetic Algorithm used for Provisioning: Here, a genetic algorithm (GA) has been employed as the metaheuristic to provision tasks to available VMs. GA is a soft computing technique that mimics the natural process of genetic combination and recombination between chromosomes to create offspring. In the VM provisioning context, GA has been used to output the optimal task-to-VM mapping. The reason for choosing GA over other soft computing techniques such as simulated annealing or hill climbing lies in its ability to pick random solutions from the search space, as opposed to the neighborhood search followed by the other two. It is therefore better placed to converge to a global optimum than the other two. Here, the search space is defined by the candidate VMs that can host a task. The steps involved in the GA are as follows:

Encoding: Real-valued encoding has been used. A chromosome is represented as an n-element string, where each position stands for a task and the value it contains is the id of the VM to which the task has been assigned.

Selection: Since this is a minimization problem, four-participant tournament selection has been used. Four candidates are selected at random from the solution pool and the fittest among them is chosen as a parent. The desired number of parents is selected in this way. In the simulation, 200 parents have been selected.

Crossover: This step helps the algorithm to converge toward a solution. In the simulation, single-point crossover has been used.

Mutation: While crossover drives the algorithm toward a solution, it might lead to a local minimum being chosen. Therefore, mutation is needed for a better exploration of the search space and to reach the global minimum. Mutation makes the search diverge by introducing random solutions.

Figure 3 shows the logical diagram of the genetic algorithm described above.
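The encoding, four-participant tournament selection, single-point crossover and mutation steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the population size, mutation rate, number of generations, the elitist replacement rule and the toy cost function are all hypothetical, and the cost callable stands in for Eq. (6).

```python
import random
from typing import Callable, List, Tuple

Mapping = List[int]              # position = task index, value = VM id
Cost = Callable[[int, int], float]


def fitness(mapping: Mapping, cost: Cost) -> float:
    """Total cost of a task-to-VM mapping (lower is better)."""
    return sum(cost(task, vm) for task, vm in enumerate(mapping))


def tournament(pop: List[Mapping], cost: Cost, rng: random.Random, size: int = 4) -> Mapping:
    """Pick `size` random candidates and return the one with the lowest cost."""
    return min(rng.sample(pop, size), key=lambda m: fitness(m, cost))


def crossover(a: Mapping, b: Mapping, rng: random.Random) -> Mapping:
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(m: Mapping, n_vms: int, rate: float, rng: random.Random) -> Mapping:
    """Reassign each task to a random VM with probability `rate`."""
    return [rng.randrange(n_vms) if rng.random() < rate else vm for vm in m]


def run_ga(cost: Cost, n_tasks: int, n_vms: int, pop_size: int = 20,
           generations: int = 50, mutation_rate: float = 0.05,
           seed: int = 0) -> Tuple[Mapping, float]:
    if n_tasks < 2 or pop_size < 4:
        raise ValueError("need at least 2 tasks and a population of 4")
    rng = random.Random(seed)
    pop = [[rng.randrange(n_vms) for _ in range(n_tasks)] for _ in range(pop_size)]
    best = min(pop, key=lambda m: fitness(m, cost))
    for _ in range(generations):
        children = [
            mutate(crossover(tournament(pop, cost, rng), tournament(pop, cost, rng), rng),
                   n_vms, mutation_rate, rng)
            for _ in range(pop_size - 1)
        ]
        pop = children + [best]  # assumed elitism: carry the best mapping forward
        best = min(pop, key=lambda m: fitness(m, cost))
    return best, fitness(best, cost)


def toy_cost(task: int, vm: int) -> float:
    """Hypothetical cost: task t is cheapest on VM (t mod 3)."""
    return abs(task % 3 - vm)


best, best_fitness = run_ga(toy_cost, n_tasks=6, n_vms=3)
```

Because the best mapping is carried into every new generation, the best fitness never gets worse from one generation to the next; whether the authors' replacement rule behaves the same way is not stated in the surviving text.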
5 Simulation and Experimental Analysis

The resource provisioning by genetic algorithm has been simulated in MATLAB 2014b. The experiment is carried out with a task set of 35 tasks and 20 available VMs. The fitness function for the task set, considering all the individual tasks, is calculated according to Eq. (6). The VM set that yields the minimum total fitness for the tasks is used for actual provisioning. Five resource types are considered: processor speed (in MIPS), RAM (in GB), network resources (in MBPS), CPU cores and persistent disk. Five categories of VM configuration are used: four VMs each of categories A, B and C, six VMs of category D and two VMs of category E. The configurations are listed in Table 2. VMs of the same configuration vary in delay factor between 0.1 and 0.5.

Fig. 3 Logical representation of the genetic algorithm used in provisioning of the proposed algorithm

In the simulation, the normalization factor μ (in Eq. 4) has been taken as 2. The weight factors $W_E$ and $W_D$ (as per Eq. 5) are set according to the task priority. A high-priority task is assigned a VM with a lower downtime factor. So the value of $W_D$ in this case will be less than $W_E$, since this is a minimization problem. When the priority of a task is less than or equal to 10, $W_D$ is 0.2 and $W_E$ is 0.8. Otherwise, for a lesser priority task, both $W_E$ and $W_D$ are set to 0.5. The task set of 35 tasks with different priorities is submitted to the cloud, with the 20 VMs drawn from the five categories with varying delay factors. The final allocation has been made as per the “final point” shown in the screenshot in Fig. 4. The first row depicts the tasks numbered from 1 to 35 and the second row represents the VM to which each task is assigned. The number 0 in the second row indicates that the task could not be allotted a VM in the first stage of allotment and needs to wait.

Table 2 Configuration of VMs used in simulation

| Category | Processor speed (MIPS) | RAM (GB) | CPU cores | Bandwidth (MBPS) | Persistent disk |
|----------|------------------------|----------|-----------|------------------|-----------------|
| A        | 2000                   | 4        | 1         | 1024             | 64              |
| B        | 4000                   | 8        | 2         | 2048             | 64              |
| C        | 6000                   | 16       | 4         | 2048             | 64              |
| D        | 8000                   | 32       | 8         | 4096             | 64              |
| E        | 10,000                 | 64       | 8         | 4096             | 64              |

Fig. 4 Task set to VM mapping by MATLAB simulation
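The simulation settings described in this section can be written down compactly. The Python sketch below records the VM pool and the rule for choosing the weight factors from the task priority; the names VM_COUNTS and weights_for_priority are hypothetical, and only the numbers come from the text.

```python
from typing import Dict, Tuple

# VM pool used in the simulation: category -> number of VMs.
VM_COUNTS: Dict[str, int] = {"A": 4, "B": 4, "C": 4, "D": 6, "E": 2}


def weights_for_priority(priority: int) -> Tuple[float, float]:
    """Return (W_E, W_D) following the rule stated in the text:
    priority <= 10 gives (0.8, 0.2), anything else gives (0.5, 0.5)."""
    if priority <= 10:
        return 0.8, 0.2
    return 0.5, 0.5


total_vms = sum(VM_COUNTS.values())
w_e, w_d = weights_for_priority(7)
```

The category counts add up to the 20 VMs used in the experiment, and both weight pairs satisfy the constraint that the two weights sum to 1.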
5.1 Results and Discussion

The effectiveness of the task allocation scheme is compared against the existing load balancing schemes described in [7, 11].
5.1.1 Resource Utilization

A comparative study has been carried out with randomly chosen task load and VM combinations. Based on the task allocation produced by the three algorithms, the average utilization level of the resources in the VMs active in the system is charted, with the X-axis showing the number of jobs submitted and the Y-axis showing the percentage utilization of the particular resource type.
5.1.2 Load Balancing

Since the primary objective of the algorithm is a uniform distribution of load among the VMs, a load imbalance factor $\upsilon$ has been introduced here. It is the standard deviation of the resource utilization in the system and can be represented mathematically as follows:

$$\upsilon_k = \sqrt{\sum_{i=1}^{n} \left( R_{ik} - R'_k \right)^2 / n}$$

where $\upsilon_k$ is the standard deviation of resource type k, $R_{ik}$ is the utilization percentage of the kth resource in the ith VM, $R'_k$ is the average utilization percentage of the kth resource and n is the number of active VMs in the system. In the analysis, different sets of tasks are taken, the number of VMs available in the system is 80% of the total workload, and the load imbalance in memory, processor and bandwidth for the three algorithms has been charted. The X-axis shows the number of jobs submitted and the Y-axis shows the standard deviation of load, as a percentage, for the particular resource type (Figs. 5 and 6).

Fig. 5 Chart showing the utilization percentage of the VM resources by the three algorithms

Fig. 6 Chart showing the load imbalance as percentage of the VM resources by the three algorithms
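The load imbalance factor is a population standard deviation, which can be computed as in the short Python sketch below. The function name and the utilization figures are hypothetical.

```python
from math import sqrt
from typing import Sequence


def load_imbalance(utilizations: Sequence[float]) -> float:
    """Standard deviation of one resource type's utilization across active VMs."""
    n = len(utilizations)
    if n == 0:
        raise ValueError("at least one active VM is required")
    mean = sum(utilizations) / n
    return sqrt(sum((u - mean) ** 2 for u in utilizations) / n)


# Hypothetical CPU utilization percentages of the active VMs.
balanced = load_imbalance([50.0, 50.0, 50.0])
skewed = load_imbalance([40.0, 60.0])
```

A perfectly even distribution gives an imbalance of zero, and the value grows as utilization spreads away from the mean.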
6 Conclusion and Future Work

Over-provisioning has been a cause of concern in data centers, since it leads to starvation of task sets arriving later. It is important for the cloud vendor to conform to the SLA, for example by executing a task successfully within the specified deadline. Completing a task well before its assigned deadline is not an essential requirement. The algorithm therefore assigns tasks so that the maximum number of submitted tasks is attended to while maintaining the QoS promised by the vendor. The resources in the algorithm are considered generic in nature. The formulation does not depend on the specific type of a resource, such as RAM or CPU. Therefore, the model can be implemented in cloud data centers with various types of heterogeneous resources. SLA constraints such as the reliability of a VM and task priority have also been taken into consideration in the task-to-resource mapping. Apart from that, the entire load balancing scheme has been carried out by dividing the physical resources into clusters. This allows task allocation to be carried out in parallel in individual clusters. Also, the provision to power off idle clusters helps to save energy, which is another cause of concern in data centers. However, the implementation of the autoscaling module is left as future work.
References

1. D. Bertsimas, S. Gupta, G. Lulli, Dynamic resource allocation: a flexible and tractable modeling framework. Eur. J. Oper. Res. 236(1), 14–26 (2014)
2. S.K. Mishra, B. Sahoo, P.P. Parida, Load balancing in cloud computing: a big picture. J. King Saud Univ.-Comput. Inf. Sci. (2018)
3. V. Kunamneni, Dynamic load balancing for the cloud. Int. J. Comput. Sci. Electr. Eng. (IJCSEE) 1(1) (2012)
4. A.R. Varkonyi-Koczy, A load balancing algorithm for resource allocation in cloud computing, in Recent Advances in Technology Research and Education: Proceedings of the 16th International Conference on Global Research and Education Inter-Academia 2017, vol. 660 (Springer, 2017), p. 289
5. M. Mohand, N. Melab, Y. Kessaci, Y.C. Lee, E.G. Talbi, A.Y. Zomaya, D. Tuyttens, A parallel bi-objective hybrid metaheuristic for energy-aware scheduling for cloud computing systems. J. Parallel Distrib. Comput. 71(11), 1497–1508 (2011)
6. C. Mezache, O. Kazar, S. Bourekkache, A genetic algorithm for resource allocation with energy constraint in cloud computing, in International Conference on Image Processing, Production and Computer Science, London (UK) (2016), pp. 62–69
7. R. Panwar, B. Mallick, Load balancing in cloud computing using dynamic load management algorithm, in 2015 International Conference on Green Computing and Internet of Things (ICGCIoT) (IEEE, 2015), pp. 773–778
8. A.T. Saraswathi, Y.R.A. Kalaashri, S. Padmavathi, Dynamic resource allocation scheme in cloud computing. Procedia Comput. Sci. 47, 30–36 (2015). ISSN 1877-0509
9. Y. Belkhier, A. Achour, R.N. Shaw, Fuzzy passivity-based voltage controller strategy of grid-connected PMSG-based wind renewable energy system, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, 2020, pp. 210–214. https://doi.org/10.1109/iccca49541.2020.9250838
10. R.N. Shaw, P. Walde, A. Ghosh, IOT based MPPT for performance improvement of solar PV arrays operating under partial shade dispersion, in 2020 IEEE 9th Power India International Conference (PIICON), SONEPAT, India, 2020, pp. 1–4. https://doi.org/10.1109/piicon49524.2020.9112952
11. S. Mandal, V.E. Balas, R.N. Shaw, A. Ghosh, Prediction analysis of idiopathic pulmonary fibrosis progression from OSIC dataset, in 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, 2020, pp. 861–865. https://doi.org/10.1109/gucon48875.2020.9231239
12. M. Kumar, V.M. Shenbagaraman, R.N. Shaw, A. Ghosh, Predictive data analysis for energy management of a smart factory leading to sustainability, in Innovations in Electrical and Electronic Engineering, ed. by M. Favorskaya, S. Mekhilef, R. Pandey, N. Singh. Lecture Notes in Electrical Engineering, vol. 661 (Springer, Singapore, 2021). https://doi.org/10.1007/978-981-15-4692-1_58
13. Q. Li, Q. Hao, L. Xiao, Z. Li, Adaptive management of virtualized resources in cloud computing using feedback control, in 2009 First International Conference on Information Science and Engineering, Nanjing (2009), pp. 99–102
14. C.C. Li, K. Wang, An SLA aware load balancing scheme for cloud data centers, in Information Networking (ICOIN), International Conference (2014), pp. 58–63
15. S.A. Hamad, A.O. Fatma, Genetic-based task scheduling algorithm in cloud computing environment. Int. J. Adv. Comput. Sci. Appl. 7(4), 550–556 (2016)
KNN-DK: A Modified K-NN Classifier with Dynamic k Nearest Neighbors

Nazrul Hoque, Dhruba K. Bhattacharyya, and Jugal K. Kalita

N. Hoque (corresponding author), Department of Computer Science, Manipur University, Imphal 795003, India
D. K. Bhattacharyya, Department of CSE, Tezpur University, Tezpur, Assam 784028, India
J. K. Kalita, University of Colorado, Colorado Springs, USA

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_2

Abstract K-nearest neighbor (k-nn) is a widely used classifier in machine learning and data mining, and is very simple to implement. The k-nn classifier predicts the class label of an unknown object based on the majority of the class labels of its k nearest neighbors. The prediction accuracy of the k-nn classifier depends on the user-supplied value of k and on the distance measure used to find the nearest neighbors among the training objects. If we use a static value of k for a particular classification task, the prediction accuracy of a k-nn classifier may decrease due to class imbalance in the dataset. In this paper, we propose a modified k-nn classifier that considers class imbalance in a dataset and computes an appropriate value for k. The proposed k-nn classifier has been validated on a large number of benchmark datasets from various domains. The method is compared with traditional k-nn, decision tree, random forest and SVM classifiers, and it yields significantly better prediction accuracy than the traditional k-nn classifier and the other algorithms.

1 Introduction

Classification is an important analysis tool in data mining and machine learning. In Big Data analytics in particular, classification methods are often used to predict data behavior in real time. A number of supervised learning methods have been proposed. Among these methods, k-nearest neighbor is considered one of the top 10 data mining algorithms due to its simplicity and efficiency [1]. It predicts the label of a test object based on the labels of its k nearest neighbors in the space of objects. The value of k, or the number of nearest neighbors, is decided a priori by the user. The classifier uses a proximity measure (a similarity or distance measure) to find the nearest objects of an unknown test object, and majority voting is applied to the class labels of the nearest neighbors to find the class label of the unknown object. The two major issues that need to be addressed for the k-nn algorithm are the proximity measure used to find the k nearest objects of a test object and the value of k [2]. A number of distance measures have been proposed to handle the first issue, and empirical studies reveal that the effectiveness of the k-nn algorithm depends on the behavior of the training data and on the proximity measure used on the dataset. The common conclusion on the first issue is that different applications need different distance measures [3, 4]. The value of k also plays a vital role in the classification of an unknown data object. If we use a static value for k without considering the distribution of data, the prediction accuracy of the k-nn classifier may degrade. According to Zhang et al. [5], different test data points should consider different numbers of nearest neighbors. They also note that k-nn classification methods select the k value either by setting a fixed constant for all test data or by conducting cross-validation to estimate the k value for each test data point. This often leads to a low prediction rate in real classification applications because these methods do not take the distribution of data into consideration. We illustrate this issue with the example shown in Fig. 1. In Fig. 1, when k = 5 for the whole problem space, two test data points are assigned the positive class according to the majority rule. From the distribution, k = 5 is suitable for predicting the label of the left test data point, but unsuitable for the right one.
The right test data point should be predicted to be in the negative class. This can be obtained with k = 3. These two examples indicate that different test data points should consider different numbers of nearest neighbors. In this paper, we propose a method called KNN-DK that computes different k values for predicting the class labels of unknown objects. The method considers the distribution of data objects to compute an appropriate value for k.
Fig. 1 Binary classification using the kNN method with a fixed k value, k = 5
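The effect illustrated in Fig. 1 can be reproduced with a few lines of Python. The sketch below is a plain majority-vote k-nn on made-up one-dimensional data (the points and labels are hypothetical and are not taken from the figure); it only shows that the same test point can receive different labels for k = 3 and k = 5.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[float, str]  # (position on a line, class label)


def knn_predict(train: List[Sample], query: float, k: int) -> str:
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical data: two nearby negatives, three slightly farther positives.
train = [(0.1, "neg"), (0.2, "neg"), (0.3, "pos"), (0.4, "pos"), (0.5, "pos")]

label_k3 = knn_predict(train, query=0.0, k=3)
label_k5 = knn_predict(train, query=0.0, k=5)
```

With k = 3 the two closest negatives outvote the single positive, whereas with k = 5 the three farther positives take the majority.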
1.1 Problem Definition

Given a dataset D with n classes, $C_1, C_2, \ldots, C_n$, which are imbalanced in nature, the problem is to develop a classifier using the k-nn framework with dynamic selection of k to yield high classification accuracy.
1.2 Motivation

The main motivation for this paper is that the k-nn classification framework currently requires a fixed value of k to predict the class label of an unknown data object. However, a fixed value of k is not suitable for imbalanced data distributions. When predicting the class label of a test object, the k-nn classifier should be able to choose different values of k as it considers objects belonging to different classes. This has motivated us to develop a modified version of the traditional k-nn classifier that computes the value of k dynamically, during training, considering the training data objects and the numbers of objects belonging to each of the classes. For example, in an imbalanced dataset, if we assume that the numbers of objects belonging to classes A, B and C are 50, 17, and 10, respectively, and the value of k is 5, the traditional k-nn classifier simply considers the 5 nearest neighbors to find the class label for any test object. However, due to the imbalanced data distribution, it may happen that all the k nearest neighbors of a test object belong to class A alone. As a result, the probability of obtaining A as the class label of the test object becomes high. In another case, the classifier may get two objects from class C and three objects from class A as neighbors. Although the actual class label of the test object is C, it will be predicted as A, a misclassification caused by the constant value of k. If, instead of using a constant value of k, we compute a separate k value for each class based on the number of objects belonging to the class, classification accuracy may improve. In this example, if the values of k for classes A, B and C are 5, 3 and 2, respectively, the classifier will predict the label for the test object correctly as C.
1.3 Contributions

The major contributions of this paper are listed below.

1. A novel class representative score to dynamically compute the value of k, one per class, for a given dataset.
2. A novel class relevance score estimation technique to decide on the class label for a given test object.
3. A modified k-nn classifier to yield high classification accuracy for any dataset with imbalanced classes.

1.4 Paper Organization

The rest of the paper is organized as follows. In Sect. 2, related work on various modified k-nn algorithms is discussed. The proposed k-nn algorithm is discussed in Sect. 3. In Sect. 4, we present experimental results and analysis of the method, followed by the conclusion and future work in Sect. 5.
2 Related Work

The k-nearest neighbor (k-nn) classifier is a simple, yet effective and widely used method in data mining. k-nn has been recognized as one of the top 10 algorithms in data mining [1]. The method is used in many applications such as Big Data analysis, micro-array data analysis, network anomaly detection, and image analysis. Direct application of the k-nn classifier in Big Data analytics is not feasible due to time and memory restrictions, but a few variants of the k-nn classifier have been developed to work with MapReduce. Maillo et al. [6] proposed a method called KNN-IS, an iterative Spark-based design of the k-nearest neighbors classifier for Big Data. A deep learning based method called Deep k-Nearest Neighbor has been developed by Papernot and McDaniel [7] to classify an unknown sample using deep neural networks. Arefin et al. [8] have developed a software tool called GPU-FS-kNN for fast and scalable k-nn computation using GPUs. The method is not only fast but also scalable to very large-scale instances.
3 Proposed Modified k-nn Classifier

KNN-DK is a modified version of the traditional k-nn classifier. The main challenge for a modified k-nn classifier is to determine the number of nearest neighbors, i.e., the appropriate value of k. The traditional k-nn classifier uses a static value of k when classifying any unknown object, and a static value of k yields low classification accuracy in many cases. For example, if the numbers of objects belonging to different classes are far from equal, as in an imbalanced dataset, k-nn does not obtain high classification accuracy. Instead of using a static value of k, we compute a dynamic value for classification of unknown data objects. KNN-DK does not depend on a user-specified static value of k. It computes an appropriate value of k based on the number of objects belonging to each class, so that different classes receive different values. We demonstrate that it is able to handle large variations in datasets with varied numbers of classes with varied properties for instances representing each class.

KNN-DK takes training data with n classes (say, $C_1, C_2, \ldots, C_n$) and splits the dataset into n subparts. It executes three steps to estimate an appropriate k value for the dataset, viz., (i) computation of a class representative score, (ii) proximity computation, and (iii) computation of a class relevance score.
3.1 Computation of k for Each Class

For each test object $O_i$, the method computes a relevant number of neighbors $k_j$ for each class $C_j$, where j = 1, 2, ..., n. To compute the $k_j$ scores of $O_i$, the method considers (i) the number of objects belonging to each class $C_j$, (ii) the size of the largest class in terms of the number of objects and (iii) the size of the smallest class. We estimate the $k_j$ score of $O_i$ for class $C_j$ as follows.

$$k_j = \frac{\min_{l=1}^{n} |C_l| \times |C_j|}{\max_{l=1}^{n} |C_l|} \qquad (1)$$

where $|C_l|$ is the cardinality, i.e., the number of objects, of class $C_l$ for l = 1, 2, 3, ..., n. The relevant number of neighbors $k_j$ of a test object $O_i$ for a class $C_j$ represents the number of objects from class $C_j$ that need to be considered to predict the class label of $O_i$. Assume that, in a given dataset D with three classes, say A, B and C, the numbers of objects belonging to these three classes are 70, 20 and 30, respectively. The size of the smallest class is 20 and the size of the largest is 70. Taking the integer part of Eq. (1), the relevant numbers of neighbors for object $O_i$ for these three classes will be (i) for class A, $k_A = \lfloor (20 \times 70)/70 \rfloor = 20$, (ii) for class B, $k_B = \lfloor (20 \times 20)/70 \rfloor = 5$, and (iii) for class C, $k_C = \lfloor (20 \times 30)/70 \rfloor = 8$. During prediction of the class label for $O_i$, the proposed KNN-DK considers only the 20 nearest objects from class A, the 5 nearest objects from class B and the 8 nearest objects from class C. Using these nearest objects, the method computes a class relevance score, discussed later, for each class.
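The per-class neighbor count of Eq. (1) can be sketched in Python as follows. The function name is hypothetical, and rounding down to an integer is an assumption inferred from the worked example above (20 × 20 / 70 is reported as 5).

```python
from typing import Dict


def relevant_neighbors(class_sizes: Dict[str, int]) -> Dict[str, int]:
    """Eq. (1): k_j = min|C_l| * |C_j| / max|C_l|, rounded down (assumed)."""
    if not class_sizes:
        raise ValueError("at least one class is required")
    smallest = min(class_sizes.values())
    largest = max(class_sizes.values())
    return {label: (smallest * size) // largest for label, size in class_sizes.items()}


# Class sizes from the worked example in the text.
k_per_class = relevant_neighbors({"A": 70, "B": 20, "C": 30})
```

For the class sizes used in the text this reproduces the values 20, 5 and 8 for classes A, B and C.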
3.2 Proximity Computation
The proposed KNN-DK classifier identifies class-labeled nearest neighbors of a test object using cosine distance. Cosine distance is based on the inner product of two vectors and measures the cosine of the angle between them. This measure gives a value in the range from 0 to 1, where 0 means the two vectors point in exactly the same direction, while 1 indicates the highest dissimilarity, and in-between values indicate intermediate proximities. d_{X,Y} denotes the cosine distance between two objects X = [x1, x2, ..., xn] and Y = [y1, y2, ..., yn], and the distance is computed as follows.

$$d_{X,Y} = \frac{\sum_{i=1}^{n} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{2}$$
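A small Python sketch of the proximity computation follows. Eq. 2 as printed is the normalized inner product, i.e., the cosine of the angle, which equals 1 for vectors pointing in the same direction; the convention described in the text (0 for the same direction) corresponds to one minus that value. Which of the two the authors' implementation used is not stated, so the sketch exposes both. The function names are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Normalized inner product of x and y (the right-hand side of Eq. 2)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Assumed distance form: 0 for identical direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(x, y)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

For feature vectors with non-negative entries both quantities stay within the range from 0 to 1 mentioned in the text.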
3.3 Class Relevance Score Computation
KNN-DK computes a class relevance score to decide the class label of a given object Oi for a given dataset. The method first determines the number of nearest neighbor objects to consider; this k is the relevant number of neighbors kj for class Cj. The method needs three values, viz., (i) how many objects belong to a particular class Cj among the k nearest objects, (ii) the sum of the distances from object Oi to all its nearest neighbor objects, and (iii) the sum of the distances from object Oi to those nearest neighbor objects whose class label is the same as that of Oi. Mathematically, the class relevance score of an object Oi is represented as follows.

$$r_i^j = \frac{\sum_{m=1}^{s} d(O_i, O_m)}{\sum_{p=1}^{k} d(O_i, O_p)} \tag{3}$$

Om are the nearest objects whose class label is the same as the class label of Oi, i.e., the class Cj under consideration; there are s such objects. Op, on the other hand, ranges over all the nearest objects of Oi.
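The ratio in Eq. 3 can be written as a short Python helper. This is only a sketch: the function name and the two list arguments are hypothetical, and returning 0.0 when the denominator is zero is an assumption, since the text does not discuss that case.

```python
def class_relevance(same_class_values, all_neighbor_values):
    """Eq. 3: sum of d(O_i, O_m) over the s same-class neighbors divided
    by the sum of d(O_i, O_p) over all k nearest neighbors."""
    total = sum(all_neighbor_values)
    if total == 0:
        # Assumption: a zero denominator yields a zero score.
        return 0.0
    return sum(same_class_values) / total


# Two of the three nearest neighbors carry the candidate class label.
print(class_relevance([0.2, 0.3], [0.2, 0.3, 0.5]))
```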
3.4 Prediction of Class Label for an Unknown Object
The proposed KNN-DK decides the class label of a given test object Oi based on the kj of the object for each class. To predict the class label, the method identifies k nearest neighbor objects from the training dataset, i.e., Dtrain. The value of k is not constant; rather, it is computed dynamically. The value of k is equal to the value of kj for class Cj, and it therefore differs between datasets. From the nearest neighbor objects of Oi for class Cj, the method computes the class relevance score r_i^j. Finally, it assigns to Oi the class label j for which the class relevance score r_i^j is maximum. Our method not only uses a dynamic value for k, but also considers the average distance of the object Oi to all its nearest neighbor objects for each class during prediction of its class label.

Lemma 1 In the KNN-DK framework, k is not static.

Proof KNN-DK computes the relevant number of neighbors, kj, for a test object Oi for the class Cj using Eq. 1. kj is computed based on the total number of objects in a dataset, and the minimum and maximum numbers of objects belonging to any class. The kj is considered the value of k for Cj. Hence, for different datasets as well as for different classes, the value of k is different, and hence the proof.

Lemma 2 An object Oi is assigned class label A iff the class relevance score of Oi for A, r_i^j, is the highest.

Proof The class relevance score of an object Oi is computed based on the class-based nearest neighbors. The r_i^j of an object Oi for class A considers the nearest objects belonging to class A and their distance to the object Oi. If the distance between Oi and all other nearest objects belonging to A is minimum and the distance between Oi and all other nearest objects not belonging to A is also minimum, then r_i^j will be high. But, if the distance between Oi and all other nearest objects belonging to class B is minimum but the distance between Oi and all other nearest objects not belonging to class B is maximum, then r_i^j will be low. Therefore, Oi is assigned the class label of a given class (say A) if the corresponding r_i^j is the highest, and hence the proof.

Lemma 3 KNN-DK can handle class imbalanced data.

Proof The method computes the k value dynamically using the relevant number of neighbors for the various classes as defined in Eq. 1. The kj value depends on the total number of objects belonging to the dataset, and the minimum and the maximum numbers of objects belonging to a class, which differ between datasets. The equation shows that the value of k depends on the number of objects belonging to each class in the dataset. If the dataset is imbalanced, then the method yields the k value for each imbalanced class accordingly and assigns the label of a test object based on the class relevance score. Hence, KNN-DK can handle class imbalanced data.
3.5 Proposed Framework
The framework of the proposed KNN-DK method is shown in Fig. 2, which lays out the computation steps of our method. First, the method takes the labeled training data and splits the dataset into subgroups. Each subgroup contains the data objects of one class. The method computes a score called the relevant number of neighbors, kj, of an unknown object Oi for each class. This kj is used to find that many nearest neighbor objects for each class, and then, based on these neighbors, the method computes the class relevance score r_i^j of the unknown object for each class using Eq. 3. For the unknown object, the method thus has an r_i^j for each class, and from these r_i^j values it assigns as the class label of the unknown object the class that has the maximum relevance score.
Fig. 2 Proposed framework of KNN-DK method (flowchart: training dataset Di and test object Oi; split the dataset into n classes; compute the relevant number of neighbors for the jth class; decide k for Oi for Di; compute the class relevance score; predict the class label for Oi)
3.6 The Proposed KNN-DK Method
The steps of the proposed KNN-DK method are shown in Algorithm 1.
1. Find the total number of classes (say C1, C2, ..., Cn) in the dataset D and the corresponding set of objects belonging to each class Cj.
2. For each test object, say Ot, compute the relevant number of neighbors, kj, for each class Cj. The score of Ot for a class Cj is considered the value of k for that class.
3. For each class Cj, compute r_i^j of Ot for class Cj.
4. Find the class Cj for which r_i^j(Ot) is the maximum and assign that class as the label of the test object Ot.
3.7 Complexity Analysis
The complexity of the proposed KNN-DK method depends on the number of training data objects in Dtrain. If the number of training instances in Dtrain is N, then the method takes O(N) time to split the whole dataset into different groups. For m test objects, the method takes O(m × l) + O(m × l) time to compute the class relevance scores and class representative scores for l classes. So, the total complexity of the method to classify all m test objects is O(N) + O(m × l) + O(m × l). Since it is quite likely that N ≫ m and N ≫ l, the time complexity is O(N).
Data: Training dataset Dtrain, test dataset Dtest
Result: Predicted class labels of objects in the test dataset
Steps:
  Find the number of class labels C1, C2, ..., Cn in Dtrain
  for each training object Oi ∈ Dtrain do
    if class(Oi) = Cj then
      put Oi into the group of class Cj
    end
  end
  for each test object Oi ∈ Dtest do
    for each class Cj do
      kj = compute the relevant number of neighbors of Oi for Cj using Eq. 1
    end
    Find k, i.e., k_i^j(Oi), of Oi from Dtrain
    for each class Cj do
      r_i^j = compute the class relevance score of Oi for Cj using Eq. 3
    end
    Find the class Cj for which r_i^j is maximum and assign that class as the label of Oi
  end
Algorithm 1: KNN-DK: A modified k-nn classifier
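The following Python sketch shows one possible reading of Algorithm 1; it is not the authors' MATLAB implementation. Several choices are assumptions: the proximity is the normalized inner product of Eq. 2 as printed (larger means closer), feature values are taken to be non-negative, kj uses floor division with a minimum of one neighbor, and the class relevance score is read as the share of the total proximity contributed by the selected neighbors of each class. All function and variable names are hypothetical.

```python
import math
from collections import defaultdict


def proximity(x, y):
    """Eq. 2 as printed: normalized inner product of x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def knn_dk_predict(train, test_object):
    """train is a list of (feature_vector, label) pairs."""
    # Split the training data into one group per class.
    groups = defaultdict(list)
    for features, label in train:
        groups[label].append(features)

    sizes = {label: len(members) for label, members in groups.items()}
    smallest, largest = min(sizes.values()), max(sizes.values())

    # Keep the k_j most proximate objects of every class (Eq. 1).
    nearest = {}
    for label, members in groups.items():
        k_j = max(1, (smallest * sizes[label]) // largest)  # assumption: floor, at least 1
        scores = sorted((proximity(test_object, m) for m in members), reverse=True)
        nearest[label] = scores[:k_j]

    # Eq. 3 under the stated reading: class share of the total proximity.
    total = sum(sum(values) for values in nearest.values())
    relevance = {
        label: (sum(values) / total if total else 0.0)
        for label, values in nearest.items()
    }
    return max(relevance, key=relevance.get), relevance


train = [
    ([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([1.0, 0.2], "A"),
    ([0.0, 1.0], "B"), ([0.1, 0.9], "B"),
]
label, scores = knn_dk_predict(train, [1.0, 0.1])
print(label, scores)
```

In this toy dataset class A has three objects and class B has two, so the sketch keeps two neighbors from A and one from B before comparing the relevance scores.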
4 Experimental Results
Experiments were carried out on a workstation with 12 GB main memory, a 2.26 GHz Intel(R) Xeon processor and the 64-bit Windows 7 operating system. We implemented our method using MATLAB R2015a.

4.1 Datasets Used
We chose thirty benchmark datasets with varying numbers of instances and dimensionalities. We group them into three major categories, namely UCIgeneral, UCIdisease and network (Table 1).

4.2 Result Analysis
The proposed method is validated using four different metrics, viz., accuracy, precision, recall and f-measure. We use tenfold cross validation for an unbiased performance analysis of our KNN-DK.
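For reference, the four metrics named above can be computed from the counts of a binary confusion matrix as in the sketch below. The function name is hypothetical, and returning NaN for an undefined ratio is an assumption about how such cases might be handled; how the chapter averages the metrics over classes and folds is not stated.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and f-measure from confusion counts."""
    nan = float("nan")
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else nan
    recall = tp / (tp + fn) if (tp + fn) else nan
    if precision != precision or recall != recall or precision + recall == 0:
        f_measure = nan  # undefined when a component is NaN or both are zero
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

The f-measure is the harmonic mean of precision and recall, so it is undefined whenever both are zero or either one cannot be computed.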
Table 1 Details of datasets used in analysis

| Dataset | #Objects | #Attributes | #Classes | Type |
|---|---|---|---|---|
| Zoo | 101 | 17 | 7 | UCIgeneral |
| Accute1 | 120 | 8 | 2 | UCIgeneral |
| Accute2 | 120 | 8 | 2 | UCIgeneral |
| Hayes Roath | 160 | 4 | 3 | UCIgeneral |
| House voter | 232 | 16 | 2 | UCIgeneral |
| Iris | 150 | 4 | 3 | UCIgeneral |
| Wine | 178 | 13 | 3 | UCIgeneral |
| TicTacToe | 958 | 9 | 2 | UCIgeneral |
| StateHeartLog | 270 | 13 | 2 | UCIgeneral |
| Liver | 345 | 6 | 2 | UCIgeneral |
| Glass | 214 | 9 | 7 | UCIgeneral |
| Ecoli | 336 | 7 | 8 | UCIgeneral |
| CMC | 1473 | 9 | 3 | UCIgeneral |
| Sonar | 208 | 60 | 2 | UCIgeneral |
| Pima | 768 | 8 | 2 | UCIgeneral |
| Automobile | 159 | 25 | 6 | UCIgeneral |
| Leukemia | 27 | 7130 | 2 | UCIdisease |
| SRBCT | 83 | 2308 | 4 | UCIdisease |
| Colon cancer | 62 | 2000 | 2 | UCIdisease |
| Breast cancer | 699 | 30 | 2 | UCIdisease |
| Diabetes | 768 | 8 | 2 | UCIdisease |
| Lung | 73 | 325 | 7 | UCIdisease |
| Lymphoma | 45 | 4026 | 2 | UCIdisease |
| Cleveland | 297 | 13 | 5 | UCIgeneral |
| NIC | 60 | 9712 | 9 | UCIgeneral |
| Seed | 210 | 7 | 3 | UCIgeneral |
| Wpbc | 198 | 34 | 2 | UCIdisease |
| German | 1000 | 20 | 2 | UCIgeneral |
| KDD99 | 900 | 42 | 6 | Network |
| TUIDS | 561 | 19 | 3 | Network |

4.2.1 Analysis on UCI Datasets
The proposed KNN-DK gives very high classification accuracy on most UCI datasets. From the experimental analysis, we observe that our method can perform significantly better than the traditional k-nn method on all the compared UCI datasets except TicTacToe. On Zoo, Accute1, Accute2, Hayes Roath and Iris datasets, KNN-DK shows more than 90% classification accuracy. In comparison to the traditional k-nn method and other counterparts such as Naive Bayes, Decision Tree and SVM, the performance of our method is significantly better as shown in Table 2.
Table 2 Comparison of KNN-DK with other classification methods

| Dataset | NB ± SD | DT ± SD | SVM ± SD | KNN ± SD | KNN-DK ± SD |
|---|---|---|---|---|---|
| Zoo | 0.8527 ± 0.01649 | 0.7950 ± 0.0350 | 0.08200 ± 0.1300 | 0.9700 ± 0.0500 | 0.9885 ± 0.0110 |
| Accute1 | 0.9967 ± 0.0060 | 1 ± 0 | 1 ± 0 | 0.9475 ± 0.0507 | 1 ± 0 |
| Accute2 | 0.9783 ± 0.0335 | 1 ± 0 | 0.9967 ± 0.0057 | 0.9892 ± 0.0182 | 0.9961 ± 0.0087 |
| Hayes Roath | 0.5446 ± 0.0965 | 0.7854 ± 0.0764 | 0.3323 ± 0.1391 | 0.7104 ± 0.0992 | 0.7151 ± 0.0413 |
| House Voter | 0.9417 ± 0.0367 | 0.9583 ± 0.0307 | 0.9474 ± 0.0354 | 0.9196 ± 0.0478 | 0.9436 ± 0.0117 |
| Iris | 0.9600 ± 0.0416 | 0.9447 ± 0.0508 | 0.7213 ± 0.0973 | 0.9587 ± 0.0455 | 0.9747 ± 0.0108 |
| TicTacToe | 0.5854 ± 0.0392 | 0.8536 ± 0.0291 | 0.5998 ± 0.0428 | 0.7532 ± 0.0365 | 0.6834 ± 0.0803 |
| StateheartLog | 0.6943 ± 0.0488 | 0.6170 ± 0.0573 | 0.6898 ± 0.0492 | 0.5820 ± 0.0502 | 0.7129 ± 0.0309 |
| Liver | 0.5785 ± 0.0579 | 0.6515 ± 0.0579 | 0.6529 ± 0.0596 | 0.6179 ± 0.0644 | 0.7636 ± 0.0158 |
| Glass | 0.4881 ± 0.0794 | 0.6852 ± 0.0734 | 0.5386 ± 0.0728 | 0.7262 ± 0.0691 | 0.9015 ± 0.0327 |
| Ecoli | 0.9494 ± 0.0225 | 0.8848 ± 0.0140 | 0.9682 ± 0.0329 | 0.9247 ± 0.0351 | 0.9358 ± 0.1400 |
| CMC | 0.4687 ± 0.0360 | 0.5137 ± 0.0312 | 0.3768 ± 0.0558 | 0.4610 ± 0.0373 | 0.6843 ± 0.0131 |
| Colon cancer | 0.6350 ± 0.1773 | 0.7217 ± 0.1447 | 0.8250 ± 0.1073 | 0.7983 ± 0.1323 | 0.9426 ± 0.0406 |
| Breast cancer | 0.9405 ± 0.0218 | 0.9273 ± 0.0239 | 0.9727 ± 0.0175 | 0.9173 ± 0.0319 | 0.9446 ± 0.0114 |
| Esophageal cancer | 0.997 ± 0.006 | 0.950 ± 0.082 | 0.960 ± 0.068 | 0.927 ± 0.236 | 0.953 ± 0.0206 |
| Automobile | 0.7620 ± 0.0092 | 0.7520 ± 0.0792 | 0.7183 ± 0.0426 | 0.5373 ± 0.0960 | 0.7497 ± 0.0111 |
| Leukemia | 0.6386 ± 0.1329 | 0.5714 ± 0.1300 | 0.6371 ± 0.1637 | 0.5514 ± 0.1289 | 0.7192 ± 0.0541 |
| SRBCT | 0.9625 ± 0.0570 | 0.8412 ± 0.1143 | 0.9963 ± 0.0067 | 0.9113 ± 0.0790 | 0.9927 ± 0.0100 |
| Sonar | 0.7000 ± 0.07210 | 0.7175 ± 0.0764 | 0.7445 ± 0.0756 | 0.8195 ± 0.0604 | 0.9030 ± 0.0372 |
| Pima | 0.7341 ± 0.0398 | 0.7037 ± 0.0425 | 0.7570 ± 0.0334 | 0.6833 ± 0.440 | 0.7957 ± 0.0084 |
| Diabetes | 0.7329 ± 0.0458 | 0.7120 ± 0.0350 | 0.7553 ± 0.0347 | 0.6804 ± 0.0364 | 0.7998 ± 0.0045 |
| Lung | 0.8629 ± 0.0900 | 0.5800 ± 0.1589 | 0.8171 ± 0.1311 | 0.8357 ± 0.1037 | 0.9902 ± 0.0140 |
| Lymphoma | 0.9325 ± 0.995 | 0.7400 ± 0.1685 | 0.9175 ± 0.1155 | 0.7775 ± 0.1400 | 0.9752 ± 0.0146 |
| Wine | 0.9547 ± 0.0414 | 0.9059 ± 0.0505 | 0.985 ± 0.021 | 0.7612 ± 0.0846 | 0.9411 ± 0.0041 |
| Seed | 0.9033 ± 0.0479 | 0.9186 ± 0.0484 | 0.9148 ± 0.0499 | 0.9071 ± 0.0465 | 0.9483 ± 0.0030 |
| Wpbc | 0.6295 ± 0.0802 | 0.6905 ± 0.0731 | 0.7221 ± 0.0678 | 0.6426 ± 0.0853 | 0.8460 ± 0.0171 |
| German | 0.7237 ± 0.0352 | 0.6991 ± 0.0328 | 0.7192 ± 0.0402 | 0.6156 ± 0.0372 | 0.7948 ± 0.0551 |
| KDD99 | 0.8753 ± 0.4000 | 0.9941 ± 0.0410 | 0.9403 ± 0.0428 | 0.9305 ± 0.0823 | 0.9885 ± 0.0026 |
| CorrectedKDD99 | 0.4013 ± 0.0941 | 0.9941 ± 0.0410 | 0.9403 ± 0.0428 | 0.9305 ± 0.0823 | 0.9128 ± 0.093 |
| NSL-KDD99 | 0.7829 ± 0.0848 | 0.9710 ± 0.0126 | 0.9130 ± 0.0817 | 0.9138 ± 0.0172 | 0.8597 ± 0.0103 |
| TUIDS | 0.9868 ± 0.0143 | 0.9892 ± 0.0066 | 0.9870 ± 0.0212 | 0.9793 ± 0.0082 | 0.9937 ± 0.0036 |

4.2.2 Analysis of Network Datasets
Similarly, in case of network intrusion datasets, our method gives very high classification accuracy. In the experimental results as shown in Table 2, we observe that KNN-DK gives better accuracy than k-nn, decision tree and SVM. Similarly, on the TUIDS dataset, the method gives highest (99%) accuracy which is much better than the traditional k-nn, SVM, decision tree and Naive Bayes.
4.2.3 Analysis of Gene Expression Datasets
In our experiments, we consider seven different gene expression and disease datasets, viz., colon cancer, breast cancer, leukemia, SRBCT, diabetes, lung and lymphoma. On the colon cancer, leukemia, SRBCT, diabetes, lung and lymphoma datasets, the classification accuracy of KNN-DK is better than those of k-nn, SVM, decision tree and Naive Bayes classifiers. On the breast cancer dataset, although KNN-DK yields higher classification accuracy than the traditional k-nn, decision tree and Naive Bayes classifiers, it gives a slightly lower classification accuracy than SVM. In addition to the individual domain-specific performance analysis of our method, we carried out an overall performance analysis between KNN and KNN-DK. Figure 3 shows the comparison between KNN and KNN-DK over 30 different datasets. We observe that the proposed KNN-DK algorithm gives better results than the traditional KNN classifier on 28 datasets. As shown in Table 3, KNN-DK yields better precision, recall and f-measure than the traditional k-nn method for most of the datasets. However, KNN-DK's performance is slightly lower than that of the traditional k-nn on the Zoo, Hayes Roath and TicTacToe datasets.
Fig. 3 Comparison of KNN and KNN-DK

Table 3 Performance comparison in terms of precision, recall and f-measure between KNN and KNN-DK

| Dataset | KNN Precision | KNN Recall | KNN f-measure | KNN-DK Precision | KNN-DK Recall | KNN-DK f-measure |
|---|---|---|---|---|---|---|
| Zoo | 0.9980 | 0.8962 | 0.9296 | 0.9700 | 0.9435 | 0.9556 |
| HayesRoath | 0.7042 | 0.6772 | 0.6591 | 0.6798 | 0.5102 | 0.5798 |
| Iris | 0.9621 | 0.9316 | 0.9466 | 0.9911 | 0.9325 | 0.9603 |
| TicTacToe | 0.7111 | 0.4828 | 0.5682 | 0.5412 | 0.5449 | 0.5430 |
| StateheartLog | 0.6106 | 0.6278 | 0.6107 | 0.7162 | 0.7107 | 0.7135 |
| Liver | 0.5450 | 0.5598 | 0.5424 | 0.8047 | 0.7748 | 0.7892 |
| CMC | 0.5619 | 0.4026 | 0.4670 | 0.7422 | 0.5358 | 0.6217 |
| Colon cancer | 0.7750 | 0.6712 | 0.9191 | 0.9062 | 0.9126 | 0.9005 |
| Breast cancer | 0.9194 | 0.9515 | 0.9344 | 0.9490 | 0.9361 | 0.9425 |
| Leukemia | 0.6648 | 0.6955 | 0.6152 | 0.7060 | 0.6885 | 0.6971 |
| SRBCT | 0.9378 | 0.7709 | 0.8168 | 0.9850 | 0.9752 | 0.9801 |
| Sonar | 0.8147 | 0.8685 | 0.8341 | 0.9027 | 0.8955 | 0.8991 |
| Pima | 0.7493 | 0.7500 | 0.7476 | 0.7793 | 0.7702 | 0.7747 |
| Diabetes | 0.7535 | 0.7551 | 0.7522 | 0.7817 | 0.7720 | 0.7768 |
| Lung | 0.8175 | 0.4618 | NaN | 0.9959 | 0.9464 | 0.9695 |
| Lymphoma | 0.8333 | 0.8017 | NaN | 0.9808 | 0.9844 | 0.9826 |
| Wine | 0.8659 | 0.5865 | 0.6833 | 0.9682 | 0.9036 | 0.9348 |
| Seed | 0.8819 | 0.8317 | 0.8418 | 0.9315 | 0.8974 | 0.9141 |
| Wpbc | 0.7389 | 0.7938 | 0.7584 | 0.9087 | 0.7934 | 0.8456 |
| German | 0.7244 | 0.7243 | 0.7232 | 0.7917 | 0.7037 | 0.7450 |

4.3 Discussion
From the experimental results, we observe that the proposed KNN-DK method yields better classification accuracy compared to the traditional k-nn method that considers a fixed value of k. Without using a user input value for k, the proposed method gives significantly better classification accuracy compared to the traditional k-nn classifier. A major issue related to the class imbalance problem in datasets, often faced by the traditional k-nn with fixed k value (received as input), has been successfully addressed here.
5 Conclusion and Future Work
In this chapter, we have presented a classifier called KNN-DK, which is a modified version of the traditional k-nn classifier. The main advantage of KNN-DK over the traditional k-nn is that KNN-DK does not require k as user input. For the traditional k-nn algorithm, choosing the k value is challenging, and the accuracy of the classifier depends heavily on the value of k as well as on the distribution of data in the datasets. Moreover, the traditional k-nn classifier cannot handle the class imbalance problem of a dataset. The proposed KNN-DK has been found to predict the class label of a test object very accurately, based on a dynamically computed k value, for three distinct domains. Based on our experimental analysis, we conclude that the proposed KNN-DK algorithm outperforms the traditional k-nn and other competing methods by significant margins. As future work, we are implementing a parallel version of the KNN-DK method, which will be evaluated using various big datasets.
References
1. X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, S.Y. Philip et al., Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)
2. S. Zhang, KNN-CF approach: incorporating certainty factor to KNN classification. IEEE Intell. Inform. Bull. 11(1), 24–33 (2010)
3. Y. Qin, S. Zhang, X. Zhu, J. Zhang, C. Zhang, Semi-parametric optimization for missing data imputation. Appl. Intell. 27(1), 79–88 (2007)
4. X. Zhu, S. Zhang, Z. Jin, Z. Zhang, Z. Xu, Missing value estimation for mixed-attribute data sets. IEEE Trans. Knowl. Data Eng. 23(1), 110–121 (2011)
5. S. Zhang, X. Li, M. Zong, X. Zhu, D. Cheng, Learning k for kNN classification. ACM Trans. Intell. Syst. Technol. (TIST) 8(3), 43 (2017)
6. J. Maillo, S. Ramírez, I. Triguero, F. Herrera, kNN-IS: an iterative spark-based design of the k-nearest neighbors classifier for big data. Knowl.-Based Syst. 117, 3–15 (2017)
7. N. Papernot, P. McDaniel, Deep k-nearest neighbors: towards confident, interpretable and robust deep learning (2018)
8. A.S. Arefin, C. Riveros, R. Berretta, P. Moscato, GPU-FS-kNN: a software tool for fast and scalable KNN computation using GPUs. PLoS ONE 7(8), e44000 (2012)
Identification of Emotions from Sentences Using Natural Language Processing for Small Dataset
Saurabh Sharma, Alok Kumar Tiwari, and Dinesh Kumar

Abstract Emotions are a significant part of human nature. The study of emotion can lead research to automatically analyze the sentiment of either a sentence or a human being. Many researchers have worked in the area of Natural Language Processing (NLP), and especially on sentiment analysis. However, sentiment can be studied more effectively if the emotion expressed in a sentence is identified and newer symbols such as emojis are taken into consideration. Most studies are done with a sufficient amount of data. When a new problem arises, however, little data is available initially, and achieving good accuracy with little data is a herculean task. In this chapter, the emotion expressed in sentences is identified from a small amount of data, which also contains some of the symbols used in modern communication.
Keywords Natural language processing · Emotion detection · Machine learning
1 Introduction
Emotions play a significant role in our day-to-day life. When a newspaper or story is read, it hits our brain's neural system and another emotion is generated. In this way, a sentence or a set of text is directly or indirectly responsible for the creation of emotion. A data scientist needs to discover a way to detect or identify what a sentence wants to tell. In other words, identifying the emotion of a sentence is an essential thing to study. New trends in the area of data science, i.e., machine learning and especially deep learning, can be used to detect the patterns through which the nature of a sentence can be identified. Machine learning is a way to make machines learn through their experience with datasets given as input, whereas deep learning deals with the learning process through neural networks. This chapter presents a study of the emotions an English sentence can express. Six classes of emotion are identified for this study, namely neutral, worry, sadness, happiness, love, and anger. In other words, it is a modified sentiment analysis problem posed as multi-class classification.

S. Sharma (✉) · A. K. Tiwari · D. Kumar, KIET Group of Institutions, Delhi-NCR, Ghaziabad, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_3
2 Background
This section reviews previous research and the research gaps found in those works.

2.1 Previous Related Works
Munezero et al., in their work, described the differences amongst emotion, affect, sentiment, opinion, and feeling. They gave a schematic structure of all five linguistic terms, and some suggestions were given for their efficient detection [1]. Xu et al. proposed a deep learning approach for emotion detection in medical data. In this study, a multi-modal emotional fatigue system was designed. A convolutional neural network-based autoencoder was used to extract multiple features from ECG. The model achieved more than 85% accuracy [2]. Batbaatar et al. proposed a semantic-emotion neural network, which was able to use both semantic and emotional information from various word representations. It is a hybrid network in which a type of RNN was used to store context, and a CNN was used to extract features. They evaluated it on six emotions, and the model provided better results [3]. Cui et al. did a study on medical text classification, in which the main objective was to fine-tune a pre-existing machine learning model to get better results. In their research, they used regular expression-based text classifiers, which utilized machine-generated regular expressions and machine learning models for text classification [4]. Lima et al. applied text mining to detect the phase or stage of the technological development of multiple photovoltaic panels [5]. Ali and his team worked on the classification of Urdu short texts and developed infix stripping and stemming rules for the Urdu language [6]. Wang et al. used multiple variants of scope-based CNN to classify text in a group of multiple datasets. The study was about scope-based information retrieval and parallel training of various datasets [7]. Some researchers also worked on bi-level feature extraction mechanisms for text mining purposes for railways [8]. Liu et al. worked on sentiment analysis tasks for consumer health- and medical-related social media articles [9].
Rashid et al. proposed a hybrid variant of inverse document frequency and fuzzy C-means clustering for the classification of biomedical documents [8], whereas Samant et al. proposed an improvement of term weighting schemes in order to classify text in the vector space model [10].
2.2 Gaps Identified in Previous Studies
The previous studies were done either for sentiment analysis in a specific area, such as medicine or railways, or for a specific language. Such studies can be made more effective if new communication methods like emojis are taken into account. Also, the emotion of a sentence can be detected in general, so that any sentence can be classified into a class. All of these studies had sufficient datasets available. But what if a study is to be done at a very early stage of a problem, when only a small amount of data is available? Research should therefore be done to find the classifiers that perform well with a small amount of data too [11–13].
3 Proposed Methods
3.1 Traditional Machine Learning Models
Some well-known machine learning algorithms are applied as state-of-the-art baselines in order to get more accurate results. They are described as follows.

Naive Bayes Classifier (state of the art). It is the well-known probabilistic classifier based on Bayes' theorem. Multiple studies in the area of traditional sentiment analysis have used this model, so it is used for analysis in this study. It is based on conditional probability, which is calculated as follows:
P(M | N) = P(N | M) P(M) / P(N)
where N is an independent variable or event and M represents a class. P(M | N) is the posterior probability of N belonging to class M, P(N | M) is the likelihood of N when the class is M, P(M) is the prior information of class M, and P(N) is the evidence of the independent variable N.

Support Vector Classifier. In this method, a set of hyperplanes is predicted in order to separate the population into classes. To classify the population in the multidimensional space of text data, the first task is to transform chunks of data into vectors and then use word frequency as a feature for the study. It is also used as a state-of-the-art method.

Decision Tree Classifier. This classifier uses three types of nodes, namely root, hidden, and terminal nodes. The input taken from the data source is passed through multiple test cases or conditions and fed into hidden nodes until it gets classified into one of the determined classes.

Random Forest Classifier. In this classifier, multiple decision trees operate as an ensemble. The class prediction from every tree is treated as a vote, and the class with the maximum number of votes is taken as the final prediction.
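The voting step described for the random forest classifier can be sketched in a few lines of Python. The function name and the example labels are hypothetical; real random forest implementations may instead average class probabilities across trees, so this shows only the plain majority vote described above.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees.

    Ties are resolved in favor of the label that appears first,
    which is how Counter.most_common orders equal counts.
    """
    if not tree_predictions:
        raise ValueError("at least one prediction is required")
    return Counter(tree_predictions).most_common(1)[0][0]


# Predictions of five hypothetical trees for one sentence.
print(majority_vote(["worry", "neutral", "worry", "sadness", "worry"]))
```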
3.2 Deep Learning Models
Deep learning is an extended part of machine learning. It consists of multiple sequential models, which can be used to predict multiple performance metrics. Initially, normal and hybrid convolutional neural networks were tested, but they achieved very low accuracy due to the small amount of data. Some of the models used in the study are as follows.

Long Short-Term Memory (LSTM). It is a type of RNN that uses the experience gained from previous data. It has three kinds of gates: input gate, forget gate, and output gate. The new data entering through the input gate is filtered by the forget gate, where redundant data is removed, and, using a nonlinear function, the final output is generated through the output gate.

Gated Recurrent Unit (GRU). It uses only two gates, an update gate and a reset gate. The update gate controls how much of the previous state is carried forward, and the reset gate controls how much of the previous state is used when forming the new candidate state.
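For reference, one common formulation of the GRU gates mentioned above is given below. The chapter does not state its equations, so the notation is an assumption: x_t is the input at step t, h_t the hidden state, W, U and b are learned parameters, σ is the logistic sigmoid and ⊙ is the element-wise product. Some libraries swap the roles of z_t and 1 − z_t in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new state)}
\end{aligned}
```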
4 Methodology
4.1 Workflow
Being an NLP problem, the task needs multiple steps to analyze the sentences. The first step is the data collection or creation step. Then, the data is preprocessed and cleaned. After that, various classifiers based on machine learning and deep learning are applied, and performance metrics are calculated. The schematic diagram is given in Fig. 1.
Fig. 1 Workflow of analysis
4.2 Description
Data Collection. The data from the target HR AI hackathon-2018, provided by the Binary Fountain organization, was taken. It is a comparatively small dataset for the study but was enough to do multiple analyses. The shape of the data is (2200, 3): one column is used for indexing, and the other two hold the review, which is a sentence, and the corresponding emotion derived from the review sentence. The composition of the dataset is given in Table 1 and shown in Fig. 2.

Data Preprocessing and Cleaning. To make the analysis less complex and more efficient, the following preprocessing steps are taken.

Target mapping. The labels are initially in text form; to make the analysis easier, the labels are mapped to constant numbers.

Table 1 Reviews and their count

| Type of reviews | Value count |
|---|---|
| Neutral | 622 |
| Worry | 586 |
| Sadness | 376 |
| Happiness | 338 |
| Love | 270 |
| Anger | 8 |
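The target mapping step can be sketched as follows. The particular integer assigned to each emotion is hypothetical, since the chapter does not list the constants it used; only the six label names come from Table 1.

```python
# Hypothetical label-to-integer mapping; the chapter does not give its constants.
LABELS = ["neutral", "worry", "sadness", "happiness", "love", "anger"]
LABEL_TO_ID = {label: index for index, label in enumerate(LABELS)}


def encode_labels(labels):
    """Map text labels to integers, ignoring case and surrounding spaces."""
    return [LABEL_TO_ID[label.strip().lower()] for label in labels]


print(encode_labels(["Neutral", "worry ", "Anger"]))
```

Keeping the mapping in one dictionary makes it easy to invert when predictions have to be reported as emotion names again.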
Stop words filtering. Stop words, i.e., the most common words in English, and especially the non-negative ones, are filtered out, considering that the share of negative data in the dataset is very small.

Cleaning. In this step, symbols such as apostrophes and special characters, extra spacing, emails, URLs, elongated misspellings (e.g., "soooo" is converted into "so"), and other punctuation are either transformed or filtered out from the dataset. Also, emojis, unique words, titles, etc. are classified into a positive or negative sense.

Stemming. To get the root word from a given word, Porter stemming is used.

Lemmatization. To bring context to the word, the WordNet Lemmatizer is used.

Term Frequency–Inverse Document Frequency. The term frequency gives the number of times a word occurs in a document, whereas the inverse document frequency down-weights words that occur in many documents of the corpus. tf-idf is used for weighting words according to their importance. Here, it is applied to the reviews or sentences in the dataset, and the cleaned data is then merged into a data frame with the tf-idf features.

Synthetic Minority Oversampling Technique. It is abbreviated as SMOTE. In this technique, the minority class is oversampled by selecting similar records and altering them with the help of their difference to neighboring records. In the study, some classes have very little data for training, so it is used for them.

Feature Extraction and Classification. In this phase, the features that the model will use for the analysis are extracted from the data and then fed into the classifier algorithm for training and testing. The classifier assigns the data to the classes given by the user according to its algorithm. In deep learning models, these two phases are both considered part of classification.

Training–Testing. This phase consists of training the algorithms on the training dataset and then testing them on the testing dataset.

Performance Metrics. In this phase, performance metrics like accuracy, recall, etc. are calculated to check whether the algorithm works properly on the given data.
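The tf-idf weighting used in the preprocessing steps can be illustrated with a plain Python sketch. It uses the unsmoothed textbook definition (relative term frequency times the logarithm of the inverse document fraction); this is an assumption, since the chapter does not state which variant or library it used, and libraries such as scikit-learn apply a smoothed idf by default. The function name and whitespace tokenization are hypothetical simplifications.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


reviews = ["i love this", "i hate this"]
for row in tf_idf(reviews):
    print(row)
```

Words that appear in every review, such as "i" and "this" here, receive a weight of zero, while the words that distinguish the reviews keep a positive weight.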
5 Experiments
5.1 Naive Bayes Classifier
Three variants of the naive Bayes classifier are used. The normal naive Bayes provided 72% accuracy with a highest cross-validation score of 67%. Then, treating the features as binary features, Bernoulli NB is used, which provided 66% accuracy with a 72% cross-validation score. Finally, treating the features as event probabilities, multinomial NB is used, which provided 60% accuracy with a 58% cross-validation score. The results are shown in Figs. 3, 4, and 5.

Fig. 2 Composition of the data
Fig. 3 Results from naive Bayes classifier
5.2 Support Vector Classifier Support vector classifier was not able to predict efficiently and provided classification results in only two classes with 50% accuracy and 48% validation. The result is given in Fig. 5.
5.3 Decision Tree Classifier
This classifier provided fairly satisfactory results and predicted all the classes. However, the highest accuracy achieved was 90% with a validation accuracy of 70%, a gap that is a sign of overfitting. Figure 6 shows the predicted classes.
5.4 Random Forest Classifier
Finally, a random forest classifier is used. It predicted 5 or 6 of the classes, with an overall accuracy of 91% and a highest validation accuracy of 81% (Fig. 7).

Fig. 4 Results from Bernoulli NB classifier
Fig. 5 Results from Support Vector Classifier
Fig. 6 Results from decision tree classifier
Fig. 7 Results from Random Forest Classifier
Fig. 8 Results from CNN Classifier
Fig. 9 Results from LSTM Classifier
Fig. 10 Results from GRU classifier
5.5 Convolutional Neural Network (Normal and Hybrid)
Due to the small amount of training data, the CNN was not able to perform properly and reached only 30% accuracy with 23% validation accuracy. A hybrid version with an RNN layer provided 29% accuracy with 28% validation accuracy. The result is shown in Fig. 8.
5.6 Long Short-Term Memory

The LSTM reached 63% accuracy with a 35% validation score. The LSTM network consisted of five layers: one input and embedding layer, one LSTM layer, two dropout layers, and one dense output layer. It was fine-tuned for this problem. The results are shown in Fig. 9.
Table 2 Performance scores and classes predicted of various models

Model name | Highest accuracy (%) | Validation accuracy (%) | Classes
Naive Bayes (all variants) | 60–72 | 58–72 | 05
Support vector machine | 50 | 48 | 02
Decision tree | 90 | 70 | 06
Random forest | 91 | 81 | 05–06
CNN (normal and hybrid) | 29–30 | 23–28 | 01–02
LSTM | 63 | 35 | 05
GRU | 52 | 32 | 05
5.7 Gated Recurrent Unit

The GRU reached 52% accuracy with a 32% validation score. The GRU network consisted of five layers: one input and embedding layer, one GRU layer, two dropout layers, and one dense output layer. It was then fine-tuned for the same problem. The results are shown in Fig. 10.
6 Results

Overall, the random forest achieved the best accuracy score, which is attributed to its voting mechanism. The next best performers were the RNNs (LSTM and GRU), owing to their sequential mechanisms. The other models failed because of the small number of samples in the dataset. In addition, several classifiers never predicted the anger class because very little training data was available for it. The results and the classes predicted are summarized in Table 2.
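One way to read the reported scores is to look at the gap between the highest accuracy and the validation score, since a large gap is the overfitting signal mentioned for the decision tree. The sketch below only rearranges the numbers reported in Sect. 5; the tuple layout and the helper name gap are assumptions made for illustration.

```python
# (model, highest accuracy %, validation accuracy %) as reported in Sect. 5
scores = [
    ("Naive Bayes (normal)", 72, 67),
    ("Bernoulli NB", 66, 72),
    ("Multinomial NB", 60, 58),
    ("Support vector classifier", 50, 48),
    ("Decision tree", 90, 70),
    ("Random forest", 91, 81),
    ("CNN", 30, 23),
    ("CNN (hybrid)", 29, 28),
    ("LSTM", 63, 35),
    ("GRU", 52, 32),
]

def gap(row):
    """Accuracy minus validation score; a large positive gap hints at overfitting."""
    _, accuracy, validation = row
    return accuracy - validation

best_validation = max(scores, key=lambda row: row[2])
largest_gap = max(scores, key=gap)

for name, accuracy, validation in sorted(scores, key=gap, reverse=True):
    print(f"{name:28s} acc={accuracy:3d} val={validation:3d} gap={accuracy - validation:+d}")
```

Read this way, the random forest has the highest validation score, while the recurrent models show the widest gaps between the two scores.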
7 Conclusion

This study shows that the random forest classifier predicted the classes better than any other classifier used. The other classifiers that gave relatively good results could also perform well, but they would need more data, and this study used a relatively small dataset. As a result, the other classifiers either overfit or gave less accurate results. Previous studies used relatively large datasets, and the scores obtained here are lower because of the relatively small dataset. The study also suggests that, for small datasets, random forest can perform better in this type of analysis.
8 Limitations and Future Scope

The study used a dataset that is highly non-homogeneous. It can be extended with a comparatively large dataset in which the emotions are uniformly distributed. The use of other classifiers may also give better results.

Acknowledgements The research was carried out under the affiliation of the KIET Group of Institutions, Ghaziabad, India. The authors acknowledge and convey their gratitude to the Target HR AI Hackathon community and the Binary Fountain organization, whose dataset from the 2018 challenge is partially used in this study.
Glossary

CNN Convolutional Neural Network
LSTM Long Short-Term Memory
GRU Gated Recurrent Unit
SVM Support Vector Machine
NB Naïve Bayes
SMOTE Synthetic Minority Oversampling Technique
TF-IDF Term Frequency-Inverse Document Frequency
Comparison and Analysis of RNN-LSTMs and CNNs for Social Reviews Classification Suraj Bodapati, Harika Bandarupally, Rabindra Nath Shaw, and Ankush Ghosh
Abstract This chapter presents and compares the results of simple and efficient deep learning models for sentiment analysis and text classification. Natural language processing is a broad domain that helps solve many day-to-day tasks, and sentiment analysis falls under it. A typical sentiment analysis task classifies the opinions expressed in a text as positive, negative, or neutral. This chapter employs two models: one built using recurrent neural networks (RNNs) with long short-term memory networks (LSTMs), and the other using convolutional neural networks (CNNs). In the first model, an RNN-LSTM captures the semantic and syntactic relationships between the words of a sentence with the help of word2vec. In the second model, one-dimensional convolutional neural networks learn structure in paragraphs of words and exploit the technique's invariance to the specific position of features. The IMDB movie reviews dataset is used for sentiment analysis in both models, and the results are compared. Both models yielded excellent results.

Keywords Deep learning · Artificial neural networks · Convolutional neural networks · Recurrent neural networks · Long short-term memory (LSTMs) · Word2vec

S. Bodapati JPMorgan Chase and Co., New York, USA
H. Bandarupally Computer Science Engineering, Chaitanya Bharathi Institute of Technology, Hyderabad, India
R. N. Shaw Department of Electronics and Communication Engineering, Galgotias University, Greater Noida, India
A. Ghosh (B) The Neotia University, Sarisha, West Bengal, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_4
1 Introduction

Two eminent deep learning models are recurrent neural networks (RNNs) and convolutional neural networks (CNNs). These models are leveraged in multiple domains such as speech recognition, machine translation, and identifying the emotional tone or subjective information in a given sentence (sentiment classification of text). Long short-term memory networks (LSTMs) are a special kind of RNN: they share the basic RNN architecture but can store and access information more efficiently than regular RNNs. Sentiment classification of text is a classic topic in deep learning. Text classification plays a major role in many real-world applications; for instance, spam filtering, document search, Web search, and predicting the polarity of a sentence can all be performed with it. Meanwhile, the CNN model made major advances and breakthroughs in image classification possible, and it forms the core of current computer vision systems [1–5]. CNNs can also be applied to natural language processing (NLP) tasks with interesting results. For sentence classification tasks, the input sentence is represented as a matrix instead of image pixels. The rows of this matrix are word vectors, that is, word embeddings such as word2vec.

There are different ways to perform text classification; machine learning algorithms such as logistic regression and SVMs are at the center of these applications. These algorithms need the text input to be represented as a vector [6, 7]. A traditional method for fixed-length vector representation is bag-of-words, in which a text, in the form of a sentence, is represented as the multiset of its words. The frequency of each word can be used as a feature for training the classifier. The main downside of this method is that it ignores grammar and word order, so different sentences can have the same vector representation. Another widespread method is n-grams. The n-grams model takes word order into account within short contexts, but it suffers from data sparsity and high dimensionality. These traditional, simple methods have limitations for many NLP tasks. In this chapter, CNNs and RNN-LSTMs are proposed to overcome the drawbacks of the traditional methods and algorithms, to improve the accuracy of sentiment analysis, and finally to compare the results of the two deep learning methods.
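The word-order limitation of bag-of-words, and the way n-grams partly recover it, can be seen on two short sentences. This is a minimal sketch using only the Python standard library; the example sentences and function names are assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(sentence):
    """Multiset of words: word order is discarded."""
    return Counter(sentence.lower().split())

def ngrams(sentence, n=2):
    """Multiset of n consecutive words: local word order is kept."""
    words = sentence.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

a = "the movie was good not bad"
b = "the movie was bad not good"

print(bag_of_words(a) == bag_of_words(b))  # True: same words, same representation
print(ngrams(a) == ngrams(b))              # False: the bigrams differ
```

The price of the extra information is dimensionality: the number of possible bigrams grows with the square of the vocabulary size, which is the sparsity problem noted above.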
2 Concepts

2.1 Artificial Neural Networks

Artificial neural networks (ANNs) are an essential part of the deep learning field and are used in a multitude of applications. An ANN tries to emulate the way in which the human brain functions. It consists of an interconnected web of nodes, called neurons, and edges that connect the neurons [8, 9]. Typically, an ANN receives an input signal, performs a series of computations on it, and generates an output that can then be leveraged in many problem-solving techniques. An ANN generally comprises three layers: an input layer followed by hidden and output layers. Because the hidden and output layer neurons are able to classify, an ANN can be perceived as a collection of classifiers.
2.2 Recurrent Neural Networks

The recurrent neural network is a very powerful deep learning technique. It is used in many applications such as speech recognition, language translation, stock prediction, image recognition, and image captioning, where it produces noteworthy results. RNNs are neural networks that are good at modeling and processing sequential data for prediction. They inherently rely on a concept called 'sequential memory', a mechanism that the human brain generally employs to recognize sequential patterns and that RNNs try to replicate. An RNN contains a looping mechanism that lets information flow from one step to the next. This information is the hidden state, which represents all the previous inputs. However, RNNs suffer from 'short-term memory', caused by the vanishing gradient problem: they have trouble retaining information from earlier steps and therefore cannot learn long-term dependencies.
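The looping mechanism and the vanishing gradient can both be illustrated with a one-dimensional RNN. The sketch below is a toy under stated assumptions (scalar state, tanh activation, hand-picked weights w_h and w_x), not the network used later in the chapter.

```python
import math

def run_rnn(inputs, w_h=0.5, w_x=1.0, h0=0.0):
    """Scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t).

    Returns the final hidden state and d h_T / d h_0, the sensitivity of
    the last state to the initial one (chain rule through every step).
    """
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
        grad *= w_h * (1.0 - h * h)   # derivative of this step w.r.t. the previous state
    return h, grad

for steps in (1, 10, 50):
    h, grad = run_rnn([0.1] * steps)
    print(steps, round(h, 4), grad)
```

Each step multiplies the gradient by a factor smaller than one, so the influence of early inputs on the final state shrinks geometrically with the sequence length.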
2.3 Long Short-Term Memory Networks (LSTMs)

LSTMs function essentially like RNNs, but they are capable of learning long-term dependencies. They alleviate the vanishing and exploding gradient problem and give much better accuracy than plain RNNs. An LSTM is obtained by adding a few additional interactions to the RNN cell. Typically, an LSTM is made of one cell state and three gates: the input, output, and forget gates, each with its own set of weights. The gates are tensor operations that learn what information should be added to or removed from the hidden state, which mitigates the issues caused by extremely high or low gradient values.
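The gate interactions can be written out for a single scalar cell. The following sketch uses the standard LSTM equations; the parameter layout (a dictionary of weight triples) and the chosen values are assumptions made for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps each of "f", "i", "o", "g" to a (w_x, w_h, b) triple
    (hypothetical parameter layout chosen for this sketch).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))        # forget gate: how much old cell state to keep
    i = sigmoid(pre("i"))        # input gate: how much new candidate to write
    o = sigmoid(pre("o"))        # output gate: how much cell state to expose
    g = math.tanh(pre("g"))      # candidate values
    c = f * c_prev + i * g       # new cell state
    h = o * math.tanh(c)         # new hidden state
    return h, c

# Forget gate wide open, input gate nearly closed: the cell keeps its memory.
params = {"f": (0.0, 0.0, 20.0), "i": (0.0, 0.0, -20.0),
          "o": (0.0, 0.0, 20.0), "g": (1.0, 0.0, 0.0)}
h, c = lstm_step(x=0.3, h_prev=0.0, c_prev=0.7, p=params)
print(c)   # stays very close to the previous cell state 0.7
```

Because the cell state is updated additively and scaled by a learned forget gate, information can be carried across many steps without being squashed at every step as in the plain RNN.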
2.4 Convolutional Neural Networks (CNNs)

The CNN is an essential building block for image recognition, but it can also be used for natural language processing problems. CNNs differ from traditional feed-forward neural networks in that they treat their input as spatial data. The input layer of a CNN is fed the pixels of an image in the form of an array. Like an ANN, a CNN has hidden layers in which the computations and feature extraction are carried out [10, 11]. These layers apply matrix filters and carry out convolution and pooling operations to identify specific patterns that may be hidden within the image. The final fully connected layer ultimately recognizes the target in the image.
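The convolution and pooling operations are easiest to follow in one dimension, which is also the form used for text later in the chapter. This is a minimal pure-Python sketch; the signal and the filter are hypothetical values chosen so that the result can be checked by hand.

```python
def conv1d(signal, kernel):
    """'Valid' 1D convolution as used in CNNs (cross-correlation, no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(signal, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

signal = [0, 0, 1, 1, 1, 0, 0, 1, 0]
edge_filter = [-1, 1]            # hypothetical filter that responds to rising edges

features = conv1d(signal, edge_filter)
print(features)                  # [0, 1, 0, 0, -1, 0, 1, -1]
print(max_pool1d(features))      # [1, 0, 0, 1]
```

The filter fires wherever the pattern it encodes appears, regardless of position, and pooling keeps the strongest response in each window while halving the length.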
2.5 Word2Vec

Different models take different forms of data as input; for example, a logistic regression model takes quantifiable features, and reinforcement learning models take reward signals [12]. Similarly, the models in this chapter take word vectors instead of strings, which makes common operations such as dot products and backpropagation possible. These word vectors are constructed to represent a word together with its context and meaning. The vector representation of a word is called a word embedding. The Word2vec model is used to create word embeddings. It places words that occur in the same contexts, with similar semantics and connotations, close to each other in the same vector space [13]. The model is trained on a huge dataset of sentences and outputs one vector for each distinct word in the corpus. This output of the Word2vec model is known as the embedding matrix. For example, applying Word2vec to the entire Wikipedia corpus converts it into an embedding matrix. The Word2vec model is trained by taking each sentence in the dataset and sliding a fixed-size window over it to predict the window's center word from the surrounding words. The model uses a loss function and an optimization procedure to generate the vector for each word.
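The sliding-window step can be sketched as the generation of (context, center word) training examples. The code below shows only this data preparation step, assuming a symmetric window; the function name and the window size are hypothetical, and no embedding is actually trained.

```python
def cbow_examples(sentence, half_window=1):
    """Return (context_words, center_word) pairs from a sliding window."""
    words = sentence.lower().split()
    examples = []
    for i, center in enumerate(words):
        left = words[max(0, i - half_window):i]
        right = words[i + 1:i + 1 + half_window]
        examples.append((left + right, center))
    return examples

for context, center in cbow_examples("the movie was great"):
    print(context, "->", center)
```

A Word2vec-style model would then adjust the word vectors so that each context predicts its center word well, which is what pulls words used in similar contexts close together.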
3 Model Implementation

3.1 Methodology Using RNN with LSTM

In the first step, a pre-trained Word2vec model, trained by Google on the Google News dataset, is loaded. It contains three million word vectors, each of dimensionality three hundred. Two data structures are imported: a Python list of words and an embedding matrix that holds all the word vector values. Once the vectors are available, an input sentence is taken and its vector representation is constructed. TensorFlow's embedding lookup function can be used to generate the word vectors (Fig. 1).

Fig. 1 Flowchart representation of RNN and LSTM model

The next step is to create the ids matrix for the dataset. The IMDB movie review dataset is used for training the RNN-LSTM. It contains twenty-five thousand movie reviews: twelve thousand five hundred positive and twelve thousand five hundred negative. The Matplotlib library is used to visualize the data as a histogram. From Fig. 2, it can be inferred that the average number of words in a file is 250. To convert the data into an ids matrix, the movie training set and the integerized dataset are loaded to obtain a 25,000 × 250 matrix (25,000 files, with an average of 250 words per file).

Fig. 2 Visualizing data with a histogram

Once the ids matrix is created, work on the RNN can begin. Hyper-parameters such as the batch size, the number of training iterations, the number of LSTM units, and the number of output classes are defined. Two placeholders are specified, one for the inputs to the network and one for the labels. Each label is [1, 0] or [0, 1], for a positive or a negative training example, respectively. Each row of the integer input placeholder holds the integer representation of one training example in the batch. The word vectors are then obtained with TensorFlow functions, which output a 3D tensor of dimensionality batch size × maximum sequence length × word vector dimension. This 3D tensor can be visualized by treating each data point in the integer input tensor as the D-dimensional vector that it refers to (as shown in Fig. 3).

Fig. 3 Integerized inputs and their labels

Once the data is in the required format, it is fed as input to the LSTM network. The tf.nn.rnn_cell.BasicLSTMCell function is used to specify the number of LSTM units in the model. The number of LSTM units is a hyper-parameter that is tuned to find the optimum value. The LSTM cell is wrapped in a dropout layer to keep the network from overfitting [3]. The next step is to pass the LSTM cell and the 3D tensor of input data to a function called tf.nn.dynamic_rnn. This function unrolls the entire network and creates the path along which data flows through the RNN. Multiple LSTM cells can be stacked on top of each other so that the last hidden state vector of the first LSTM feeds into the second LSTM, and so on. Stacking helps the model retain long-term dependency information, but it also adds parameters to the model and hence probably increases the training time, the need for additional training examples, and the chance of overfitting. The first output of the dynamic RNN is the final hidden state vector. This vector is reshaped and then multiplied by the final weight matrix, and a bias term is added, to obtain the final output values. The next step is to define the correct-prediction and accuracy metrics used to track the performance of the network. A prediction is counted as correct when the index of the maximum of the two output values matches the training label. A standard cross-entropy loss with a softmax layer on top of the final prediction values is defined. The Adam optimizer is used with its default learning rate of 0.001.
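Three of the steps above, building a fixed-length row of the ids matrix, the softmax cross-entropy loss, and the arg-max correctness check, can be sketched without TensorFlow. This is an illustration in plain Python, not the chapter's code; the toy vocabulary, the use of id 0 for both padding and unknown words, and all function names are assumptions.

```python
import math

MAX_SEQ_LENGTH = 250   # average review length reported above

def to_ids(words, word_index, max_len=MAX_SEQ_LENGTH, unknown_id=0):
    """Map words to integer ids, then truncate or zero-pad to max_len."""
    ids = [word_index.get(w, unknown_id) for w in words[:max_len]]
    return ids + [0] * (max_len - len(ids))

def softmax(logits):
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot_label):
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))

def is_correct(logits, one_hot_label):
    """Correct when the arg-max of the outputs matches the arg-max of the label."""
    return logits.index(max(logits)) == one_hot_label.index(max(one_hot_label))

word_index = {"great": 1, "movie": 2}         # hypothetical toy vocabulary
row = to_ids("a great movie".split(), word_index)
print(len(row), row[:4])                      # 250 [0, 1, 2, 0]

positive = [1, 0]                             # label convention from the text
print(is_correct([2.0, -1.0], positive))      # True
print(cross_entropy([2.0, -1.0], positive))
```

Averaging is_correct over a batch gives the accuracy metric, and averaging cross_entropy gives the loss that the optimizer minimizes.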
3.1.1 Hyper-parameter Tuning

Selecting the right values for the hyper-parameters (number of LSTM units, word vector size, etc.) is a key part of training deep neural networks [3]. For recurrent neural networks and LSTMs in particular, the important factors include the size of the word vectors and the number of LSTM units.

• Number of LSTM units: This parameter depends on the average length of the input text. A higher number of LSTM units lets the model retain more information and increases its expressive power, but the model then takes longer to train and becomes computationally expensive. The number of LSTM units used in this work is 64.
• Word vector size: Typical word vector dimensions range from fifty to three hundred. A larger vector can encode more information about the word but also increases the computational expense of the model.
• Optimizer: The optimizer used in the model is Adam, which is popular because of its adaptive learning rate. The optimal learning rate varies with the choice of optimizer.

3.1.2 Training and Testing the Model
The first step of training is to load a batch of reviews and their associated labels. The session's run function is then called with two arguments. The first, the fetches, lists the values to be computed; it includes the optimizer, since that is the component that minimizes the loss function. The second is the data structure that supplies values for all of the placeholders. This loop is repeated for a set number of iterations. For testing, movie reviews from the test set, which the model has not been trained on, are loaded into the model. The model predicts the sentiment of these reviews and generates results for each test batch (Fig. 4).
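What the optimizer does on each training iteration can be sketched for a single scalar parameter. The code below follows the standard Adam update rule with its usual default constants; the toy quadratic loss and the function name adam_minimize are assumptions made for illustration and have nothing to do with the chapter's actual loss.

```python
import math

def adam_minimize(grad_fn, theta, steps, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar function with Adam, given its gradient grad_fn."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy loss (theta - 3)^2 with gradient 2 * (theta - 3); not the chapter's loss.
theta = adam_minimize(lambda th: 2 * (th - 3), theta=0.0, steps=5000)
print(theta)
```

Because the step is normalized by the running gradient magnitude, each early update moves the parameter by roughly the learning rate, whatever the scale of the gradient. This is the adaptive behaviour that makes the default rate of 0.001 a reasonable starting point.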
3.2 Methodology Using One-Dimensional Convolutional Neural Networks

In this model, the Keras library and its methods are used to build the convolutional neural network [2]. Keras supports one-dimensional convolution and pooling through the Conv1D and MaxPooling1D classes, respectively.
Fig. 4 Output showing the accuracy of the model
Figure 5 shows an overview of the entire process using the CNN. In the first step, the packages and classes needed to build the model are imported, and the random number generator is initialized with a constant seed. The next step is to load and prepare the IMDB dataset. After this, the CNN model is defined: a Conv1D layer is inserted after the embedding input layer. The convolutional layer has thirty-two feature maps and reads the embedded word representation three vector elements of the word embedding at a time [4].

Fig. 5 Flowchart for the CNN model

The convolutional layer is followed by a 1D max-pooling layer with a length and stride of two, which halves the size of the feature maps from the convolutional layer. The model is fitted using the model.fit() function in Keras. Running the model prints a summary of the network structure.

Fig. 6 Network structure of CNN

From Fig. 6, it can be inferred that the convolutional layer preserves the dimensionality of the embedding input layer: 32-dimensional inputs with a maximum of 500 words. The pooling layer then compresses this representation by halving it. Finally, the accuracy with which the model classifies the polarity of each review is obtained. The accuracy achieved by running the CNN model for ten iterations is 86.80% (Fig. 7).
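The layer shapes described above can be checked with a little arithmetic. The sketch below computes output lengths with the usual rules for "same" and "valid" padding; the assumption that the convolution uses "same" padding is inferred from the statement that the sequence length is preserved, and the helper names are hypothetical.

```python
import math

def conv1d_length(length, kernel, stride=1, padding="same"):
    """Output length of a 1D convolution under the usual padding rules."""
    if padding == "same":
        return math.ceil(length / stride)
    return (length - kernel) // stride + 1        # "valid": no padding

def pool1d_length(length, pool=2, stride=None):
    """Output length of 1D max pooling without padding."""
    stride = stride or pool
    return (length - pool) // stride + 1

max_words, embed_dim, filters = 500, 32, 32       # values reported in the text
after_conv = conv1d_length(max_words, kernel=3, padding="same")   # assumed padding
after_pool = pool1d_length(after_conv, pool=2)

print((max_words, embed_dim))     # (500, 32)
print((after_conv, filters))      # (500, 32)
print((after_pool, filters))      # (250, 32)
```

With "valid" padding the convolution would instead shorten the sequence to 498 positions, so the preserved length of 500 is what points to "same" padding.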
4 Results and Discussions

4.1 Comparison Between the Performance of CNN and RNN-LSTM

Fig. 7 Accuracy of CNN model

Attribute | CNN | LSTM-RNN
Accuracy for ten epochs | 86.80% | 84.16%
Accuracy for two epochs | 88.56% | 80.16%
Time taken for training dataset | Within 2 min | Approximately 5 min
Optimizer used | Adam | Adam
Dataset used | IMDB | IMDB
5 Conclusion

Deep learning is a successful field that has advanced in diverse ways with the help of convolutional networks, recurrent neural networks, backpropagation, LSTM networks, and many more. The models presented in this chapter are evidence of its strength. The chapter covered the components involved in creating, training, and testing an RNN-LSTM model and a CNN model to classify movie reviews, and compared the results produced by the two models. As the comparison table shows, for the same number of epochs, RNNs are slow and fickle to train; they take more time to train than one-dimensional CNNs. For text classification, feature detection, such as searching for angry words, is also important. One-dimensional convolutional neural networks gave the better result, as shown by the accuracy of the model.
References

1. A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in Proceedings of NIPS 2012 (2012)
2. S. Paul, J.K. Verma, A. Datta, R.N. Shaw, A. Saikia, Deep learning and its importance for early signature of neuronal disorders, in 2018 4th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, pp. 1–5 (2018). https://doi.org/10.1109/ccaa.2018.8777527
3. S. Mandal, S. Biswas, V.E. Balas, R.N. Shaw, A. Ghosh, Motion prediction for autonomous vehicles from lyft dataset using deep learning, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, pp. 768–773 (2020). https://doi.org/10.1109/iccca49541.2020.9250790
4. Y. Kim, Convolutional neural networks for sentence classification, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar (Association for Computational Linguistics, 2014), pp. 1746–1751
5. M. Kumar, V.M. Shenbagaraman, R.N. Shaw, A. Ghosh, Predictive data analysis for energy management of a smart factory leading to sustainability, in Innovations in Electrical and Electronic Engineering, ed. by M. Favorskaya, S. Mekhilef, R. Pandey, N. Singh. Lecture Notes in Electrical Engineering, vol. 661 (Springer, Singapore, 2021). https://doi.org/10.1007/978-981-15-4692-1_58
6. S. Mandal, V.E. Balas, R.N. Shaw, A. Ghosh, Prediction analysis of idiopathic pulmonary fibrosis progression from OSIC dataset, in 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, pp. 861–865 (2020). https://doi.org/10.1109/gucon48875.2020.9231239
7. H. Sak, A. Senior, F. Beaufays, Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition (2014). ArXiv e-prints
8. M. Hermans, B. Schrauwen, Training and analysing deep recurrent neural networks, in Advances in Neural Information Processing Systems, pp. 190–198 (2013)
9. R.J. Williams, J. Peng, An efficient gradient-based algorithm for online training of recurrent network trajectories. Neural Comput. 2, 490–501 (1990)
10. Y. Belkhier, A. Achour, R.N. Shaw, Fuzzy passivity-based voltage controller strategy of grid-connected PMSG-based wind renewable energy system, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India, pp. 210–214 (2020). https://doi.org/10.1109/iccca49541.2020.9250838
11. R.N. Shaw, P. Walde, A. Ghosh, IOT based MPPT for performance improvement of solar PV arrays operating under partial shade dispersion, in 2020 IEEE 9th Power India International Conference (PIICON), SONEPAT, India, pp. 1–4 (2020). https://doi.org/10.1109/piicon49524.2020.9112952
12. T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, S. Khudanpur, Recurrent neural network based language model. Interspeech 2, 3 (2010)
13. S. Hochreiter, J. Schmidhuber, Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
Blockchain-Based Model for Expanding IoT Device Data Security Anand Singh Rajawat, Romil Rawat, Kanishk Barhanpurkar, Rabindra Nath Shaw, and Ankush Ghosh
Abstract Healthcare institutions around the world are becoming organised, user-centred structures. However, managing vast volumes of data, such as IoT device data or the medical records of each person, increases the human resources required and the security threats. To address these challenges, healthcare IoT improves the quality of patient care and lowers costs by allocating medical services efficiently. Using blockchain technology, we provide a safety mechanism for healthcare data that computes a SHA256 hash of each piece of data, so that every modification or alteration of the data can be detected; improved simulation results confirm the security that the blockchain methodology gives. Our proposed algorithm applies the SHA256 hash algorithm to every block, and since the blocks are verified by every node, the data cannot be altered by a nefarious source. In this research work, we propose a blockchain-based model with a consensus protocol and the SHA256 hash algorithm that addresses the priorities of verifiability, suitability, extensiveness, uniqueness, sturdiness, and coercion resistance.
A. S. Rajawat · R. Rawat Department of Computer Science Engineering, Shri Vaishnav Vidyapeeth Vishwavidyalaya, Indore, India
K. Barhanpurkar Department of Computer Science and Engineering, Sambhram Institute of Technology, Bengaluru, Karnataka, India
R. N. Shaw Department of Electronics and Communication Engineering, Galgotias University, Greater Noida, India
A. Ghosh (B) The Neotia University, Sarisha, West Bengal, India

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_5
Keywords Blockchain · IoT · Health care · Data security
1 Introduction

In India, IoT devices are being adopted at a very fast rate across many domains. According to a Deloitte market research report, government and private organisations around the world are rapidly increasing their spending on IoT-based software and hardware, from $726 billion in 2019 to $1.1 trillion in 2023. To sustain such enormous growth, an IoT stack must be designed, protocols defined, and proper layers developed for an infrastructure that provides services to IoT devices. Most IoT implementations currently rely on a centralised server-client model, in which devices communicate across the Internet with cloud servers [1]. Although this solution works properly today, the expected growth implies that new paradigms may have to be proposed. Decentralised architectures have been proposed in the past. Given the relation between security and privacy, blockchain, in which multiple peers and technologies can assist, is becoming the next phase in the delivery of cloud services. Blockchain technologies can monitor, organise, and execute transactions and retrieve information from a wide variety of devices, allowing applications to be developed that do not need a centralised cloud. Many businesses, such as IBM, Amazon, and Google, go further and speak of blockchain as a tool to democratise the future IoT, because it addresses crucial obstacles to rapid adoption today. Many IoT solutions are still expensive because of the costs of deploying and managing centralised clouds and server farms. If the vendor does not build such a system, the burden of maintenance falls on the middle layer, which is a concern because millions of mobile devices (for example, those sending health information alerts) need daily software updates to be distributed. Figure 1 shows how medical alerts for a medical emergency are generated when blockchain technology is applied to secure the IoT device data.

It is also hard for IoT adopters to trust technology partners, grant them access to and control of their systems, and let them gather and evaluate consumer data. Therefore, secrecy and anonymity should be at the core of future IoT implementations.

This chapter is structured as follows. The second section discusses the basic techniques, definition, types, and requirements of blockchain. The third section discusses related work. The fourth section presents the proposed model for a medical blockchain and the experimental results. Finally, conclusions and proposals for further research are provided.
Fig. 1 IoT-based healthcare system with blockchain
2 Blockchain Basics

A blockchain is like a public database [2] whose data is shared across a peer-to-peer network. Solving the double-spend problem, a long-standing problem in digital finance, is considered to be Bitcoin's biggest contribution. The approach suggested by Bitcoin consists of having the majority of mining nodes reach consensus before the actual transactions are added to the blockchain. Although the idea of blockchain emerged as a cryptocurrency platform, it is not essential to establish a cryptocurrency in order to use a blockchain and create decentralised applications.
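The tamper-evidence that the chapter relies on comes from chaining SHA-256 hashes: each block's hash covers its own data and the hash of the previous block. The following is a minimal sketch using Python's standard hashlib module, not the authors' implementation; the block layout, the all-zero starting hash, and the sample heart-rate readings are assumptions made for illustration, and no consensus protocol is modelled.

```python
import hashlib
import json

def block_hash(index, data, previous_hash):
    """SHA-256 over the block contents, including the previous block's hash."""
    payload = json.dumps({"index": index, "data": data, "prev": previous_hash},
                         sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_chain(records):
    chain, previous_hash = [], "0" * 64          # all-zero hash for the first block
    for index, data in enumerate(records):
        current = block_hash(index, data, previous_hash)
        chain.append({"index": index, "data": data,
                      "prev": previous_hash, "hash": current})
        previous_hash = current
    return chain

def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    previous_hash = "0" * 64
    for block in chain:
        if block["prev"] != previous_hash:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev"]):
            return False
        previous_hash = block["hash"]
    return True

# Hypothetical sensor readings, purely for illustration.
chain = build_chain([{"heart_rate": 72}, {"heart_rate": 75}, {"heart_rate": 140}])
print(is_valid(chain))                 # True
chain[1]["data"]["heart_rate"] = 60    # tamper with a stored reading
print(is_valid(chain))                 # False
```

An attacker who changes one reading would have to recompute the hash of that block and of every later block, and in a real network would also have to convince the other nodes to accept the rewritten chain.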
2.1 Working of Blockchain
To use a blockchain, a P2P network comprising all the nodes that use it is needed first. Each network node receives two keys: a public key, which other users use to encrypt messages sent to the node, and a private key, which enables the node to read those messages. Two different keys are thus used, one to encrypt and the other to decrypt.
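The two-key idea can be illustrated with a toy RSA key pair. This is only a sketch under stated assumptions: the chapter does not name a cipher, the primes, exponent and the sample reading are hypothetical values chosen for readability, and keys this small offer no real security.

```python
# Toy RSA key pair (assumed cipher; the chapter does not specify one).
# The numbers are deliberately tiny and are NOT secure.
p, q = 61, 53                    # two small primes (hypothetical values)
n = p * q                        # modulus shared by both keys
phi = (p - 1) * (q - 1)          # Euler's totient of n
e = 17                           # public exponent, coprime with phi
d = pow(e, -1, phi)              # private exponent: inverse of e mod phi (Python 3.8+)

public_key = (e, n)              # published to every peer
private_key = (d, n)             # kept secret by the node


def encrypt(message, key):
    """Encrypt an integer message (0 <= message < n) with the public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(cipher, key):
    """Recover the integer message with the private key."""
    exponent, modulus = key
    return pow(cipher, exponent, modulus)


reading = 72                                 # e.g. a heart-rate value sent to the node
cipher = encrypt(reading, public_key)        # any peer can do this
recovered = decrypt(cipher, private_key)     # only the key owner can do this
```

Anyone holding `public_key` can produce `cipher`, but only the holder of `private_key` recovers `reading`, which is the asymmetry the paragraph above describes.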
2.2 Types of Blockchain
There are various types of blockchains. They can be differentiated into public and private blockchains, as well as permissioned and permissionless blockchains.
2.3 Requirement of the Blockchain
Before moving to the specifics of IoT blockchain implementations, it must be stressed that a blockchain is not necessarily the right option for every IoT situation. To decide whether the use of a blockchain is appropriate, a developer should consider whether the IoT application requires the following features.
Decentralisation: IoT implementations require decentralisation when there is no trustworthy centralised framework. Many people, however, already implicitly trust certain firms, government bodies or banks, so a blockchain is not needed if such shared trust exists.
Payment system: some IoT applications need to let third parties carry out economic transactions, but many do not. Furthermore, economic transfers can also be made through conventional payment networks, although these typically require the payment of processing fees, and the banks or intermediaries need to be trusted.
Most IoT networks gather information that must be time-stamped and stored sequentially. However, such needs can easily be met with conventional databases, especially where confidentiality is ensured or attacks are rare.
3 Related Work
Ejaz and Anpalagan [3] discussed how the Internet of things (IoT) offers solutions to real-world challenges with the goal of improving the quality of living. Connectivity of physical objects via the Internet, through wired or wireless technologies, is the core of the IoT. Naqvi et al. [4] addressed the numerous methods suggested for developing blockchain in the digital health field, covering patient document interoperability, confidentiality, data exchange, data storage, and data retrieval using blockchain technologies in the IoHT. Using blockchain technologies makes data safer and improves interoperability. Bhawiyuga et al. [2] proposed the concept of an IoT-to-blockchain platform to incorporate a healthcare system. Liu et al. [5] noted that blockchains are secure by design and are an example of a highly secure, fault-tolerant distributed computing system. Blockchains are fundamentally resistant to data manipulation, since the data of any given block cannot be retroactively altered without modifying all the following blocks and without the complicity of the network. Dwivedi et al. [6] proposed an approach for providing protection and privacy for electronic medical records. Azbeg et al. [7] discussed an IoT and blockchain-based network framework to promote diabetes follow-up and to help patients manage it better themselves. Dwivedi et al. [6] also produced, in a safe and private manner, warnings relevant to authenticated healthcare providers. The advantages and practical challenges of blockchain-based IoT security methods are also described in that paper [6] (Table 1).

Table 1 Comparison of several studies carried out for data integration based on blockchain domain in various industries
S. No.  Author             Reference  Type of blockchain-based industries
1       Jamil et al.       [8]        Significant signs of health-related parameters in hospitals
2       Kim et al.         [9]        Enhancement in over-all security in healthcare industry (blockchain)
3       Feng et al.        [10]       Continuous growth of blockchain in agriculture industry
4       Al-madani et al.   [11]       Usage of blockchain in cyber-security platforms
5       Wei et al.         [1]        Usage of blockchain in cyber-security platforms
4 Proposed Model for Medical Blockchain
Consider a standard healthcare IoT scenario in which Alice has a variety of IoT devices, such as a thermostat and an intrusion detection system, connected to a database server. The proposed model comprises a number of end IoT devices connected through a blockchain system to the medical data centre. Data access and data management use cases are considered in this work. Alice should be able to view temperature data remotely, and IoT devices should be able to access one another depending on their access rights. In addition, these IoT devices should be able to store information in cloud storage on the basis of access policies. We first introduce the components before discussing the details of the model in Fig. 2 (Table 2).
• Health service monitor and provider: This comprises all the mobile devices present in a smart home. Because these devices do not have any unique identifier, access control is difficult to enforce. Therefore, the proposed model adds unique IDs, which are account addresses in the blockchain. Each address has a private key, so only those who hold the key can use the address to make transactions. Transactions are used for inter-device communication.
• Participant identification number: Participants (health workers and patients, connected through the blockchain structure) are able to identify any single device using a standard blockchain framework. Data supplied and fed into the system is permanent and uniquely identifies the actual information supplied by a device.
• Consensus protocol and SHA256 secure the communication: Information and communication can be protected if processed as blockchain transactions.
Fig. 2 Proposed model-based medical blockchain model using SHA256 hash algorithm for securing IoT device data
A. S. Rajawat et al.
Table 2 Sensor-based health datasets, alerts and parameters
S. No.  Data sets                     IoT technology       Alerts               Parameters use
1       Sensor-based health datasets  Heart rate sensor    High heart rates     Heart rates
2                                     Blood pressure sensor  High blood pressure  Blood pressure
3                                     Visual sensor        Visual problem       Visual
4                                     EMG sensor           EMG problem          EMG
5                                     ECG sensor           ECG problem          ECG
6                                     Pressure sensor      High pressure        Pressure
7                                     Temperature sensor   High temperature     Temperature
8                                     Respiration sensors  High respiration     Respiration
To secure communication between devices, the blockchain (consensus protocol and SHA256) can treat exchanges of device messages as transactions authenticated by smart contracts.
(a) Consensus protocol: The blockchain method is a novel technology for handling medical IoT data. In this method, every node in the system holds the entire information rather than only the information of its own area. So even if one node is destroyed, the data can be retrieved from other nearby nodes. Power is distributed equally across all the nodes, and the absence of a central system prevents the network from being destroyed. Even when attackers decide to attack a single node, the data can easily be retrieved from the other nodes. Changing the information in a particular node will therefore not change the medical records (visits, EMR data, patient reports, IoT data), and an attack on a particular node can be easily identified. All the nodes are protected by block information, and in order to enter a particular node a person needs permission from all the block information, which is difficult to obtain for illegal transactions.
(b) Cryptographic: Every block is connected to the others with the help of a timestamp. For example, the previous and the next block are connected to each other with the help of the SHA256 hash algorithm, which protects all the blocks through this linkage and safeguards their integrity against attackers.
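The linkage of blocks through SHA256 can be sketched as follows. This is a minimal illustration, not the chapter's implementation: the block fields (`index`, `timestamp`, `data`, `prev_hash`), the JSON serialisation and the sample sensor readings are assumptions made for the example.

```python
import hashlib
import json


def block_hash(block):
    """SHA256 digest of a block's canonical JSON form (assumed serialisation)."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, data, timestamp):
    """Add a block that stores the hash of its predecessor."""
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({
        "index": len(chain),
        "timestamp": timestamp,
        "data": data,
        "prev_hash": prev_hash,
    })


def is_valid(chain):
    """Every block must reference the current hash of the block before it."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


# Hypothetical sensor readings stored as consecutive blocks
chain = []
append_block(chain, {"sensor": "heart_rate", "value": 72}, timestamp=1)
append_block(chain, {"sensor": "temperature", "value": 37.1}, timestamp=2)
append_block(chain, {"sensor": "heart_rate", "value": 75}, timestamp=3)
valid_before = is_valid(chain)

# Altering an earlier block breaks the link held by its successor
chain[1]["data"]["value"] = 39.5
valid_after = is_valid(chain)
```

Because each block stores the digest of the one before it, changing block 1 invalidates the reference kept in block 2. Note that this check alone does not protect the last block, which has no successor yet; in a real network the consensus among peers covers that case.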
Table 3 Comparison of several studies carried out for data security based on blockchain domain in various industries
S. No.  Author              Reference  Blockchain-based technique for data security
1       Li et al.           [12]       Intrusion detection system
2       Park et al.         [13]       Integrity management system
3       Nakasumi et al.     [14]       Supply management system
4       Hema Kumar et al.   [15]       Wireless sensor networks (WSN)
Algorithm 1
Input: Blockchain network (BCN), with the selected IoT sensor data and patient records to be encrypted
Output: The trusted IoT sensor data nodes
Step 1. Store the IoT device data in the blockchain-based database (SHA256 hash algorithm); every event is traced by the health service monitor and provider and the users (patients).
Step 2. Register every IoT device using the SHA256 hash algorithm.
Step 2.1. Register the IoT device with the BCN using the SHA256 hash algorithm so that it can connect to any health service monitor and provider and users (patients).
Step 2.2. Check that every IoT device is registered with the blockchain. After registration, the BCN allows the user to use the services.
Step 3. Compute the trust factor on the basis of the device records.
Step 4. Assign every IoT device a rating between 1 and 10. If the rating of a device is higher than 9, the device is trusted. If the rating is less than 9, all its records need to be checked and the device must re-register with the BCN (Table 3).

Python and Scala, H2O-based blockchain technology and learning libraries will be used for the development and experimentation of the project. Resources such as IBM Watson Studio, Anaconda Python, and Scala and Python libraries will be used for this procedure. Training will be carried out on NVIDIA GPUs, using probabilistic modelling and a deep learning approach for disease prediction. Using Apache Zeppelin, the data will be retrieved from the database and a dashboard created that shows the data in graphs, lines and tables in real time [16–18]. Based on the proposed system architecture, data from monitors (IoT) can be analysed in real time and an alert sent to care providers, so that they know instantly about changes in a patient's condition [19–21].
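The registration and rating steps of Algorithm 1 can be sketched as below. Several details are assumptions, since the chapter does not specify them: the class name `BlockchainNetwork`, deriving the device ID from a SHA256 digest of a serial number, and computing the trust factor as the share of verified events scaled to the 1 to 10 range are all hypothetical choices for illustration. A rating of exactly 9, which the algorithm leaves open, is treated here as not trusted.

```python
import hashlib

TRUST_THRESHOLD = 9  # Step 4: a rating above 9 (on a 1-10 scale) marks a device as trusted


class BlockchainNetwork:
    """In-memory stand-in for the BCN: registered device IDs and their event records."""

    def __init__(self):
        self.records = {}  # device ID -> list of booleans (True = event verified)

    def register(self, serial_number):
        # Step 2: derive the device ID from a SHA256 digest (assumed scheme)
        device_id = hashlib.sha256(serial_number.encode("utf-8")).hexdigest()
        self.records[device_id] = []
        return device_id

    def log_event(self, device_id, verified):
        # Steps 1 and 2.2: only registered devices may store events
        if device_id not in self.records:
            raise KeyError("device is not registered with the BCN")
        self.records[device_id].append(bool(verified))

    def rating(self, device_id):
        # Step 3: hypothetical trust factor, scaled to the 1-10 range
        events = self.records[device_id]
        if not events:
            return 1.0
        return 1.0 + 9.0 * sum(events) / len(events)

    def classify(self, device_id):
        # Step 4: trusted above the threshold, otherwise review and re-register
        if self.rating(device_id) > TRUST_THRESHOLD:
            return "trusted"
        return "re-register"


bcn = BlockchainNetwork()
ecg = bcn.register("ecg-sensor-001")          # hypothetical serial number
for verified in [True] * 19 + [False]:
    bcn.log_event(ecg, verified)
ecg_status = bcn.classify(ecg)
```

Keeping the threshold in one named constant makes it easy to experiment with stricter or looser trust policies than the one stated in Step 4.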
Experimental evaluation of the proposed and traditional methods has been carried out, and findings have been reported for different parameters. In the previous subsections of the performance analysis, the effects of operation and blockchain state parameters are discussed [22, 23]. The method behaved as planned, and the results of the suggested structure were positive for all output metrics and all healthcare data. In comparison, the precision of the suggested solution was close to 90%. After every fixed interval of time, the proposed algorithm measures the confidence of the other nodes (Fig. 3).
Fig. 3 Comparative analysis no. of authentication and number of the node
Fig. 4 Comparative analysis average authentication delay and number of receivers
Where protection tactics are deployed individually at the level of travel, understanding, and operation, it is very possible for intruders to brute-force a target (Fig. 4). The suggested system, however, retains the blockchain for the whole network, so that predicting the consensus protocol and SHA256 of all blocks (nodes) at once becomes quite difficult. It introduces a probabilistic authentication scheme in which the rightful node is identified by determining which node is trusted.
5 Conclusions and Future Work
The blockchain is expected to revolutionise IoT devices. Taking into account the problems described in this chapter, the convergence of these two innovations should be tackled. The enforcement of legislation is important to the integration of blockchain and the IoT as part of government technology. Such adoption may accelerate contact between patient, doctor and hospital. Consensus protocol and SHA256 will also play a vital role in incorporating IoT device data into the mining processes and in spreading ever more blockchains. However, a tension may emerge between trust in the data and encouraging the incorporation of embedded devices. Finally, beyond the scalability and storage capacity issues that affect all technologies, we will apply this approach in a real-time scenario to secure medical datasets.
References
1. P. Wei, D. Wang, Y. Zhao, S.K.S. Tyagi, N. Kumar, Blockchain data-based cloud data integrity protection mechanism. Future Gener. Comput. Syst. (2019). https://doi.org/10.1016/j.future.2019.09.028
2. A. Bhawiyuga, A. Wardhana, K. Amron, A.P. Kirana, Platform for integrating internet of things based smart healthcare system and blockchain network, in 2019 6th NAFOSTED Conference on Information and Computer Science (NICS) (2019). https://doi.org/10.1109/nics48868.2019.902379
3. W. Ejaz, A. Anpalagan, Dimension reduction for big data analytics in internet of things, in SpringerBriefs in Electrical and Computer Engineering, pp. 31–37 (2018). https://doi.org/10.1007/978-3-319-95037-2_3
4. M.R. Naqvi, M. Aslam, M.W. Iqbal, S. Khuram Shahzad, M. Malik, M.U. Tahir, Study of blockchain and its impact on internet of health things (IoHT): Challenges and opportunities, in 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (2020). https://doi.org/10.1109/hora49412.2020.9152846
5. W. Liu, S.S. Zhu, T. Mundie, U. Krieger, Advanced block-chain architecture for e-health systems, in 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom) (2017). https://doi.org/10.1109/healthcom.2017.8210847
6. A.D. Dwivedi, L. Malina, P. Dzurenda, G. Srivastava, Optimized blockchain model for internet of things based healthcare applications, in 2019 42nd International Conference on Telecommunications and Signal Processing (TSP) (Budapest, Hungary, 2019), pp. 135–139. https://doi.org/10.1109/tsp.2019.8769060
7. K. Azbeg, O. Ouchetto, S.J. Andaloussi, L. Fetjah, A. Sekkaki, Blockchain and IoT for security and privacy: A platform for diabetes self-management, in 2018 4th International Conference on Cloud Computing Technologies and Applications (Cloudtech) (Brussels, Belgium, 2018), pp. 1–5. https://doi.org/10.1109/cloudtech.2018.8713343
8. F. Jamil, S. Ahmad, N. Iqbal, D.-H. Kim, Towards a remote monitoring of patient vital signs based on IoT-based blockchain integrity management platforms in smart hospitals. Sensors 20(8), 2195 (2020). https://doi.org/10.3390/s20082195
9. S.-K. Kim, U.-M. Kim, J.-H. Huh, A study on improvement of blockchain application to overcome vulnerability of IoT multiplatform security. Energies 12(3), 402 (2019). https://doi.org/10.3390/en12030402
10. H. Feng, X. Wang, Y. Duan, J. Zhang, X. Zhang, Applying blockchain technology to improve agri-food traceability: A review of development methods, benefits and challenges. J. Cleaner Prod. 121031 (2020). https://doi.org/10.1016/j.jclepro.2020.121031
11. A.M. Al-madani, A.T. Gaikwad, IoT data security via blockchain technology and service-centric networking, in 2020 International Conference on Inventive Computation Technologies (ICICT) (2020). https://doi.org/10.1109/icict48043.2020.9112521
12. D. Li, Z. Cai, L. Deng et al., Information security model of block chain based on intrusion sensing in the IoT environment. Cluster Comput. 22, 451–468 (2019). https://doi.org/10.1007/s10586-018-2516-1
13. J. Park, E. Huh, Block chain based data logging and integrity management system for cloud forensics (2017)
14. M. Nakasumi, Information sharing for supply chain management based on block chain technology, in 2017 IEEE 19th Conference on Business Informatics (CBI) (Thessaloniki, 2017), pp. 140–149. https://doi.org/10.1109/cbi.2017.56
15. M. Hema Kumar, V. Mohanraj, Y. Suresh et al., Trust aware localized routing and class based dynamic block chain encryption scheme for improved security in WSN. J. Ambient Intell. Human Comput. (2020). https://doi.org/10.1007/s12652-020-02007-w
16. M. Banerjee, J. Lee, K.-K.R. Choo, A blockchain future for internet of things security: a position paper. Digit. Commun. Netw. 4(3), 149–160 (2018). https://doi.org/10.1016/j.dcan.2017.10.006
17. A.D. Dwivedi, G. Srivastava, A decentralized privacy-preserving healthcare blockchain for IoT. Sensors 19, 326 (2019). https://doi.org/10.3390/s19020326
18. G. Srivastava, J. Crichigno, S. Dhar, A light and secure healthcare blockchain for IoT medical devices, in 2019 IEEE Canadian Conference of Electrical and Computer Engineering (CCECE) (Edmonton, AB, Canada, 2019), pp. 1–5. https://doi.org/10.1109/ccece.2019.8861593
19. Y. Belkhier, A. Achour, R.N. Shaw, Fuzzy passivity-based voltage controller strategy of grid-connected PMSG-based wind renewable energy system, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA) (Greater Noida, India, 2020), pp. 210–214. https://doi.org/10.1109/iccca49541.2020.9250838
20. S. Mandal, S. Biswas, V.E. Balas, R.N. Shaw, A. Ghosh, Motion prediction for autonomous vehicles from Lyft dataset using deep learning, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA) (Greater Noida, India, 2020), pp. 768–773. https://doi.org/10.1109/iccca49541.2020.9250790
21. S. Mandal, V.E. Balas, R.N. Shaw, A. Ghosh, Prediction analysis of idiopathic pulmonary fibrosis progression from OSIC dataset, in 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON) (Greater Noida, India, 2020), pp. 861–865. https://doi.org/10.1109/gucon48875.2020.9231239
22. R.N. Shaw, P. Walde, A. Ghosh, IOT based MPPT for performance improvement of solar PV arrays operating under partial shade dispersion, in 2020 IEEE 9th Power India International Conference (PIICON) (SONEPAT, India, 2020), pp. 1–4. https://doi.org/10.1109/piicon49524.2020.9112952
23. M. Kumar, V.M. Shenbagaraman, R.N. Shaw, A. Ghosh, Predictive data analysis for energy management of a smart factory leading to sustainability, in Innovations in Electrical and Electronic Engineering, eds. by M. Favorskaya, S. Mekhilef, R. Pandey, N. Singh. Lecture Notes in Electrical Engineering, vol. 661 (Springer, Singapore, 2021). https://doi.org/10.1007/978-981-15-4692-1_58
Linear Dynamical Model as Market Indicator of the National Stock Exchange of India Prabhat G. Dwivedi
Abstract The nonlinear, non-stationary structure of Indian stock market data makes its study complex. We present a linear dynamical model (LDM) that uses latent and observed variables as n-dimensional vector time series with Gaussian noise. We estimate the parameters with the EM algorithm, using the Kalman filter for the expectation step (E-step) and a mathematical optimization technique for the maximization step (M-step). The results show that the model extracts some abstract structure of stock market data that helps in understanding price trends. LDM is better used as a market indicator than as a predictive model. The prediction accuracy of LDM is high except during major national and international events and during the COVID-19 period. Keywords Linear dynamical model · Expectation maximization algorithm · Kalman filter · Stock market
1 Introduction
Investors aim to determine the future stock price to make a profit. They use fundamental analysis for value investing [1, 2] and technical analysis for timing stock investments [3]. The fundamentals of a stock depend on true information. The cases in [4, 5] show that information about a company can be manipulated, which makes it difficult to find the intrinsic value of a stock. Chartists believe that the chart pattern of a stock contains all the information needed for price prediction [6]. But Harshad Mehta's securities scam and the dot-com bubble were speculation without any fundamental value [7, 8]. The efficient market hypothesis (EMH) assumes that the current stock price reflects all information and that it is not possible to profit from any information known to the public [9, 10]. Warren Buffett, well known for his value investing and his strategies for understanding business, contradicts EMH [11]. Traditional time series techniques for understanding the trend in stock prices are the different types of moving average and the autoregressive integrated moving average model. The advent of machine learning techniques such as support vector machines, principal component analysis, artificial neural networks, recurrent neural networks, and genetic algorithms has changed the study of financial instruments [12, 13]. There are also combined studies of text mining and machine learning techniques. The Kalman filter is a technique for separating noise from sequential observed data with high accuracy. It has applications in real-time prediction for global positioning systems, aircraft, spacecraft, the dynamic positioning of ships, and robotics [14, 15]. We aim to estimate the latent variable with the help of the observed variable for stock price analysis using the Kalman filter. We propose LDM to understand the behavior of the Indian stock market as a dynamic vector sequential model. Section 2 contains some basic definitions and results needed for LDM. We derive estimates for LDM using the EM algorithm in Sect. 3. In Sect. 4, we apply LDM to Indian stock market data, and we conclude with our observations of the model in Sect. 5.

P. G. Dwivedi (B) Department of Mathematics, Institute of Chemical Technology, Mumbai, India; Mithibai College of Arts, Chauhan Institute of Science, Amrutben Jivanlal College of Commerce and Economics, Mumbai, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_6
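2 Definitions and Results
Let $\mathbb{N}$ and $\mathbb{R}$ denote the sets of natural and real numbers, respectively. The $k$-dimensional normal density function [16] for a random vector $Y \in \mathbb{R}^k$ with mean vector $\mu$ and positive definite covariance matrix $\Sigma$ is given by
\[ \frac{1}{\sqrt{(2\pi)^k\,|\Sigma|}} \exp\left\{-\frac{1}{2}(Y-\mu)^\top \Sigma^{-1} (Y-\mu)\right\}. \]
If we have a joint Gaussian distribution of the vector variate
\[ Y = \begin{pmatrix} Y_a \\ Y_b \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix} \right), \]
where $\Sigma_{ba} = \Sigma_{ab}^\top$, then the conditional distributions are given by
\[ Y_a \mid Y_b \sim N\left(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(Y_b-\mu_b),\; \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ab}^\top\right), \]
\[ Y_b \mid Y_a \sim N\left(\mu_b + \Sigma_{ab}^\top\Sigma_{aa}^{-1}(Y_a-\mu_a),\; \Sigma_{bb} - \Sigma_{ab}^\top\Sigma_{aa}^{-1}\Sigma_{ab}\right), \]
and the marginal distributions are given by
\[ Y_a \sim N(Y_a \mid \mu_a, \Sigma_{aa}), \qquad Y_b \sim N(Y_b \mid \mu_b, \Sigma_{bb}). \]
Vectorization $\mathrm{vec} : \mathbb{R}^{m\times n} \to \mathbb{R}^{mn}$ is an isomorphism which converts a matrix $C \in \mathbb{R}^{m\times n}$ into a vector, that is,
\[ \mathrm{vec}(C) = [c_{11}, \ldots, c_{1n}, c_{21}, \ldots, c_{2n}, \ldots, c_{m1}, \ldots, c_{mn}]^\top, \]
where $c_{ij}$ is the $ij$th entry of $C$ [17]. Random variables $X$ and $Z$ are conditionally independent given a random variable $Y$ if and only if $P(X, Z \mid Y) = P(X \mid Y)\,P(Z \mid Y)$. This is denoted by $X \perp\!\!\!\perp Z \mid Y$. For the parameter estimates of the M-step, we use the following results from matrix differential calculus [18]:
\[ \frac{\partial\,\mathrm{Tr}(CD)}{\partial C} = D^\top, \qquad \frac{\partial\,\mathrm{Tr}(C^\top D)}{\partial C} = D, \]
\[ \frac{\partial\,\mathrm{Tr}(CDC^\top)}{\partial C} = C(D + D^\top), \qquad \frac{\partial \ln|C|}{\partial C} = (C^{-1})^\top. \]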
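The vectorization defined above stacks the rows of the matrix, which can be sketched in a few lines. The helper names `vec` and `unvec` and the sample matrix are illustrative only; note that many texts define vec by stacking columns instead, so the row-wise order here follows this chapter's definition.

```python
def vec(matrix):
    """Row-wise vectorization: [c11, ..., c1n, c21, ..., cmn]."""
    return [entry for row in matrix for entry in row]


def unvec(vector, n_cols):
    """Inverse map: rebuild the m x n matrix from its row-wise vector."""
    return [vector[i:i + n_cols] for i in range(0, len(vector), n_cols)]


C = [[1, 2, 3],
     [4, 5, 6]]            # a 2 x 3 example matrix
flat = vec(C)              # a vector with 2 * 3 = 6 entries
restored = unvec(flat, 3)  # the original matrix again
```

That `unvec(vec(C), n)` returns `C` is the concrete content of the statement that vec is an isomorphism between the two spaces.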
3 Linear Dynamical Model
A normalized $n$-dimensional vector time series is given by $Z_1, Z_2, \ldots, Z_K$. We assume that we can predict the future trend using historic data and that $Z_{k+1}$ depends on $Z_1, Z_2, \ldots, Z_k$. Hence, the joint probability distribution of the $K$ observations $Z_1, Z_2, \ldots, Z_K$ can be expressed as
\[ p(Z_1, Z_2, \ldots, Z_K) = \prod_{k=1}^{K} p(Z_k \mid Z_1, Z_2, \ldots, Z_{k-1}). \]
For each observation $Z_k$, we construct a state-space model by introducing a latent variable $Y_k$ (possibly of a different dimension) with normal distribution, such that $Y_{k-1}$ and $Y_{k+1}$ are conditionally independent given $Y_k$. The general form of the transition and emission distributions can be expressed as follows.
\[ Y_k = P\,Y_{k-1} + w_k, \quad \text{where } w_k \sim N(w_k \mid 0, \Sigma_Y), \tag{1} \]
\[ Z_k = Q\,Y_k + \upsilon_k, \quad \text{where } \upsilon_k \sim N(\upsilon_k \mid 0, \Sigma_Z). \tag{2} \]
We make the following assumption for the initial latent variable: $Y_1 = \mu_0 + w_0$, that is, $Y_1 \sim N(Y_1 \mid \mu_0, \Sigma_Y^{(0)})$. We estimate the model parameters $\Theta = \{P, \Sigma_Y, Q, \Sigma_Z, \mu_0, \Sigma_Y^{(0)}\}$ by the EM algorithm. In the E-step, we use the Kalman filter [14] to estimate the posterior marginals of the latent variables $Y_k$ by minimizing the error of imperfect information. We obtain a latent Markov model, as information is shared from $Z_1$ to $Z_k$. We approximate the unknown $\mu_k$, the mean over $Y_k$, with an unbiased prior $P\mu_{k-1}$ with covariance matrix $E[\xi_k \xi_k^\top] = \Gamma_{k-1}$, where
\[ \xi_k = \mu_k - P\mu_{k-1}, \qquad \Gamma_{k-1} = P\,\Sigma_Y^{(k-1)} P^\top + \Sigma_Y. \]
But with the new observation $Z_k$, we can update $\mu_k$. We also have $Z_k = Q\mu_k + \upsilon_k$ with $E[\upsilon_k] = 0$ and $E[\upsilon_k \upsilon_k^\top] = \Sigma_Z$. We also assume that the $\upsilon_k$ are uncorrelated with $\mu_{k-1}$ and $\mu_k$. We define the estimate of $\mu_k$ as
\[ \tilde{\mu}_k = C + G_k Z_k, \tag{3} \]
where the vector $C$ and the matrix $G_k$ are to be determined such that the estimate is unbiased with minimum variance. Hence, we have
\[ \begin{aligned} E[\mu_k] &= E[\tilde{\mu}_k], \\ P\mu_{k-1} &= E[C + G_k Z_k] = C + G_k E[Z_k] \\ &= C + G_k E[Q\mu_k + \upsilon_k] = C + G_k\left(Q\,E[\mu_k] + E[\upsilon_k]\right) \\ &= C + G_k Q P \mu_{k-1} + 0, \\ C &= (I - G_k Q)\,P\mu_{k-1}. \end{aligned} \tag{4} \]
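Hence, from (3) and (4), we get
\[ \tilde{\mu}_k = (I - G_k Q)\,P\mu_{k-1} + G_k Z_k, \qquad \tilde{\mu}_k = P\mu_{k-1} + G_k (Z_k - QP\mu_{k-1}). \tag{5} \]
Let $\tilde{\xi}_k = \mu_k - \tilde{\mu}_k$. We wish to minimize
\[ \begin{aligned} E[\tilde{\xi}_k^\top \tilde{\xi}_k] &= E[(\mu_k - \tilde{\mu}_k)^\top (\mu_k - \tilde{\mu}_k)] \\ &= E[\mathrm{Tr}((\mu_k - \tilde{\mu}_k)^\top (\mu_k - \tilde{\mu}_k))] && \text{(the trace of a scalar is the scalar)} \\ &= E[\mathrm{Tr}((\mu_k - \tilde{\mu}_k)(\mu_k - \tilde{\mu}_k)^\top)] && (\mathrm{Tr}(ST) = \mathrm{Tr}(TS)) \\ &= \mathrm{Tr}(E[(\mu_k - \tilde{\mu}_k)(\mu_k - \tilde{\mu}_k)^\top]) && \text{($E$ is a linear operator)} \\ &= \mathrm{Tr}(\Sigma_Y^{(k)}), \end{aligned} \]
where $\Sigma_Y^{(k)} = E[(\mu_k - \tilde{\mu}_k)(\mu_k - \tilde{\mu}_k)^\top]$. Using (5) and $Z_k = Q\mu_k + \upsilon_k$,
\[ \begin{aligned} \mu_k - \tilde{\mu}_k &= \mu_k - P\mu_{k-1} - G_k Z_k + G_k Q P \mu_{k-1} \\ &= \mu_k - P\mu_{k-1} - G_k (Q\mu_k + \upsilon_k) + G_k Q P \mu_{k-1} \\ &= \mu_k - P\mu_{k-1} - G_k Q (\mu_k - P\mu_{k-1}) - G_k \upsilon_k \\ &= (I - G_k Q)\,\xi_k - G_k \upsilon_k, \end{aligned} \]
where $\xi_k = \mu_k - P\mu_{k-1}$. Since $\mu_k$ and $\mu_{k-1}$ are not correlated with $\upsilon_k$, $\xi_k$ and $\upsilon_k$ are uncorrelated.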
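Writing $\Gamma_{k-1} = P\,\Sigma_Y^{(k-1)} P^\top + \Sigma_Y = E[\xi_k \xi_k^\top]$ for the prior covariance, we have
\[ \begin{aligned} \Sigma_Y^{(k)} &= E[(\mu_k - \tilde{\mu}_k)(\mu_k - \tilde{\mu}_k)^\top] \\ &= (I - G_k Q)\,E[\xi_k \xi_k^\top]\,(I - G_k Q)^\top + G_k E[\upsilon_k \upsilon_k^\top] G_k^\top \\ &= (I - G_k Q)\,\Gamma_{k-1}\,(I - G_k Q)^\top + G_k \Sigma_Z G_k^\top \\ &= \Gamma_{k-1} - G_k Q \Gamma_{k-1} - \Gamma_{k-1} Q^\top G_k^\top + G_k Q \Gamma_{k-1} Q^\top G_k^\top + G_k \Sigma_Z G_k^\top. \end{aligned} \]
Differentiating $\mathrm{Tr}(\Sigma_Y^{(k)})$ with respect to $G_k$, we get
\[ \frac{\partial\,\mathrm{Tr}(\Sigma_Y^{(k)})}{\partial G_k} = -2\,\Gamma_{k-1} Q^\top + 2\,G_k Q \Gamma_{k-1} Q^\top + 2\,G_k \Sigma_Z. \tag{6} \]
To find $G_k$, we equate (6) with 0 and get
\[ \Gamma_{k-1} Q^\top = G_k (Q \Gamma_{k-1} Q^\top + \Sigma_Z), \tag{7} \]
\[ G_k = \Gamma_{k-1} Q^\top (Q \Gamma_{k-1} Q^\top + \Sigma_Z)^{-1}. \tag{8} \]
The variance is minimum and its value is
\[ \begin{aligned} \Sigma_Y^{(k)} &= \Gamma_{k-1} - G_k Q \Gamma_{k-1} - \Gamma_{k-1} Q^\top G_k^\top + G_k Q \Gamma_{k-1} Q^\top G_k^\top + G_k \Sigma_Z G_k^\top \\ &= \Gamma_{k-1} - G_k Q \Gamma_{k-1} - \Gamma_{k-1} Q^\top G_k^\top + G_k (Q \Gamma_{k-1} Q^\top + \Sigma_Z) G_k^\top \\ &= \Gamma_{k-1} - G_k Q \Gamma_{k-1} - \Gamma_{k-1} Q^\top G_k^\top + \Gamma_{k-1} Q^\top G_k^\top && \text{(by (7))} \\ &= \Gamma_{k-1} - G_k Q \Gamma_{k-1} \\ &= (I - G_k Q)\,\Gamma_{k-1}. \end{aligned} \tag{9} \]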
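The recursion formed by the mean update (5), the gain (8) and the covariance update (9) can be sketched for the scalar case, where every matrix reduces to a number. This is an illustrative sketch only: the function names, the parameter values and the sample observations are hypothetical, and treating the first observation against the prior (mu0, var0) directly is an assumption about how the initial step is handled.

```python
def update(prior_mean, prior_var, z, q, sigma_z):
    """One measurement update in the scalar case."""
    gain = prior_var * q / (q * prior_var * q + sigma_z)   # gain, Eq. (8)
    mean = prior_mean + gain * (z - q * prior_mean)        # mean update, Eq. (5)
    var = (1.0 - gain * q) * prior_var                     # covariance update, Eq. (9)
    return mean, var, gain


def kalman_filter(observations, p, q, sigma_y, sigma_z, mu0, var0):
    """Filtered means and variances of the latent variable for each observation."""
    means, variances = [], []
    prior_mean, prior_var = mu0, var0          # prior for the first latent variable
    for z in observations:
        mean, var, _ = update(prior_mean, prior_var, z, q, sigma_z)
        means.append(mean)
        variances.append(var)
        prior_mean = p * mean                  # forward projection of the mean
        prior_var = p * var * p + sigma_y      # predicted covariance P*Sigma*P' + Sigma_Y
    return means, variances


# Hypothetical parameters and data, for illustration only
means, variances = kalman_filter(
    observations=[2.0, 2.0],
    p=1.0, q=1.0, sigma_y=0.0, sigma_z=1.0, mu0=0.0, var0=1.0,
)
```

With these settings each new observation pulls the estimate toward the data while the variance shrinks, which is the behaviour the minimum-variance derivation above predicts. A vector-valued version would replace the divisions by matrix inverses.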
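The steps in the Kalman filter can be interpreted as follows.
1. The mean $\mu_k$ over $Y_k$ is the forward projection, by the transition matrix $P$, of $\mu_{k-1}$, the mean over $Y_{k-1}$.
2. It predicts the observation $Z_k$ as $Q\mu_k = QP\mu_{k-1}$.
3. Hence, $\mu_k$ can be approximated by $P\mu_{k-1}$ with $Z_k - QP\mu_{k-1}$ as the error and $G_k$ as its coefficient of proportionality. $G_k$ is also known as the Kalman gain matrix.
The normalized marginal distribution corresponding to $p(Y_k \mid Z_1, Z_2, \ldots, Z_k)$ is denoted by
\[ \hat{\alpha}(Y_k) = N(Y_k \mid \mu_k, \Sigma_Y^{(k)}). \]
The normalizing factor of $\alpha(Y_k) = p(Z_1, \ldots, Z_k, Y_k)$ is
\[ \prod_{k_1=1}^{k} c_{k_1} = p(Z_1, Z_2, \ldots, Z_k), \]
where $c_{k_1} = p(Z_{k_1} \mid Z_1, Z_2, \ldots, Z_{k_1-1})$. It gives the relation
\[ \hat{\alpha}(Y_k) = \frac{\alpha(Y_k)}{\prod_{k_1=1}^{k} c_{k_1}}. \]
With $\Gamma_{k-1} = P\,\Sigma_Y^{(k-1)} P^\top + \Sigma_Y$, we also deduce that
\[ c_k\,N(Y_k \mid \mu_k, \Sigma_Y^{(k)}) = N(Z_k \mid QY_k, \Sigma_Z)\,N(Y_k \mid P\mu_{k-1}, \Gamma_{k-1}), \]
\[ c_k\,N(Y_k \mid \mu_k, \Sigma_Y^{(k)}) = N(Z_k \mid QP\mu_{k-1},\, Q\Gamma_{k-1}Q^\top + \Sigma_Z) \times N(Y_k \mid P\mu_{k-1} + G_k(Z_k - QP\mu_{k-1}),\, (I - G_k Q)\Gamma_{k-1}). \]
Hence, we get $c_k = N(Z_k \mid QP\mu_{k-1},\, Q\Gamma_{k-1}Q^\top + \Sigma_Z)$. Next, with $Z = \{Z_1, Z_2, \ldots, Z_k\}$,
\[ \begin{aligned} \psi(Y_{k-1}, Y_k) &= p(Y_{k-1}, Y_k \mid Z) = \frac{p(Y_{k-1}, Y_k, Z)}{p(Z)} = \frac{p(Y_{k-1}, Y_k, Z_1, Z_2, \ldots, Z_k)}{p(Z_1, Z_2, \ldots, Z_k)} \\ &= \frac{p(Z_1, \ldots, Z_{k-1} \mid Z_k, Y_k, Y_{k-1})\,p(Z_k, Y_k, Y_{k-1})}{p(Z_1, Z_2, \ldots, Z_{k-1})\,p(Z_k \mid Z_1, Z_2, \ldots, Z_{k-1})} \\ &= \frac{p(Z_1, \ldots, Z_{k-1} \mid Y_{k-1})\,p(Y_{k-1})}{p(Z_1, \ldots, Z_{k-1})} \cdot \frac{p(Z_k \mid Y_k)\,p(Y_k \mid Y_{k-1})}{c_k} \\ &= \frac{p(Z_1, \ldots, Z_{k-1}, Y_{k-1})}{p(Z_1, \ldots, Z_{k-1})} \cdot \frac{p(Z_k \mid Y_k)\,p(Y_k \mid Y_{k-1})}{c_k} \\ &= \frac{\hat{\alpha}(Y_{k-1})\,p(Z_k \mid Y_k)\,p(Y_k \mid Y_{k-1})}{c_k} \\ &= \frac{\hat{\alpha}(Y_{k-1})\,p(Z_k \mid Y_k)\,p(Y_k \mid Y_{k-1})}{c_k\,\hat{\alpha}(Y_k)}\;\hat{\alpha}(Y_k) \\ &= \frac{N(Y_k \mid \mu_k, \Sigma_Y^{(k)})\,N(Y_{k-1} \mid \mu_{k-1}, \Sigma_Y^{(k-1)})}{N(Z_k \mid QY_k, \Sigma_Z)\,N(Y_k \mid P\mu_{k-1}, \Gamma_{k-1})} \times N(Z_k \mid QY_k, \Sigma_Z)\,N(Y_k \mid PY_{k-1}, \Sigma_Y) \\ &= N\left(Y_{k-1} \mid \mu_{k-1} + H_{k-1}(Y_k - P\mu_{k-1}),\; (I - H_{k-1}P)\,\Sigma_Y^{(k-1)}\right) N(Y_k \mid \mu_k, \Sigma_Y^{(k)}), \end{aligned} \tag{10} \]
where $H_{k-1} = \Sigma_Y^{(k-1)} P^\top (P\,\Sigma_Y^{(k-1)} P^\top + \Sigma_Y)^{-1} = \Sigma_Y^{(k-1)} P^\top \Gamma_{k-1}^{-1}$.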
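Equation (10) shows that $\hat{\alpha}(Y_{k-1})$ and $\hat{\alpha}(Y_k)$ are jointly normal and
\[ \begin{pmatrix} \hat{\alpha}(Y_{k-1}) \\ \hat{\alpha}(Y_k) \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_{k-1} \\ \mu_k \end{pmatrix}, \begin{pmatrix} \Sigma_Y^{(k-1)} & H_{k-1}\Sigma_Y^{(k)} \\ \Sigma_Y^{(k)} H_{k-1}^\top & \Sigma_Y^{(k)} \end{pmatrix} \right). \]
Also, $\psi(Y_{k-1}, Y_k)$ is a Gaussian whose mean components are those of $\hat{\alpha}(Y_{k-1})$ and $\hat{\alpha}(Y_k)$, that is, $\mu_{k-1}$ and $\mu_k$, and the covariance between them is given by
\[ \mathrm{cov}(Y_{k-1}, Y_k) = H_{k-1}\Sigma_Y^{(k)}. \tag{11} \]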
As the latent and observed variables follow a Markov chain, we obtain the joint probability distribution of the sequential data for LDM as follows.
\[ p(Z, Y) = p(Z_1, \ldots, Z_K, Y_1, \ldots, Y_K), \tag{12} \]
\[ p(Z, Y) = p(Y_1) \prod_{k=2}^{K} p(Y_k \mid Y_{k-1}) \prod_{k=1}^{K} p(Z_k \mid Y_k), \tag{13} \]
where $Y = \{Y_1, Y_2, \ldots, Y_K\}$ and $Z = \{Z_1, Z_2, \ldots, Z_K\}$. We get the complete data log-likelihood function by taking the logarithm of (13), which is given by
\[ \ln p(Z, Y \mid \Theta) = \ln p(Y_1 \mid \mu_0, \Sigma_Y^{(0)}) + \sum_{k=2}^{K} \ln p(Y_k \mid Y_{k-1}, P, \Sigma_Y) + \sum_{k=1}^{K} \ln p(Z_k \mid Y_k, Q, \Sigma_Z). \tag{14} \]
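We estimate the model parameters $\Theta = \{P, \Sigma_Y, Q, \Sigma_Z, \mu_0, \Sigma_Y^{(0)}\}$ using maximum likelihood [19]. We need the following expectations to evaluate $\Theta$:
\[ E[Y_k] = \mu_k, \qquad E[Y_k Y_{k-1}^\top] = \Sigma_Y^{(k)} H_{k-1}^\top + \mu_k \mu_{k-1}^\top, \qquad E[Y_k Y_k^\top] = \Sigma_Y^{(k)} + \mu_k \mu_k^\top. \]
We choose the initial model parameter $\Theta^{\text{old}}$ and take the expectation of (14) with respect to $p(Y \mid Z, \Theta^{\text{old}})$, which defines the function
\[ E(\Theta, \Theta^{\text{old}}) = E_{Y \mid \Theta^{\text{old}}}[\ln p(Z, Y \mid \Theta)]. \tag{15} \]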
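We maximize this function with respect to each component of $\Theta$. We maximize (15) with respect to $\mu_0$ and $\Sigma_Y^{(0)}$ by using
\[ E(\Theta, \Theta^{\text{old}}) = -\frac{1}{2} \ln|\Sigma_Y^{(0)}| - \frac{1}{2} E_{Y \mid \Theta^{\text{old}}}\left[(Y_1 - \mu_0)^\top (\Sigma_Y^{(0)})^{-1} (Y_1 - \mu_0)\right] + \text{constant}, \]
where the constant term contains all terms independent of $\mu_0$ or $\Sigma_Y^{(0)}$. Equivalently,
\[ E(\Theta, \Theta^{\text{old}}) = \frac{1}{2} \ln|(\Sigma_Y^{(0)})^{-1}| - \frac{1}{2} \mathrm{Tr}\left((\Sigma_Y^{(0)})^{-1} E_{Y \mid \Theta^{\text{old}}}\left[Y_1 Y_1^\top - Y_1 \mu_0^\top - \mu_0 Y_1^\top + \mu_0 \mu_0^\top\right]\right) + \text{constant}. \tag{16} \]
Optimizing (16) with respect to $\mu_0$ and $(\Sigma_Y^{(0)})^{-1}$, we get
\[ \mu_0^{\text{new}} = E[Y_1], \tag{17} \]
\[ \Sigma_Y^{(0)\,\text{new}} = E[Y_1 Y_1^\top] - E[Y_1]\,E[Y_1^\top]. \tag{18} \]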
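Similarly, to maximize with respect to $P$ and $\Sigma_Y$, we substitute for $p(Y_k \mid Y_{k-1}, P, \Sigma_Y)$ in (14) using (1) and absorb all terms independent of $P$ and $\Sigma_Y$ in the constant term:
\[ E(\Theta, \Theta^{\text{old}}) = \frac{K-1}{2} \ln|\Sigma_Y^{-1}| - \frac{1}{2} \mathrm{Tr}\left(\Sigma_Y^{-1} E_{Y \mid \Theta^{\text{old}}}\left[\sum_{k=2}^{K} \left(Y_k Y_k^\top - P Y_{k-1} Y_k^\top - Y_k Y_{k-1}^\top P^\top + P Y_{k-1} Y_{k-1}^\top P^\top\right)\right]\right) + \text{constant}. \tag{19} \]
Maximizing (19) with respect to $P$ and $\Sigma_Y$, we get
\[ P^{\text{new}} = \left(\sum_{k=2}^{K} E[Y_k Y_{k-1}^\top]\right) \left(\sum_{k=2}^{K} E[Y_{k-1} Y_{k-1}^\top]\right)^{-1}, \tag{20} \]
\[ \Sigma_Y^{\text{new}} = \frac{1}{K-1} \sum_{k=2}^{K} \left\{ E[Y_k Y_k^\top] - P^{\text{new}} E[Y_{k-1} Y_k^\top] - E[Y_k Y_{k-1}^\top] (P^{\text{new}})^\top + P^{\text{new}} E[Y_{k-1} Y_{k-1}^\top] (P^{\text{new}})^\top \right\}. \tag{21} \]
Finally, we substitute for $p(Z_k \mid Y_k, Q, \Sigma_Z)$ in (14) using (2) and absorb all terms independent of $Q$ and $\Sigma_Z$ in the constant term to estimate $Q$ and $\Sigma_Z$.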
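\[ E(\Theta, \Theta^{\text{old}}) = \frac{K}{2} \ln|\Sigma_Z^{-1}| - \frac{1}{2} \mathrm{Tr}\left(\Sigma_Z^{-1} E_{Y \mid \Theta^{\text{old}}}\left[\sum_{k=1}^{K} \left(Z_k Z_k^\top - Q Y_k Z_k^\top - Z_k Y_k^\top Q^\top + Q Y_k Y_k^\top Q^\top\right)\right]\right) + \text{constant}. \tag{22} \]
Maximizing (22) with respect to $Q$ and $\Sigma_Z$, we get
\[ Q^{\text{new}} = \left(\sum_{k=1}^{K} Z_k E[Y_k^\top]\right) \left(\sum_{k=1}^{K} E[Y_k Y_k^\top]\right)^{-1}, \tag{23} \]
\[ \Sigma_Z^{\text{new}} = \frac{1}{K} \sum_{k=1}^{K} \left\{ Z_k Z_k^\top - Q^{\text{new}} E[Y_k] Z_k^\top - Z_k E[Y_k^\top] (Q^{\text{new}})^\top + Q^{\text{new}} E[Y_k Y_k^\top] (Q^{\text{new}})^\top \right\}. \tag{24} \]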
Thus, (17), (18), (20), (21), (23), and (24) give the estimates of the model parameters $\Theta = \{P, \Sigma_Y, Q, \Sigma_Z, \mu_0, \Sigma_Y^{(0)}\}$. The LDM prediction error, when the observation $Z_n$ is predicted as $\tilde{Z}_n$, is given by
\[ \frac{\|Z_n - \tilde{Z}_n\|_F}{\|Z_n\|_F}, \qquad \text{where } \|Z\|_F = \sqrt{Z^\top Z} \]
is the Frobenius norm. The graph of the log-likelihood against the number of EM iterations shows the convergence of the maximum likelihood estimate to the true value of the parameter $\Theta$.
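The relative prediction error defined above can be computed directly for vectorized observations. This is a small sketch with hypothetical numbers; the function names are illustrative, and the norm is applied to the vector form, where the Frobenius norm coincides with the Euclidean norm.

```python
import math


def frobenius_norm(vector):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for x in vector))


def relative_error(actual, predicted):
    """Norm of the prediction residual divided by the norm of the observation."""
    residual = [a - b for a, b in zip(actual, predicted)]
    return frobenius_norm(residual) / frobenius_norm(actual)


z_actual = [3.0, 4.0]          # hypothetical observation vector
z_predicted = [3.0, 3.0]       # hypothetical LDM prediction
error = relative_error(z_actual, z_predicted)
```

Dividing by the norm of the observation makes the error scale-free, so values from windows with different price levels remain comparable.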
4 Application of LDM to Indian Stock Data
We studied daily data for 49 stocks from the National Stock Exchange of India (NSE-50) from April 1, 2013 to August 20, 2020 (1817 epochs). For each stock, we have the daily opening, closing, high, low, and adjusted closing prices, and the volume. We excluded one of the 50 listed stocks, HDFC Life Insurance Company Ltd, because we did not have sufficient data to analyze. We used the vec function to convert the 6 × 49 matrix time series into a vector time series in $\mathbb{R}^{294}$ (6 × 49 = 294). Then, we normalized it over moving windows whose length equals the investment decision period and fitted the LDM. We state the results for a window of length 300. We use 80% of the data to train the LDM, that is, to estimate the parameters $\Theta = \{P, \Sigma_Y, Q, \Sigma_Z, \mu_0, \Sigma_{Y(0)}\}$, and dynamically predict the remaining 20% of the data.
Fig. 1 Price versus time
Figure 1 (price versus time) shows that there is error in the prediction of the original series. However, the LDM trend captures the direction of market movements, which helps in understanding the price behavior of the NSE-50. We state the following observations about the graph.
1. The greater the difference between the original series and the LDM trend, the stronger the momentum (for buying or selling).
2. Whenever the original time series crosses the LDM trend from above, we observe a buying opportunity, and vice versa.
3. If we take 0.5 as the intermediate level, then 0.9–1 is the overbought zone, which signals a reversal from higher to lower. Similarly, 0.1–0.2 is the oversold zone, which signals a reversal from lower to higher.
4. A shallow LDM trend indicates a sustainable trend.
5. A steep LDM trend is not sustainable; look for a reversal.
6. The LDM behaves unpredictably during the major political events of 2019 and the complete COVID-19 lockdown in India from March 24, 2020.
We observe the following in Fig. 2 (error versus time).
1. When the error goes down, making lower peaks and valleys, the original trend is strong, making higher peaks and valleys.
2. This shows an inverse relation between the error and the original trend.
3. When the error slopes downward, with lower peaks and lower valleys, the original trend is less volatile.
4. When the error moves upward, making higher peaks and higher valleys, the original trend is highly volatile.
5. Whenever the error falls in the range 0.2 to 0.3 (volatility squeeze), look for a reversal.
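The data preparation described at the start of this section (vectorization, moving-window normalization, and the 80/20 split) can be sketched as follows. The chapter does not state which normalization was used, so the trailing-window min-max scaling below is an assumption, as are the function names `vec`, `window_minmax`, and `split_train_test`.

```python
import numpy as np


def vec(M):
    """Stack the columns of a matrix into one vector (column-major vec operator)."""
    return M.reshape(-1, order="F")


def window_minmax(series, window):
    """Scale each row of `series` (shape (T, d)) to [0, 1] per coordinate.

    Assumption: min-max scaling over a trailing window of at most `window` rows
    ending at the current row; constant coordinates are mapped to 0.
    """
    out = np.empty_like(series, dtype=float)
    for t in range(series.shape[0]):
        w = series[max(0, t - window + 1): t + 1]
        lo, hi = w.min(axis=0), w.max(axis=0)
        span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
        out[t] = (series[t] - lo) / span
    return out


def split_train_test(series, train_frac=0.8):
    """Chronological split: the first `train_frac` of the rows is used for training."""
    cut = int(train_frac * series.shape[0])
    return series[:cut], series[cut:]
```

With 1817 daily 6 × 49 matrices, applying `vec` to each day gives an array of shape (1817, 294); `window_minmax(series, 300)` then corresponds to the 300-length window, and `split_train_test` keeps the time order so that the last 20% is predicted out of sample.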
Fig. 2 Error versus time
Fig. 3 Log-likelihood versus number of EM iterations
Figure 3 plots the log-likelihood against the number of EM iterations and shows its convergence toward the log-likelihood of the true model.
5 Conclusion
In this chapter, we studied 49 stocks of the National Stock Exchange of India with the help of a linear dynamical model. We observed that the LDM with a Kalman filter can be used as a market indicator with higher accuracy. This indicates that the model inherits some abstract features of the nonlinear and non-stationary structure of the data. It failed as a price prediction model. However, the LDM gave the direction of price movement even during national elections, some major economic decisions, and the COVID-19 period, when the data had higher volatility. The failure of the LDM as a predictive model, together with the features captured by the error graph, leaves room for improving the model. The choice of Gaussian error is a limitation of the model. The model could be applied to other higher-dimensional time series data to explore further applications of the Kalman filter.
Glossary
Kalman filter: A technique for separating noise from sequentially observed data with high accuracy.
Linear dynamical model: A linear model in latent and observed variables with Gaussian noise.
E-Focused Crawler and Hierarchical Agglomerative Clustering Approach for Automated Categorization of Feature-Level Healthcare Sentiments on Social Media
Saroj Kushwah and Sanjoy Das
Abstract This chapter addresses a data science pipeline. In the first stage, a novel focused crawler collects the reviews. In the next stage, the resulting big-data collection of sentiments is preprocessed and cleaned to deal with noise and missing data in the reviews. Frequent pattern growth (FP-growth) association rule mining, part of speech (POS), information on interaction, and sentiWordNet2 are used to detect and extract frequent feature words; the features and sentiment words are then used for orientation detection. Finally, a hierarchical agglomerative clustering algorithm is used to group the contraceptive methods.
S. Kushwah, Noida International University, Greater Noida, India
S. Das, Indira Gandhi National Tribal University, RCM, Imphal, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_7
1 Introduction
The twenty-first century has seen a torrential flow of data. Over the last two decades, these data (sentiments) have appeared across numerous industries, channels, and platforms, including cell phones, e-commerce sites, social media, health inspections, and Internet searches, giving rise to big data [1]. Most large-scale medical data come from social media networks such as Twitter, Facebook, Tumblr, Reddit, Pinterest, and Instagram, as well as from mobile instant messaging apps such as Telegram and WhatsApp [2]. About 500 million tweets are sent daily, of which 40 million are shared each day. Meanwhile, an estimated 4.3 billion Facebook posts with 5.75 billion likes are posted daily. More
than 60 billion web pages are indexed, Google handles more than 3 million requests per minute, and articles are produced from over 15,000 real-time sources [3]. These large masses of medical data (sentiments) contain valuable information for the diagnosis of diseases. Sentiment analysis is a procedure for extracting and understanding the feelings expressed in text [4]. Consequently, retrieving the required content from a huge database is a critical task [5]. Extracting feature-opinion pairs from the large volume of reviews [6] exposes a number of problems and challenges. In the modern world, data (sentiments) are a valuable resource. However, the growing data are unstructured and need to be analyzed to support efficient decisions. This is a time-consuming and complex process for companies, and hence data science is emerging. Data science techniques can be employed to extract useful patterns from these masses of data, providing meaningful information derived from huge amounts of big or complex data. Data science combines several fields, including statistics and computation, to interpret data for decision-making purposes. A data scientist often collects large volumes of sentiments, analyzes them, and interprets them to improve a company's operations. Data science professionals develop statistical models that analyze sentiments and detect patterns, trends, and links in data sets. This information can be used to predict consumer behavior or to identify business and operational risks. Data scientists generally divide sentiment analysis into three levels: document-level, sentence-level, and entity- or aspect-level analysis [4].
Document-level analysis: Document-level sentiment analysis determines the overall sentiment of the document. This level of analysis is helpful when the document can be assumed to concern a single entity or object.
Some authors, such as Kim et al. [7], focus on extraction, and Reynar [8] extracts the subject from the records.
Sentence-level analysis: Sentence-level sentiment analysis is also called subjectivity classification [9]. It distinguishes between subjective and objective data. Each sentence is regarded as an independent unit and is assumed to carry a single opinion. Fragkou et al. [10] and Choi [11] focus on text segmentation, dividing the opinions in the text into segments.
Entity- and aspect-level analysis: Sentiment analysis at the entity and aspect level captures the combination of sentiments. One of its major features is that it examines the opinion directly rather than the paragraphs, sentences, and documents. The objective is to detect sentiments about objects and their components. Therefore, data scientists today concentrate primarily on extracting attributes and the corresponding opinions: Scaffidi et al. [12], Zhu et al. [13], Moghaddam et al. [14], Hu et al. [15], and Popescu et al. [16]. These works are summarized as follows. Elberrichi et al. [17] explore the categorization of text documents using WordNet concepts. It can be used in the representation of
text by using multivariate χ² to include background knowledge. Johnson et al. [18] describe a rule-generating algorithm for text categorization based on decision trees. For sentiment analysis, Choi et al. [19] employ domain-based techniques that include different types of corpus representation steps; the authors prepare a domain-specific setup that combines numerous parameters with the query set. Fermín [20] uses a dependency resources model, a list of features and opinion words connected by patterns, which accurately computes a wider set of opinions. Liu et al. [21] explore explicit and implicit feature methods for retrieving attributes and the corresponding opinions from a dictionary of words together with a compilation of features. A limitation of this approach is that the same word often serves as both the feature and the sentiment indicator. Small-scale coupling, WordNet, and the dictionary alone cannot resolve this successfully, so a procedure is needed that establishes the connection between the opinion and the corresponding feature word. The information hidden in reviews about the methods and products used can be very helpful for people who want to adopt these methods and for the healthcare industry, but analyzing a great many reviews of a given method can be difficult. Netflix also uses algorithms to create custom user recommendations based on users' viewing history. In turn, companies with customer-oriented strategies have to collect customer-oriented information with efficient techniques to remain competitive in the market [22]. Such tools and techniques help an industry determine whether it is gaining or losing, which makes these matters easier to manage. This chapter is organized as follows: Sect. 2 discusses the motivation. Section 3 discusses related work. Section 4 presents the proposed methodology.
The result analysis and evaluation of various methods are discussed in Sect. 5 and the conclusion in Sect. 6.
2 Motivation
Data science is rapidly evolving, and its uses will certainly change lives in the future. The valuable information obtained from gigabytes of data (consumer or patient feedback) in the medical sector plays a key role.
1. Companies use big data and data science in their everyday activities to bring value to consumers.
2. Asset management companies use big data to predict the likelihood that security prices will rise or fall at a certain time.
3. With the help of data science, a business owner can chart all the ups and downs of the business. The many opinions collected provide insights into the business.
4. Most vendors use data scientists to predict the sales performance of contraceptive products, achieving effective results from the huge volume of feedback from patients and consumers.
5. Data scientists use better tools and techniques to compile the various contraceptive products based on customer feedback. This helps us to use the right methods of contraception.
6. The increase in the amount of available information, with massive numbers of user reviews, opens up a new field of study based on big data and helps to develop better functional tools in all sectors.
7. Technological advances and collection techniques make it possible to increase access to data constantly. Based on information collected from various sources and processed with data science methods, individuals can monitor patterns and behavior.
To address these problems, the main contribution is described in six directions: first, data gathering by a new focused crawler; second, preprocessing of the collected reviews; third, an FP-growth approach to extract the frequent features; fourth, identification of infrequent features using corpus, POS, and context information; fifth, a thesaurus and sentiWordNet2 as tools to discover the opinion words and feature opinions for the sentiment; sixth, categorization of contraceptive methods using the HAC data science algorithm. This chapter describes the challenges of gaining understandable knowledge at the entity or attribute level. With the categorization by contraception method, the results of an attribute-based knowledge extraction system can also be improved in terms of accuracy, recall, and precision.
3 Related Work
Many methodologies have been proposed for information extraction. Some approaches are described below; each has its merits and demerits. To improve the classification of men and women using methods of contraception, various approaches related to our proposed methodology must be studied. Huang et al. [23] incorporate part-of-speech and sentence-structure features into conditional random fields to improve the learning process, and they compute the similarity between product feature expressions based on differences in their contexts. This yields an efficient graph-based algorithm that categorizes the collection of feature expressions into different semantic groups. The model of Jia et al. [24] for product feature classification focuses on two-fold clustering methods. Its major contribution is that, instead of relying on context words, it establishes the interrelation between product features and groups the sentiment data by automatically clustering all features of the product under constraints. The major contribution of Wang [25] is tensor factorization for rating inference, which provides precise predictions of unknown ratings. The quality of the recommendation can be improved by extracting opinions from reviews and integrating them into collaborative filtering.
In [26], the authors summarize several attributes of the various focused crawlers. Focused crawlers are divided into two distinct categories, semantic and social semantic. Semantic and social semantic focused crawlers use an ontology to obtain web sites connected to the subject from social sites, where many web pages are shared by people interested in that subject. Classical web crawlers use a breadth-first strategy when traversing web sites and gather all pages, containing relevant as well as irrelevant data, which costs a great deal of storage and time. Shipman and Monteiro [27] first compare deep and focused crawling strategies in terms of the number and variety of sign language videos found. They also examine the trade-offs in classifier accuracy when a cascade classifier is used and motion-based features are computed on sampled frames. Kalmukov and Valova [28] propose different methods for weight calculation and a web-based image building architecture. Akyol et al. [29] study a distributed architecture in which distributed focused crawlers are combined with distributed complex event processing to recognize events and notify their consumers. The distributed focused crawler can be used to track web sites from different data sources, and in this context distributed crawlers serve many users. Singh et al. [30] focus on feature-based opinion mining. First, they propose a multi-feature segmentation methodology, in which opinion sentences covering several features are segmented so that each segment concerns a single feature. Second, context data and a word dictionary are used to identify non-relevant features, and sentiment words are used to identify feature polarity. Finally, an unsupervised (K-medoid) clustering with constraints is used to evaluate efficiency and effectiveness.
Das et al. [31] focus on feature-based review mining and propose a dynamic system that regularly generates, on customer demand, a feature-based summary of selected features with a specific opinion polarity and adapts the summary to the customer's request. First, they present a method for extracting frequent and infrequent features using a word-level probabilistic approach. Second, the method identifies the relevant opinion words and forms pairs of features and sentiment words. Third, they propose an opinion orientation identification algorithm. To summarize the features that are of use to the user with a specific opinion polarity, each feature pair is assigned to a feature-based cluster (positive, negative, or neutral). The authors in [32] propose a solution to the challenge of finding data collections that belong to multiple clusters. This kind of fuzzy procedure may minimize the loss of data when categorizing large amounts of data. In a fuzzy clustering algorithm, clusters contain the data of each patient according to the membership degree assigned to that patient. Patients suffering from more than one disease can belong to more than one cluster center. In this way, information about the occurrence of diseases can be collected accurately. The authors report that fuzzy classification algorithms give more efficient results than other clustering algorithms. This methodology can be improved and adapted to real-time applications in the future.
Among current approaches to public health classification, a word embedding-based clustering methodology has been described for health-related categorization [33]. Unlike other general public health methodologies, this is a word embedding-based technique. The semantic information of words can be represented by word embedding vectors. According to their semantic similarity, the tweets can be categorized into different clusters of similar words, and a tweet is assigned by thresholding its similarity to the word clusters. A higher threshold improves precision, and a lower threshold improves recall. To analyze tweets with the k-means and CS methods, a new hybrid clustering (CSK) method was introduced. The method modifies the random initialization process of CS using k-means solutions, which enhances its performance. This methodology has been validated on four Twitter data sets. For all of the data sets considered, Student's t-test, box plot, and convergence plot analyses were also carried out for a better comparison. The experimental and statistical results demonstrate the efficacy of this methodology. Although this approach achieves higher accuracy than other techniques, improvements in accuracy are still needed [34]. Future study will therefore investigate options for improving precision by incorporating feature selection techniques and various optimization methods. In addition, the handling of sarcastic and ironic tweets can be improved. Contextual information can also be considered for the classification of tweets at the word and post level, along with domain-based ontology.
4 Proposed Methodology
The main objective of the approach presented in this chapter is to categorize the various adopted contraceptive methods from user-generated reviews using attribute-based product parameter classification. It comprises the following six steps:
• Gathering of contraceptive reviews.
• Preprocessing of the gathered reviews.
• Frequent and infrequent feature (attribute) extraction from the review sentences.
• Extraction of opinion words within the sentences.
• Analysis of the polarity of the opinions.
• Clustering of the different contraceptive methods based on features.
4.1 Contraceptive Assessment Gathering
Online reviews play an important role in health care and in everyday life. Before buying anything, most users read online reviews. They collect product information and make the purchase once they have found the right information. These data are collected via the Internet.
The researchers visited various e-mail and social web sites, including Facebook, Twitter, Gmail, WhatsApp, Yahoo, Rediff, and Rocketmail, to collect reviews of contraceptive methods. The researchers selected the rest of Western U.P. From every district they chose one PHC, one CHC, one private (PVT) hospital, and one state hospital or medical facility. A total of 92 hospitals and health centers were selected. They picked 10 reviewers, including men and women, at random from every district in order to collect as much data as possible. Overall, they collected 920 reviews.
4.2 Novel Focused Crawler
A focused web crawler is an essential tool for collecting domain information. It is more efficient than general breadth-first or depth-first search crawlers. The focused search algorithm is the key technique of such a crawler and directly affects the quality of the search. The purpose of this chapter is to develop a new crawling algorithm that enhances the effectiveness of the focused web crawler. It consists of two parts: (1) preprocessing-based search and (2) reverse links. The work of the focused crawler is divided into two phases. In the first phase, pages are collected and filtered; in the second, the seed page URL is chosen to help locate the next child nodes (i.e., the next links for the corresponding pages). The purpose of this algorithm is to make full use of the latent information in the source page S and to apply preprocessing to each web page. Preprocessing yields the basic words we want to find for a specific subject, and it can be done automatically by the machine. The computation is not limited to the page itself; content is also extracted from all hyperlinked web pages. That is, all pages are classified according to the subjects of web page S, using the reverse links that point to S. Suppose that the final topic classes are B1, B2, B3, …, Bn, into which all pages are sorted during the search process. Pages in the relevant class are searched further, because such pages are probably similar to web page S. This algorithm is a reverse-link-based search.
(1) Analyze all links that point to the source web page S.
(2) Preprocess each linked page and compute its relevance to web page S.
(3) If the page is relevant, add it to the relevant queue and continue crawling from it.
(4) Otherwise, classify the page.
(5) If the page belongs to B1, continue crawling from it.
(6) Otherwise, remove the page.
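Steps (1) to (6) can be sketched over an in-memory link graph as follows. This is only an illustration under stated assumptions: the relevance score is a simple keyword-overlap ratio, the classifier is passed in as a function, and the names `relevance`, `focused_crawl`, `inlinks`, `pages`, and `threshold` are hypothetical rather than taken from the chapter.

```python
from collections import deque


def relevance(page_text, topic_terms):
    """Assumed relevance score: fraction of topic terms that occur in the page."""
    if not topic_terms:
        return 0.0
    words = set(page_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def focused_crawl(seed, inlinks, pages, topic_terms, classify, threshold=0.5, max_pages=100):
    """Reverse-link focused crawl starting from the seed page S.

    inlinks  : dict mapping a page id to the ids of pages that link to it
    pages    : dict mapping a page id to its (already preprocessed) text
    classify : function mapping page text to a class label such as "B1"
    """
    queue = deque([seed])
    visited = {seed}
    kept = []
    while queue and len(kept) < max_pages:
        page = queue.popleft()
        kept.append(page)
        for link in inlinks.get(page, []):                  # step (1)
            if link in visited:
                continue
            visited.add(link)
            text = pages.get(link, "")
            if relevance(text, topic_terms) >= threshold:   # steps (2)-(3)
                queue.append(link)
            elif classify(text) == "B1":                    # steps (4)-(5)
                queue.append(link)
            # step (6): any other page is dropped
    return kept
```

In practice, `inlinks` would be filled by fetching and parsing pages, and `classify` would be a trained topic classifier; both are left abstract here.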
4.3 Preprocessing
Preprocessing of machine learning data is a critical step in increasing data quality to promote meaningful insights from the collection of user-generated reviews. In
simple terms, data preprocessing in machine learning is a data science technique that makes raw data readable and understandable. Real-world data are often incomplete: they lack attribute values, lack certain attributes of interest, or contain only aggregate data. Data are noisy when the collected sentiments contain errors or outliers. Data are inconsistent when they contain discrepancies in codes or names [33]. Cleaning the sentiments is the initial step, in which missing values are filled in, noisy data are smoothed, outliers are identified or removed, and inconsistencies are resolved. Data reduction should shrink the enormous data while still producing similar analytical results. Where empty fields arose during data reduction, they were filled with statistically derived values. This process works best when it is performed in a database, because searching for the necessary information in a big database is very challenging. Notably, the contraceptive assessment database needs every field to be complete in order to extract precise results. Preprocessing was therefore carried out to fill in each field and to produce better results for this process. Significant cleaning of the database therefore produces better outcomes.
Step-by-step process of preprocessing (sentiment cleaning)
Algorithm: Noisy content removal and review normalization.
SEED LIST: A part-of-speech (POS) dictionary is used. The seed list is prepared from the wh-family, helping verbs, determiners, prepositions, numbers, punctuation marks, URLs, special characters, blacklisted and nonsensical words, forward and backward slashes, dashes, etc.
NORMALIZATION LIST: It includes hashtags (#), two or more repeated characters reduced to one (such as aaaaa into a, bbbbbb into b, etc.), symbols or numbers mixed with words (“*-99t321 AMAGING”), and suffixes such as s, ss, ies -> y, ly, ful, es, sses -> ss, etc.
The steps of the noisy-data removal algorithm are given below:
Step 1 The seed list and normalization list documents are each segmented and stored in an array.
Step 2 A word, number, or other token is taken from the retrieved review sentences and matched against both lists.
Step 3 If the token matches, it is stored in an array and removed accordingly. The comparison continues until the array is completely exhausted.
Step 4 After completing Step 3, go to Step 2.
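The seed-list removal and normalization steps above can be sketched in a few lines. The seed list below is a tiny hypothetical sample, not the authors' list, and the repeated-character rule is softened to runs of three or more characters (an assumption, so that ordinary words such as "good" or "pill" survive). Suffix normalization is not covered in this sketch.

```python
import re

# Hypothetical miniature seed list (wh-words, helping verbs, determiners, prepositions).
SEED_LIST = {"what", "who", "is", "are", "was", "the", "a", "an", "of", "in", "on", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
REPEAT_RE = re.compile(r"(.)\1{2,}")     # a character repeated three or more times
NON_ALPHA_RE = re.compile(r"[^a-z\s]")   # digits, punctuation, '#', slashes, dashes


def clean_review(text):
    """Return the normalized tokens of one review sentence."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # drop URLs
    text = REPEAT_RE.sub(r"\1", text)     # 'sooooo' -> 'so'
    text = NON_ALPHA_RE.sub(" ", text)    # drop symbols and numbers
    return [tok for tok in text.split() if tok not in SEED_LIST]
```

For instance, `clean_review("The pill is sooooo GOOD!!! #safe http://x.org/a")` keeps only the content-bearing tokens `pill`, `so`, `good`, and `safe`.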
4.4 Identification Feature (Attribute) and Word Opinion Analysis
Feature (attribute) identification and opinion word analysis are among the most difficult tasks. Given a text containing opinion reviews, the feature identification and extraction process is applied and the features are found. This chapter essentially covers the analysis of opinions, which comprises three main steps.
• Feature word extraction.
• Opinion (sentiment) word extraction.
• Feature-opinion pair formation.
Feature identification is performed automatically, with minimal human intervention, and the features are classified into two categories: frequent and infrequent.
4.5 Identification of Frequent Feature
The frequent feature identification process uses nouns, identified by their part of speech, to indicate frequent features. This naturally identifies the feature names of any domain. However, people sometimes use phrases, comment with different opinions, and use different words for the same feature [5]. To resolve this problem of feature identification, we use FP-growth for association rule mining.
FP-GROWTH (for Association Rule Mining)
FP-growth adopts a divide-and-conquer strategy. It compresses the database into an FP-tree structure and then divides the compressed database into a set of conditional databases, each of which is mined separately.
Building FP-Tree
Step 1:
• Scan the database and compute the support of each item.
• Delete infrequent items, whose support is below the minimum support.
• Arrange the remaining frequent items in descending order of frequency for use in the FP-tree.
Step 2:
• For every transaction stored in the database:
• Insert its items into the tree, following the existing path where the prefix is already present.
• In this case, increase the support counter by 1.
• Nodes carrying the same item are linked in a singly linked list, so that any specified frequent item can be extracted from the FP-tree.
Generation of frequent itemsets:
• The header table contains one entry for each frequent item.
• Each header-table entry holds a pointer to the first occurrence of that item in the tree.
• The FP-tree structure keeps the frequent items in compressed form.
In summary, a support count is assigned to every item, and all infrequent items, those below the minimum support, are deleted, which shortens the remaining transactions. The FP-tree is then built with the items ordered by decreasing support, and the count of a node is increased by 1 whenever a transaction passes through it. The frequent features obtained from the above step are added to the current set of features, so that a set of frequent (relevant) features is established.
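The FP-tree construction and mining steps above can be illustrated with a compact FP-growth sketch. It is a generic textbook-style implementation, not the authors' code: the names `Node`, `build_tree`, and `fp_growth` are hypothetical, the header table is kept as a Python list per item instead of a linked list, and each review is assumed to be already reduced to a list of candidate feature words.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its count, its parent, and its children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_tree(transactions, min_support):
    """Build an FP-tree from (items, count) pairs; return the header table and supports."""
    support = defaultdict(int)
    for items, count in transactions:
        for item in set(items):
            support[item] += count
    frequent = {i: s for i, s in support.items() if s >= min_support}

    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, count in transactions:
        # Keep frequent items only, ordered by decreasing support (ties broken by name).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    header, frequent = build_tree(transactions, min_support)
    result = {}
    for item, sup in frequent.items():
        itemset = suffix | {item}
        result[itemset] = sup
        # Conditional pattern base: the prefix path of every node carrying `item`.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        result.update(fp_growth(conditional, min_support, itemset))
    return result
```

A call such as `fp_growth([(review_words, 1) for review_words in reviews], min_support=2)` returns every set of feature words that co-occurs in at least two reviews, together with its support.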
4.6 Identification of Infrequent Feature (Attribute)
Different people express the same aspect in different ways. For example, the opinion word “long” may be used to describe the time or duration associated with a product or method. Such opinion words can be used to extract aspects during infrequent feature extraction. In infrequent feature identification, reviewers use ordinary speech and omit the feature word or replace it with a more complex expression, so the feature word is not stated directly. For this type of review phrase, the antonyms and synonyms in the thesaurus are used to draw context information, together with adjectives, which are easy to find but highly dependent on the domain. These adjectives and adverbs indicate the features present. Finding such features is difficult. Using the given list of adjectives and adverbs, the features that are recognized, individually or in combination, are used to update the feature set. Features that are useless or irrelevant are removed from the feature set. Feature phrases that are not strongly linked to the processed review sentences are also removed from the feature set. Synonyms occurring in the sentences of the set are grouped together, since different terms with the same meaning describe the same method of contraception. After all the opinion words of the reviews have been obtained, we create the feature-opinion pairs: we find the opinion words nearest to the feature words and form the pairs.
4.7 Sentence Level Assessment of Opinion Polarity
After the feature pairs have been created, we calculate the polarity of the opinions with the help of online resources. An online dictionary provides the positive and negative opinion words, and the thesaurus provides the antonyms and synonyms of the opinion words.
E-Focused Crawler and Hierarchical Agglomerative Clustering …
The polarity of the opinions is checked at sentence level, taking negation into account (nothing, never, wasn’t, weren’t, shan’t, is not, didn’t, don’t, haven’t, etc.). A negation word changes the meaning of the opinion word. Words such as no or not can reverse the polarity of the opinion word, as described below:
1. Negation word with a negative opinion word: positive polarity.
2. Negation word with a positive opinion word: negative polarity.
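The two negation rules can be illustrated with a short sketch. The word lists, the function name `sentence_polarity`, and the three-token look-back window are assumptions chosen for the example; the chapter itself takes opinion words from an online dictionary and a thesaurus.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
NEGATIONS = {"no", "not", "never", "nothing", "isn't", "wasn't",
             "weren't", "shan't", "didn't", "don't", "haven't"}
POSITIVE = {"good", "effective", "comfortable"}
NEGATIVE = {"bad", "painful", "expensive"}


def sentence_polarity(sentence, window=3):
    """Return 'positive', 'negative', or 'neutral' for one sentence.

    An opinion word's polarity is reversed when a negation word occurs
    within `window` tokens before it (an assumed heuristic).
    """
    text = sentence.lower().replace("’", "'")
    tokens = re.findall(r"[a-z']+", text)
    score = 0
    for idx, token in enumerate(tokens):
        if token in POSITIVE:
            value = 1
        elif token in NEGATIVE:
            value = -1
        else:
            continue
        preceding = tokens[max(0, idx - window):idx]
        if any(word in NEGATIONS for word in preceding):
            value = -value  # negation flips the polarity
        score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For instance, `sentence_polarity("The pill is not effective.")` applies rule 2 and `sentence_polarity("It was never painful")` applies rule 1.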
4.8 Clustering
In unsupervised machine learning, hierarchical agglomerative clustering is an important and well-established technique that operates in a bottom-up manner. Hierarchical clustering algorithms are deterministic and stable, and they do not require a predetermined number of clusters as input. Agglomerative clustering schemes start from the partition of the data set into singleton nodes and repeatedly merge the current pair of closest nodes into a new node until one final node, comprising the whole data set, is left. Hierarchical clustering is also easy to implement. The hierarchical relationship between clusters is very helpful for interpreting the dendrogram that is produced. The technique accepts any valid distance measure, is less sensitive to differences in cluster shape, and offers good visualization capability.
Algorithm
• In the beginning, each data point forms a single-element cluster (leaf).
• The distance matrix between the clusters is calculated.
Repeat
• At each step of the algorithm, the two closest clusters are merged.
• The distance matrix is updated.
Until
• All the given data points belong to a single cluster (i.e., the root cluster).
Single-link distance: the value assigned to a pair of clusters $K_i$ and $K_j$ is the minimum of the pairwise measure over all objects $X_i \in K_i$ and $X_j \in K_j$:
$$\mathrm{Sim}(K_i, K_j) = \min_{X_i \in K_i,\; X_j \in K_j} \mathrm{sim}(X_i, X_j) \tag{1}$$
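The loop above, with the single-link rule of Eq. (1), can be sketched as follows. The function names and the one-dimensional sample points are assumptions for illustration, and for brevity the sketch recomputes the pairwise cluster distances in every iteration instead of updating a stored distance matrix.

```python
from itertools import combinations


def single_link(cluster_a, cluster_b, dist):
    """Eq. (1): minimum pairwise value between two clusters."""
    return min(dist(x, y) for x in cluster_a for y in cluster_b)


def agglomerate(points, dist):
    """Merge clusters bottom-up until only the root cluster is left.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram is drawn from.
    """
    clusters = [(p,) for p in points]  # every point starts as a leaf
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that is currently closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]], dist),
        )
        d = single_link(clusters[i], clusters[j], dist)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional data with absolute difference as distance.
history = agglomerate([1.0, 2.0, 6.0, 11.0], lambda a, b: abs(a - b))
```

No number of clusters is passed in; a flat clustering is obtained afterwards by cutting the merge history at a chosen distance.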
S. Kushwah and S. Das
5 Evaluation and Analysis
Feature-based clustering techniques are evaluated on five data sets, Method-1 to Method-5, using WordNet, the thesaurus, and the corpus. All reviews were collected from Internet sources and various shopping websites, such as www.amazon.com, www.facebook.com, www.jabung.com, WhatsApp, and healthcare sites. A total of 2020 reviews concerning the specific products shown in Table 1 were collected, and these textual reviews comprise 7578 sentences. In Table 1, Revw is the number of reviews and Sent is the number of sentences collected from the websites for each data set. Method-1 represents long-acting reversible products (such as the IUD). Method-2 represents hormonal contraception, such as pills or injections. Method-3 represents barrier techniques, such as condoms. Method-4 represents fertility awareness, and Method-5 represents permanent contraceptive methods such as vasectomy and tubal ligation. All the experiments were performed on these reviews. Experiments for the clustering of men and women using our clustering approach were carried out.
$$\text{Precision} = \frac{\text{No. of frequent opinions}}{\text{Total number of frequent as well as infrequent crawled data}} \tag{2}$$
$$\text{Recall} = \frac{\text{Sum of total numerical values of precision}}{\text{Total no. of contraceptive methods}} \tag{3}$$
$$\text{Accuracy} = \frac{\text{Sum of the positive as well as negative sentiments}}{\text{Total number of data (opinions)}} \tag{4}$$
Three parameters are calculated over 4838 reviews and 10,837 sentences to assess performance on the products according to the proposed approach. The data contain various feature-opinion words that can be used for the precision, recall, and accuracy assessment. Three approaches have been tested on the various methods of contraception.
1. K-MEAN (POS + NB): a k-means clustering algorithm with part of speech and Naïve Bayes.
Table 1 User generated reviews
Dataset   Method-1   Method-2   Method-3   Method-4   Method-5
Revw      250        410        550        450        360
Sent      940        1500       2060       1710       1368
Table 2 Performance of various clustering and feature categorization approaches using Method-1 to Method-5
Measure     Approach                               Method-1   Method-2   Method-3   Method-4   Method-5
Precision   K-MEAN (POS + NB)                      66         63         60         62         65
Precision   HAC[(POS + FP-GROWTH + CW) + NR]       74         70         67         69         72
Precision   OUR                                    82         89         86         92         85
Recall      K-MEAN (POS + NB)                      68         62         67         64         69
Recall      HAC[(POS + FP-GROWTH + CW) + NR]       74         68         72         69         75
Recall      OUR                                    80         87         89         91         83
Accuracy    K-MEAN (POS + NB)                      70         65         71         66         69
Accuracy    HAC[(POS + FP-GROWTH + CW) + NR]       78         72         80         74         76
Accuracy    OUR                                    92         97         94         93         95

2. HAC [(POS + FP-GROWTH + CW) + NR]: hierarchical agglomerative clustering with part of speech (thesaurus), frequent pattern growth, context word, and Narration Rule.
3. HAC [NFC + IPR + (POS + FP-GROWTH + CW + Cr + Th) + NR]: hierarchical agglomerative clustering with novel focused crawler, improved preprocessing, part of speech, frequent pattern growth, context word, corpus, thesaurus, and Narration Rule. This is the proposed approach, listed as OUR in Table 2.
The HAC [NFC + IPR + (POS + FP-GROWTH + CW + Cr + Th) + NR] approach achieves better precision than K-MEAN (POS + NB). Table 2 shows the performance of the three clustering and classification approaches, measured by precision, recall, and accuracy. The first approach uses the k-means algorithm with Naïve Bayes classification and a POS dictionary, and the second uses hierarchical agglomerative clustering with part of speech, frequent pattern growth, context word, and Narration Rule. The proposed methodology achieves better results on all three parameters (Fig. 1), and its maximum accuracy over the five methods is 97%. The proposed approach, which adds NFC and IPR to the hierarchical clustering algorithm [NFC + IPR + (POS + FP-GROWTH + CW + Cr + Th) + NR], therefore provides higher average accuracy, recall, and precision on Method-1, Method-2, Method-3, Method-4, and Method-5 than the POS + NB and frequent pattern growth (FP-growth) techniques.
Fig. 1 Precision value calculated from k-mean (POS + NB) and HAC[(POS + FP-Growth + CW) + NR] or also HAC [NFC + IPR + (POS + FP-Growth + CW + Cr + Th) + NR] approach
From the above evaluation and comparative analysis of the three clustering methodologies used to classify the contraceptive products, it can be observed that our proposed approach produces significantly better results than the other approaches (Figs. 2 and 3).
Fig. 2 Performance graph based on recall value calculated from k-mean (POS + NB) and HAC [(POS + FP-Growth + CW) + NR] or also HAC [NFC + IPR + (POS + FP-Growth + CW + Cr + Th) + NR] approach
Fig. 3 Performance graph based on accuracy calculated from k-mean (POS + NB) and HAC[(POS + FP-Growth + CW) + NR] or also HAC [NFC + IPR + (POS + FP-Growth + CW + Cr + Th) + NR] approach
The outcome clearly shows that, across the experiments with the various clustering and categorization approaches, HAC [NFC + IPR + (POS + FP-GROWTH + CW + Cr + Th) + NR] achieves better recall than the other approaches, such as POS + NB.
6 Conclusion
The proposed methodology produces an accurate and more relevant summary for feature-based categorization in the healthcare domain. The main advantages of identifying frequent features with the new focused crawler are reduced processing time, reduced memory for storing the visited pages, and less irrelevant information, which is a notable improvement over the classic crawler. We have used the FP-GROWTH (association rule mining) algorithm, and contextual information from the online text dictionary, WordNet, and the thesaurus helps to improve accuracy and to make the best use of storage capacity. Analyzing sentiments by incorporating the features together with the antonyms and synonyms associated with the sentiments enhances the precision, accuracy, and recall of our approach.
Error Detection Algorithm for Cloud Outsourced Big Data Mohd Tajammul, Rabindra Nath Shaw, Ankush Ghosh, and Rafat Parveen
Abstract Cloud storage is a fast-growing technology that allows multiple users to store data in one place. It can accommodate big data that exceeds the limited capacity of local storage. Moreover, it is economical, as it charges customers on a pay-per-use basis. When users transfer data from their local storage to the cloud, errors during transmission are detected and corrected by Secure Sockets Layer (SSL) and Transport Layer Security (TLS). But what happens if the data is altered after it has been stored on cloud space? Suppose data is uploaded to cloud space, and somebody enters that space with malicious intent and changes sensitive or financial data; how will the users come to know that their data was breached and altered while it was stored on cloud space? To maintain data integrity, users encrypt the data with a well-known algorithm such as Data Encryption Standard (DES), Advanced Encryption Standard (AES), Rivest–Shamir–Adleman (RSA), Blowfish, International Data Encryption Algorithm (IDEA), or Dynamic Algorithm. These algorithms may be cracked by trial and error using the large number of computing elements offered by the cloud. This chapter proposes an algorithm that detects errors or unauthorized changes that may occur in big data while it is stored on cloud space.
Keywords Cloud security · Cloud storage security · Error detection · Big data · Integrity · Cloud security algorithm
M. Tajammul · R. Parveen, Jamia Millia Islamia, Jamia Nagar, New Delhi, India
R. N. Shaw, Department of Electronics and Communication Engineering, Galgotias University, Greater Noida, India
A. Ghosh (B), The Neotia University, Sarisha, West Bengal, India
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021. J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_8
M. Tajammul et al.
1 Introduction
1.1 Motivation
Cloud computing is considered the next-generation architecture for IT solutions [1]. Unlike classical computing, it enables users to place their data and deploy their applications on the cloud [1]. Economical and powerful processors, together with “Software as a Service (SaaS),” form what is called a multitenant architecture [2]. High bandwidth and flexible network connections make it possible for clients to subscribe to services from remote locations. The cloud is undoubtedly a promising architecture that offers services at large scale for reasonable charges, but transferring data to a remote location for storage or computation is highly risky, because cloud storage is untrusted space [3–6]. Cloud architecture is robust and multitenant in nature [7–10]. A cloud storage system allows its users to upload their data to the cloud and then retrieve that data with the assistance of a practically unlimited number of available resources [11]. At present, data uploaded to a cloud server needs to be backed up to guard against data loss [11]. Data is handled securely by SSL and TLS while it is transferred between users and the cloud, but once it reaches cloud storage, it resides there for a long time. Since multiple users use the same storage simultaneously, the users’ data is exposed to high risk [5]. Recent research shows that the capacity of a local server, desktop, or laptop does not permit users to store big data on their hard drives. Hence, users have no option other than cloud space for storing such data.
The methods discussed in the literature review in the next section show that there are two ways of checking the integrity of data stored on the cloud: (I) through a third-party audit, or (II) by keeping a backup copy of the data and matching the cloud data against the copy held in local storage. A common question arises for method (II): if we send our data to cloud storage because local storage is inadequate to accommodate it, why keep a copy in local storage? It is simply a waste of the local server. Against method (I): why go to a third-party auditor (TPA), since it is a third party for us just like the cloud service provider (CSP)? The CSP, being a third party, is treated as untrusted; hence, the TPA is also untrusted. This chapter is a small attempt to answer the following research questions:
RQ1: If the CSP is a third party and the TPA is also a third party, can we eliminate the TPA from the auditing?
RQ2: Is it possible for the customer to audit the data without the TPA?
RQ3: Can we develop a method or an application through which we store data on the cloud and, to verify its integrity there, keep only some chunks (residues) of this big data, a few KBs in size, on our local server, so that we save local space and can still audit the data previously stored on the cloud?
2 Literature Review
A number of researchers have focused on the verification of remotely stored data. Ateniese et al. [12, 13] introduced the “Provable Data Possession (PDP)” model to verify possession of files on untrusted storage. They used RSA for auditing outsourced data. In their next research, they focused on dynamic PDP [13]. Naresh Vurukunda and B. Thirumala Rao focused on data storage security [14] and divided the security issues and challenges into three broad categories: data storage issues, identity management and access control issues, and contractual and legal issues. The authors suggested possible solutions for each category. Majhi and Dhal [15] focused on security vulnerabilities in cloud platforms and systematically investigated various possible attacks on different cloud platforms. Moreover, the authors ensured functional correctness as well as security in the context of cloud computing infrastructure. Ejimogu and Basaran [16] carried out a systematic mapping study on soft computing techniques in the cloud environment; the authors sketched an inclusive systematic mapping of recent research and highlighted research gaps. Zhang et al. [17] proposed a security framework for the business cloud, which they called the Cloud Computing Adoption Framework (CCAF). The framework blocked 99.95% of Trojans and achieved more than 85% blocking for 100 attacks. The CCAF took 0.012 s per Trojan or virus detection. Garg et al. [18] developed a framework for ranking cloud computing services. The authors proposed a framework and mechanism to measure quality and prioritize services. The framework produces a healthy environment for competition among various CSPs. Because this framework is available, CSPs try their best to satisfy the Service-Level Agreement (SLA) and subsequently try to improve their Quality of Service (QoS).
Tajammul and Parveen [3] discussed a key generation algorithm for storing data on cloud storage, in which the authors designed and developed an algorithm that dynamically generates the key for data encryption. The authors passed this key to DES for encryption in the subsequent step, but they did not discuss key management in detail. In their subsequent research work, Tajammul and Parveen [4] proposed a two-pass multidimensional and key generation algorithm for cloud storage security. The main idea was to design and develop an algorithm that encrypts data with a unique key every time. The algorithm successfully generated a unique key for each separate file and encrypted the file in the next pass. In their next work, Tajammul and Parveen [5] designed and developed an integrity testing framework capable of testing the integrity of cloud uploaded data at character level. In further work, Tajammul and Parveen [6] proposed an auto-encryption algorithm designed to improve the security of cloud outsourced data. This algorithm takes plain text as input and uploads its cipher text to the cloud. It is a more secure and advanced version of the two-pass algorithm referenced in [4], because it takes one more step than the previous one, which creates more confusion for a malicious insider and makes the framework more robust. Rani and Jindal [19] proposed real-time object detection and tracking using velocity control. This system was capable of detecting and tracking simple to complex objects against simple and complex backgrounds. It was also effective in detecting and tracking the co-occurrence of two objects.
3 Research Gap
Cloud outsourced data suffers from security breaches. Data auditing is a method of finding or checking the activities performed on data uploaded to the cloud. Data auditing requires a TPA, which is a third party and hence untrusted; why, then, go to a TPA at all? Consider a situation in which data is uploaded to cloud storage and is somehow breached there. After entering the user space, a malicious insider makes some changes to the users’ data. A question then arises: how will the parent organization or owner of the data come to know that the data was altered on the cloud? There is no separate concept that answers this question. Existing algorithms, except [5], rely on security algorithms to maintain data integrity, but the encryption and decryption algorithms available in the literature have so far not focused on this problem. Security algorithms work only until they are cracked, and we focus here on the situation in which they have been cracked. Once cracked, a security algorithm fails to maintain integrity. Hence, there is a gap in the literature for an algorithm that can test the integrity of the data, that is, test whether the data is the same as it was when it was uploaded to the cloud.
4 Proposed System
The proposed system is an extension of the algorithm referenced in [6–10]. That integrity testing algorithm was capable of detecting errors in textual data composed of lowercase letters and the digits 0 to 9. To handle the textual data, it computed a 6 × 6 matrix with only 36 entries: 26 entries for the letters and 10 entries for the digits. The proposed algorithm handles 26 lowercase letters, 26 uppercase letters, and 10 digits, plus two more entries, one for the full stop and one for the question mark. To accommodate these entries, the algorithm creates an 8 × 8 matrix with 64 entries in total. The previous version could not accommodate the 26 uppercase letters, the full stop, or the question mark [13, 19]. Some further changes have been made to the architecture of the proposed system, such as dividing the data and encrypting the parts with two separate algorithms before uploading them to cloud storage.
The proposed system takes textual files into consideration and computes an 8 × 8 matrix M1. Once the matrix is computed, the given data is divided into two parts, and each part is encrypted by a separate algorithm before it is uploaded to cloud storage. The two parts are encrypted with different keys, produced by key generation algorithms, and the uploaded data then resides on the cloud for a long time. To test its integrity, the data is downloaded from cloud storage, and a matrix M2 is computed. This matrix has the same number of entries as M1, that is, 64. The method of generating the matrix from the given data is discussed next.
4.1 Matrix Computation
To compute M1 from the input file, first count the occurrences of each character: lowercase letters, uppercase letters, digits, full stops, and question marks. The first entry of M1 holds the frequency of ‘a’, the second entry the frequency of ‘b’, the third entry the frequency of ‘c’, and so on up to the 26th entry, which holds the frequency of ‘z’. Similarly, the 27th entry holds the frequency of ‘A’, the 28th entry the frequency of ‘B’, the 29th entry the frequency of ‘C’, and so on up to the 52nd entry, which holds the frequency of ‘Z’. The 53rd entry holds the frequency of ‘0’, the 54th entry the frequency of ‘1’, the 55th entry the frequency of ‘2’, and so on up to the 62nd entry, which holds the frequency of ‘9’. The 63rd entry holds the frequency of ‘.’, and the last entry holds the frequency of ‘?’. Thus, M1 has 64 entries in total. Because the entries become very large for big data, M1 uses the remainder (modulus) operator %. As soon as the value of any entry reaches one million, the operator %M, where M stands for million, is applied, and the count for that entry starts again from zero (Fig. 1).
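The matrix computation can be sketched as follows. This is a minimal illustration under stated assumptions: the names `ALPHABET`, `frequency_matrix`, and `use_modulus` are hypothetical, characters outside the 64 tracked symbols are simply skipped, and the matrix is filled row by row.

```python
import string

# Entry order described in the text: a-z, A-Z, 0-9, '.', '?' (64 symbols).
ALPHABET = (string.ascii_lowercase + string.ascii_uppercase
            + string.digits + ".?")
INDEX = {ch: k for k, ch in enumerate(ALPHABET)}
MILLION = 1_000_000


def frequency_matrix(text, use_modulus=True):
    """Return the 8 x 8 matrix of character frequencies for `text`.

    With use_modulus=True each entry is kept modulo one million, so a
    counter that reaches 1,000,000 starts again from zero.
    """
    counts = [0] * 64
    for ch in text:
        k = INDEX.get(ch)
        if k is None:
            continue  # assumption: untracked characters are ignored
        counts[k] += 1
        if use_modulus:
            counts[k] %= MILLION
    # Reshape the 64 counters into 8 rows of 8 entries.
    return [counts[r * 8:(r + 1) * 8] for r in range(8)]


m1 = frequency_matrix("abcA0.?a")
```

Only the 64 counters need to be kept locally, regardless of how large the uploaded file is.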
4.2 Algorithm Selection for Encryption
To encrypt the first half of the data, F1, any data security algorithm may be selected from AES, RSA, IDEA, or the two-pass algorithm [4]. DES has not been taken into consideration because it was officially withdrawn in 2005 [20]. To encrypt the second part of the data file, F2, any of the algorithms listed above may be selected, but it must be different from the algorithm already chosen for F1. For example, if AES has been chosen for the encryption of F1, then any algorithm except AES may be chosen for the encryption of F2. After both files have been encrypted successfully, the cipher text is uploaded to cloud storage.
Fig. 1 Architecture of the proposed system
Cloud storage does not allocate contiguous storage space [21], so two separate files uploaded to cloud storage are not necessarily adjacent to each other in the storage space [22]. Dividing file F into F1 and F2 and uploading them to two separate storage spaces protects the user’s data from leakage or alteration of the whole file at once. A malicious insider will be unable to learn the whole data at one time and in one place. This mechanism reduces the chance of hacking.
4.3 Data Downloading, Decryption, Attaching Files Together, and Error Detection
To test the integrity of the cloud uploaded data, the user first needs to download it [23]. After the files have been downloaded from the cloud, decrypted, and joined, matrix M2 is computed in the same way as matrix M1. If the modulus was used for M1, it is also used for M2; if it was not used for M1, it is not used for M2. Once M2 is computed, the next step is to check the entries of M1 and M2 for equality. If both matrices are the same at entry level, there is no error; that is, the data uploaded to the cloud was not altered while it resided there [24, 25, 26]. Otherwise, the entries that differ are examined and the errors are detected. For instance, if the third entry of M2 differs from that of M1, the character ‘c’ in the file has been altered on the cloud by a malicious insider. If two entries differ, such as the 23rd and 56th entries of M1 and M2, then ‘w’ and ‘3’ were altered by the malicious insider.
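The entry-by-entry comparison of M1 and M2 can be sketched as below. The function names are hypothetical, the matrices are kept as flat lists of 64 counters for brevity, and the modulus step is omitted; the entry order (a-z, A-Z, 0-9, full stop, question mark) follows the matrix computation section.

```python
import string

# Same entry order as in the matrix computation: a-z, A-Z, 0-9, '.', '?'.
ALPHABET = (string.ascii_lowercase + string.ascii_uppercase
            + string.digits + ".?")


def frequency_vector(text):
    """64 character counters in entry order (flat form of the matrix)."""
    counts = [0] * len(ALPHABET)
    for ch in text:
        pos = ALPHABET.find(ch)
        if pos >= 0:
            counts[pos] += 1
    return counts


def altered_entries(original_text, downloaded_text):
    """Return (entry number, character) for every entry where M1 != M2.

    Entry numbers are 1-based, as in the text; an empty list means that
    no difference was detected.
    """
    m1 = frequency_vector(original_text)
    m2 = frequency_vector(downloaded_text)
    return [(pos + 1, ALPHABET[pos])
            for pos in range(len(ALPHABET)) if m1[pos] != m2[pos]]


# Hypothetical example: 'w' and '3' are removed from the stored text.
report = altered_entries("saw 3 cabs", "sa  cabs")
```

In practice M1 would be computed before upload and kept locally, so only the 64 stored counters, not the original text, are needed at audit time. As sketched, the check compares frequencies only, so a change that leaves every character count unchanged, such as swapping two characters, would not appear in the report.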
4.4 Algorithm
Algorithm: Error Detection for Big Data
Input: Integer i=0, j=0, and F as plain text
Output: Error detection on F
Step 1: Compute the occurrences of each character of the input text file and store them in matrix M1
  a: If the data is big data, then use %M to reduce the high range of entries, where M stands for Million
  b: If the data is normal data, do not use %M
Step 2: Divide the data file F into two parts, F1 and F2
Step 3: Select encryption algorithm1 for F1, E1=algorithm1_Enc(F1)
Step 4: Select encryption algorithm2 for F2 that is different from algorithm1, E2=algorithm2_Enc(F2)
Step 5: Upload the encrypted files to cloud storage
Step 6: To audit or to detect errors, download the encrypted files from cloud storage
Step 7: Select the decryption function of algorithm1 and decrypt E1, F1=algorithm1_Dec(E1)
Step 8: Select the decryption function of algorithm2 and decrypt E2, F2=algorithm2_Dec(E2)
Step 9: Combine F1 and F2 as F=F1+F2
Step 10: Compute M2 in the same fashion as M1 was computed
Step 11: To detect errors or to audit data, perform steps 12a and 12b
Step 12:Whilei0.6 & ≤1.0
Strongly Negative Negative Weakly Negative Neutral Weakly Positive Positive Strongly Positive
J. Gautam et al.
3.3 Summarized Report on the Information
The system uses the information from the database created earlier to compute what percentage of the people whose tweets were analyzed express each sentiment toward a particular topic. The report is presented as a set of percentages, one for each of the seven sentiments for which the tweets were analyzed.
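A minimal Python sketch of such a report is given below. Only the last polarity range (greater than 0.6 and at most 1.0 for Strongly Positive) is taken from the text; the 0.3 cut-offs and the mirrored negative ranges are assumptions made for illustration, as are the function names.

```python
from collections import Counter

LABELS = ["Strongly Negative", "Negative", "Weakly Negative", "Neutral",
          "Weakly Positive", "Positive", "Strongly Positive"]


def classify(polarity):
    """Map a polarity in [-1, 1] to one of the seven sentiment labels.

    Only the range (0.6, 1.0] comes from the source; the 0.3 cut-offs
    and the mirrored negative ranges are assumptions.
    """
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    if polarity == 0:
        return "Neutral"
    if polarity > 0.6:
        return "Strongly Positive"
    if polarity > 0.3:
        return "Positive"
    if polarity > 0:
        return "Weakly Positive"
    if polarity >= -0.3:
        return "Weakly Negative"
    if polarity >= -0.6:
        return "Negative"
    return "Strongly Negative"


def sentiment_report(polarities):
    """Return the percentage of tweets falling in each sentiment class."""
    counts = Counter(classify(p) for p in polarities)
    total = len(polarities)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}


report = sentiment_report([0.8, 0.7, 0.1, 0.0, -0.5])
for label, pct in report.items():
    print(f"{label}: {pct:.1f}%")
```

The resulting dictionary has one entry per sentiment, so it can be passed directly to a charting routine.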
3.4 Pie Chart
To make it easier for the user to compare the results, a pie chart based on the report is created.
3.5 Heat Map
The key feature of our system is its ability to create a heat map using the GPS location of the tweeter at the time of the tweet. The location is plotted as a point on the map, and this point grows as more tweets come from the same location. This gives insight into people’s views on a particular topic by location, and lets the user understand public opinion on an issue in a specific location.
4 System Architecture
The system is implemented in six steps (Fig. 1). The first step is to input a topic, in the form of a hashtag, on which the system will perform its analysis, along with the number of tweets that the system will work with. The next step is data extraction from Twitter servers using the ‘tweepy’ library, followed by data cleaning to remove unnecessary information and garbage values. The third step is the sentiment analysis of individual tweets using the ‘TextBlob’ library to find the polarity of each tweet as a float value between ‘−1’ and ‘1’. The fourth step is the construction of a database, in an Excel sheet, of the gathered tweets and the information associated with each tweet, such as the real name of the tweeter, screen name, time of tweeting, polarity, and GPS location. The fifth step is the analysis of the database created in the fourth step to produce a summarized report of the sentiments of tweeters and to plot that information on a pie chart using the ‘pyplot’ library. This gives the user a visual representation of what percentage of the retrieved tweets show each sentiment, enabling a direct comparison to assess public opinion. Finally, the last step will use
Twitter Data Sentiment Analysis Using Naive Bayes Classifier …
Fig. 1 Flowchart of the system architecture
the GPS location fetched in the database and plot that information as points on a heat map. The points will get bigger as more tweets pop up from the same location.
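The last step can be sketched in Python as an aggregation of tweets by location, so that each plotted point carries a weight equal to the number of tweets at that place. The rounding precision, the field names, and the sample coordinates are assumptions for illustration only.

```python
from collections import Counter


def heatmap_points(locations, precision=2):
    """Group tweets by rounded (lat, lon) and return weighted points."""
    counts = Counter(
        (round(lat, precision), round(lon, precision))
        for lat, lon in locations
    )
    # Each point carries a weight that a map layer can use to scale it.
    return [{"lat": lat, "lon": lon, "weight": n}
            for (lat, lon), n in counts.most_common()]


# Hypothetical GPS coordinates of three tweets.
tweets = [(28.6139, 77.2090), (28.6141, 77.2093), (19.0760, 72.8777)]
for point in heatmap_points(tweets):
    print(point)
```

Rounding the coordinates merges nearby tweets into a single point, which is what makes the point grow as more tweets arrive from the same area.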
4.1 Algorithms for Analyzing Sentiments
Several kinds of algorithms can be used for analyzing sentiments. Rule-based approaches use NLP techniques and a lexicon. Other approaches are automatic or hybrid. Automatic approaches are generally based on machine learning techniques; one of them uses the Naive Bayes classifier, which is discussed in the next section.
4.2 Naive Bayes Classifier
Naive Bayes classifiers belong to the family of probabilistic classifiers and are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers. This system uses a hybrid approach, combining rule-based and automatic systems, in order to achieve higher accuracy of sentiment calculation. The ‘TextBlob’ library has been used for processing text data. It provides an API for natural language processing (NLP) tasks.
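The closed-form training described above amounts to counting. The following Python sketch of a multinomial Naive Bayes classifier over word counts illustrates this; it is not the TextBlob implementation, and the add-one smoothing, class name, and toy training sentences are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        # Training is a single counting pass over the data (linear time).
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(label) + sum of log P(word | label)
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:   # words never seen are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(
    ["great phone love it", "love this great day",
     "terrible service hate it", "hate this awful day"],
    ["positive", "positive", "negative", "negative"],
)
print(model.predict("love this phone"))
print(model.predict("awful terrible service"))
```

Working with log-probabilities avoids numerical underflow when many word probabilities are multiplied.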
4.3 Google Maps Heat Maps Layer
A heat map visualizes the intensity of data at geographical points. When the heat map layer is enabled, a colored overlay appears on top of the map. Higher-intensity areas are colored red, and lower-intensity areas appear green. The heat map layer provides client-side rendering of heat maps. In this system, the size and intensity of the plotted points increase with the volume of tweets of a particular sentiment.
In the above example, bubbles having a redder tone and thicker size are places having more intensity of a particular sentiment than places having a lighter and greener tone.
5 Case Study
The following test cases were run:
The number of locations plotted on the heat map depends on the number of tweeters who have granted GPS permission to their Twitter app or to the Twitter Web site. The system can be modified to plot points on the heat map only for tweets showing negative emotions, only for tweets showing positive emotions, or for any of the seven sentiments included in the system.
6 Naive Bayes Accuracy Test Using Weka
Weka is a tool for applying machine learning algorithms. It provides functions for data preparation, classification, regression, clustering, etc. For Naive Bayes classification, the first step is to extract the relevant fields from the CSV file, which in our case are screen name, sentiment, and score.
From the Tools section in Weka, the extracted dataset is viewed using the ARFF viewer and then saved in the “.arff” file format, so that Naive Bayes classification can later be performed on it. For classification, open the Explorer and load the dataset into Weka. After that, click on the Classify tab, and select Naive Bayes as the classifier.
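As an alternative to saving the file through the ARFF viewer, the conversion of the three extracted fields can be sketched in Python. The column names (screen_name, sentiment, score), the relation name, and the function name are assumptions; the output follows the general ARFF layout of a header with @relation and @attribute lines followed by a @data section.

```python
import csv
import io

SENTIMENTS = ["Strongly Negative", "Negative", "Weakly Negative", "Neutral",
              "Weakly Positive", "Positive", "Strongly Positive"]


def csv_to_arff(csv_text, relation="tweets"):
    """Convert CSV text with screen_name, sentiment, score columns to ARFF."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    nominal = ",".join(f"'{s}'" for s in SENTIMENTS)
    lines = [
        f"@relation {relation}",
        "",
        "@attribute screen_name string",
        "@attribute sentiment {" + nominal + "}",
        "@attribute score numeric",
        "",
        "@data",
    ]
    for row in rows:
        # Quote text values and escape embedded single quotes.
        name = row["screen_name"].replace("'", "\\'")
        lines.append(f"'{name}','{row['sentiment']}',{float(row['score'])}")
    return "\n".join(lines) + "\n"


sample = "screen_name,sentiment,score\nalice,Positive,0.5\nbob,Neutral,0.0\n"
print(csv_to_arff(sample))
```

Declaring the sentiment as a nominal attribute gives the classifier a fixed set of class values rather than free text.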
6.1 Classifier Output
6.2 Results
100 tweets were used for testing, giving 96% accuracy with a relative absolute error of 10.36% and a root relative squared error of 34.80%. This indicates that our classification of tweets using the Naive Bayes classifier is fairly accurate and can be relied upon.
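For reference, the two relative error measures can be sketched in Python using their generic definitions for numeric targets, in which the errors of the predictions are divided by the errors of always predicting the mean. This is an illustration only; Weka’s own computation for nominal classes may differ in detail, and the sample values are hypothetical.

```python
import math


def relative_errors(actual, predicted):
    """Return (RAE, RRSE) in percent, relative to predicting the mean."""
    mean = sum(actual) / len(actual)
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    abs_base = sum(abs(a - mean) for a in actual)
    sq_base = sum((a - mean) ** 2 for a in actual)
    if abs_base == 0 or sq_base == 0:
        raise ValueError("actual values must not all be equal")
    rae = 100.0 * abs_err / abs_base
    rrse = 100.0 * math.sqrt(sq_err / sq_base)
    return rae, rrse


rae, rrse = relative_errors([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(f"RAE = {rae:.2f}%, RRSE = {rrse:.2f}%")
```

Values below 100% mean that the predictions are better than the mean baseline.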
7 Conclusion
From the above results, it can be concluded that this system is easy to use and understand and allows anyone to easily research people’s views on any topic. Rather than going through Twitter data manually, or using a Web site to download datasets and then a separate tool to analyze them, a single simple tool can be used. The added functionality of this system is that heat maps are constructed, letting the user easily visualize trends by location. Such a tool for analyzing micro-blogging sites is highly useful for corporations that depend on people’s opinions and market trends to create new products that sell easily. The opinion polls that are often conducted right before elections can also be automated with this system. Knowing the opinion of people in a location about a particular political party can help the parties mend their ways or make better decisions. Hence, this system has a wide range of uses.
8 Future Scope
This system has immense scope for future improvement.
(1) A graphical user interface or GUI can be implemented to make the system more user-friendly.
(2) A method can be added to allow the user to import older datasets of Twitter data for analysis.
(3) All the inputs can be displayed in a single window to enable more comfortable viewing for the user.
References
1. A. Kumar, T.M. Sebastian, Machine Learning assisted sentiment analysis, in Proceedings of International Conference on Computer Science and Engineering (ICCSE’2012) (2012)
2. C.M. Bishop, Pattern Recognition and Machine Learning
3. M. Lutz, Programming Python, 4th edn
4. S. Mandal, S. Biswas, V.E. Balas, R.N. Shaw, A. Ghosh, Motion prediction for autonomous vehicles from Lyft dataset using deep learning, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA) (Greater Noida, India, 2020), pp. 768–773. https://doi.org/10.1109/ICCCA49541.2020.9250790
5. Y. Belkhier, A. Achour, R.N. Shaw, Fuzzy passivity-based voltage controller strategy of grid-connected PMSG-based wind renewable energy system, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA) (Greater Noida, India, 2020), pp. 210–214. https://doi.org/10.1109/ICCCA49541.2020.9250838
6. S. Mandal, V.E. Balas, R.N. Shaw, A. Ghosh, Prediction analysis of idiopathic pulmonary fibrosis progression from OSIC dataset, in 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON) (Greater Noida, India, 2020), pp. 861–865. https://doi.org/10.1109/GUCON48875.2020.9231239
7. R.N. Shaw, P. Walde, A. Ghosh, IOT based MPPT for performance improvement of solar PV arrays operating under partial shade dispersion, in 2020 IEEE 9th Power India International Conference (PIICON) (Sonepat, India, 2020), pp. 1–4. https://doi.org/10.1109/PIICON49524.2020.9112952
8. M. Kumar, V.M. Shenbagaraman, R.N. Shaw, A. Ghosh, Predictive data analysis for energy management of a smart factory leading to sustainability, in Innovations in Electrical and Electronic Engineering, ed. by M. Favorskaya, S. Mekhilef, R. Pandey, N. Singh. Lecture Notes in Electrical Engineering, vol. 661 (Springer, Singapore). https://doi.org/10.1007/978-981-15-4692-1_58
9. B. Pang, L. Lee, A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts, in Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04) (2004), pp. 271–278
10. M. Hu, B. Liu, Mining and summarizing customer reviews, in Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Seattle (2004)
11. A. Esuli, F. Sebastiani, Determining term subjectivity and term orientation for opinion mining, in 11th Conference of the European Chapter of the Association for Computational Linguistics (2006)
12. A. Go, R. Bhayani, L. Huang, Stanford University, Twitter sentiment classification using distant supervision, in The Third International Conference on Data Analytics (2009)
13. E. Kouloumpis, T. Wilson, J. Moore, Twitter sentiment analysis: the good the bad and the OMG! in The Fifth International AAAI Conference on Weblogs and Social Media (2011)
14. H. Saif, Y. He, H. Alani, Semantic sentiment analysis of twitter, in The 11th International Semantic Web Conference (ISWC 2012) (11–15 November 2012, Boston, MA, USA)
15. G.A. Miller, R. Beckwith, C.D. Fellbaum, D. Gross, K. Miller, WordNet: An online lexical database. Int. J. Lexicograph. 3(4), 235–244 (1990)
16. A. Bifet, E. Frank, Sentiment knowledge discovery in Twitter streaming data, in Proceedings of 13th International Conference of Discovery Science, Berlin (Springer, Germany, 2010)
17. B. Pang, L. Lee, Opinion Mining and Sentiment Analysis
18. T.M. Mitchell, Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression
19. A. Goel, J. Gautam, S. Kumar, Real time sentiment analysis of tweets using Naive Bayes, in Proceedings 2nd International Conference on Next Generation Computing Technologies, 14–16 Oct 2016
Computing Mortality for ICU Patients Using Cloud Based Data
Sucheta Ningombam, Swararina Lodh, and Swanirbhar Majumder
Abstract Computing mortality for ICU patients, who are in critical condition and in need of intensive care, is a major problem. The focus of this work is to predict patient mortality from health record data from the ICU Mortality Prediction Challenge. Data from the first 24 h are used to predict in-hospital death with a few machine learning models. In this health-record-based work, personal health information of ICU patients is recorded and observed by physicians. These methods are cost-effective, reliable, and easily accessible, and are maintained on a Cloud platform to increase the quality of service. We use 6 general descriptors recorded at the time of admission to a particular unit ward, together with time-series measurements collected during the first 24 h. This chapter focuses on predicting the mortality of ICU patients by examining their health-care data. We use an online mode that can easily be accessed by physicians, patients, and other staff members. It therefore has considerable potential to provide accurate results in a simple and easily accessible way. Since little research is available on ICU patients with Cloud Computing, our approach has the potential to advance the prediction of mortality for in-hospital ICU patients using machine learning.
Keywords Machine learning · Cloud computing · Health-care · Mortality prediction · ICU patients
S. Ningombam (B) · S. Lodh · S. Majumder Department of Information Technology, Tripura University, Agartala, Tripura, India e-mail: [email protected] S. Lodh e-mail: [email protected] S. Majumder e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_11
S. Ningombam et al.
1 Introduction
In the past few years, clinical practice has increasingly drawn on the vast data sets produced by high-performance systems used to study both the human body and cellular machinery [1]. The Intensive Care Unit (ICU) provides radical, life-saving treatment, such as mechanical ventilation, to patients who are seriously ill. ICUs have a very large workforce that continuously monitors all patients to ensure that any changes in a patient’s condition are detected and corrected before they become fatal, a strategy that has been shown to improve results. In-hospital mortality is understood to depend on the patient’s demographics and presenting conditions, as well as on simple, routinely performed tests and measurements recorded within the initial hours after admission. Prediction models such as the APACHE IV, SAPS 3, and MPM0 III scores have been developed over the last three decades, primarily to assess the effectiveness of treatments [2]. The problem of predicting mortality in the ICU has been approached from a statistical point of view through mortality prediction models such as APACHE, SAPS, and MODS [3], which use a set of functional predictors, numerical factors, and the occurrence of certain chronic conditions to compute a score that is a proxy for the probability of mortality of ICU patients. ICUs have always played an important role in hospitals. They are well equipped and staffed by a large number of medical workers; thus, accurate assessment of the physiological status of patients in ICUs is useful for the rational allocation of medical resources, while providing major safeguards for the treatment of patients [4]. The longer patients remain in hospital, the more treatment they receive, yet their medical prognosis does not necessarily improve. The mortality rate increases with an extended ICU stay (over 30 days), and more than 70% of patients die at or within one year of treatment in the hospital.
The mortality rate thus increases with the length of stay. It should be kept in mind, however, that long periods of hospitalization also reflect the seriousness of the patients’ condition. One of the main advantages of Cloud Computing over other techniques is the ability to set up what is essentially a virtual office, in which the web-based devices used in today’s business environment can easily connect to and access information [5]. Switching to cloud computing can reduce the management and maintenance costs of medical systems. Instead of purchasing expensive systems and equipment, an organization can reduce costs by using the resources of the cloud computing service provider. As the system is based entirely on a web server, all the data in the database are stored within the cloud. Another benefit is that when physical devices become corrupted or stop working properly, a web service provider such as AWS can still hold the large volume of data. Moreover, data can be accessed from anywhere at any time, so staff who are off-site can connect to the virtual office quickly and simply.
1.1 Motivation and Goal
Predicting in-hospital death of ICU patients is a major problem worldwide. Many lives are lost in critical care wards because survival is not predicted in time. This creates the need for a specific way of predicting in-hospital death that can be used worldwide to improve mortality prediction using machine learning based algorithms. Different types of classification algorithms from the field of machine learning were explored to forecast mortality and assign mortality risk to patients, using specific clinical variables derived from patient reports in a given data set. In addition, a web application, “ICU INFORMATION PORTAL”, was created to illustrate how a hospital can manage the maintenance of ICU patient data using cloud computing. The work provides a website where patients and doctors can check results and, most importantly, predicts patient mortality. The work is motivated both by product development and by research: research on mortality prediction for critically ill patients can be pursued not only with deep learning or machine learning but also by using the cloud platform to speed up the medical process of predicting mortality and so help save lives. More medical-record-based applications with advanced features could help many ICU patients. The main goal of this work is to help predict the in-hospital death rate for ICU patients in critical condition using machine learning algorithms together with 4 metrics. We also created a website to flag abnormal lab test results for in-hospital ICU patients. When hardware fails or stops working properly, it becomes hard to check results, and as the database grows into big data, a local system is interrupted more often.
We have created the database on a cloud server so that the system is not interrupted and can store big data without delays in processing. The site can be reached at https://icuportal.000webhostapp.com/.
1.2 ICU Data Sources
The ICU is a hospital ward where specialized medical services are provided to care for critically ill patients. Quick and accurate decisions about patients are important. This has led to the development of a wide range of decision support systems that help intensivists prioritize patients who have a high risk of mortality [6]. Most mortality estimation systems are score-based models that estimate disease severity to predict outcomes. These models are used to evaluate ICU performance using patient demographics and physiological variables, such as the urine sample, blood sample, and heart rate, collected within 24–48 h after ICU admission [7]. The score-based models use some features that are not always available at ICU admission. Also, they make decisions only after the first 24 h of ICU admission, in line with a
set of knowledge. Refined models adapt the score-based models for use under specific conditions to improve performance; for example, one model predicts the chance of death due to cardio-respiratory arrest [8]. We obtained data from the PhysioNet ICU clinical database to develop machine learning models. All the data sets used are open ICU data. Other studies use data from individual ICU hospitals, but here we use one set of 4000 text files (training, testing, and final validation sets). Each text file is recorded as a time series, but certain tests are recorded once and others several times, because different hospitals record the data in different ways. The patients were admitted to cardiac, medical, surgical, and trauma ICUs for a wide range of reasons. ICU stays of less than 48 h were excluded. Patients with DNR (do not resuscitate) or CMO (comfort measures only) directives were not excluded. In the first 48 h following admission to the ICU, up to 42 variables were recorded at least once [9]. Not all variables are available in all cases. Six of these variables are general descriptors (collected at admission), while the remainder are time series for which multiple observations may be available [10]. We store patient data to help patients understand their results without consulting the doctor, so that they can learn whether or not a result is normal. In this work, we process the patient data set with machine learning in Python to predict in-hospital death. We used 2 types of machine learning algorithm, evaluated with 4 metrics, to predict the result [11, 12]. For the web application we used the cloud platform: a model that uses free-text medical notes to predict abnormalities in ICU patients and stores them in the cloud database for further use and for continued checking of the patient’s health results.
The reference data against which the patients’ test results are compared are collected from PhysioNet, a freely accessible critical care database. The application is implemented in PHP with a MySQL server using cloud computing; it takes the raw lab results entered by the technician, produces a prediction with minimal processing, and shows visualizations of importance, so as to predict the patient’s death or survival according to the model. Moreover, we check the health of the patients and how long a stay is needed for each individual patient who is admitted.
1.3 Variables and Attributes
The data used in SAPS III are the patient features recorded within the first 24 h after ICU admission [13]. These variables are series of measurements made once or several times during regular monitoring. We initially collected the data for distinct ICU stays (all text files available in the PhysioNet ICU challenge) [14]. We handled all ICU stays, including re-admissions of patients who were admitted multiple times. These stays mostly correspond to patients above 50 years of age admitted to the ICU for surgical, medical, neurological, or coronary critical illness. Variables for the different tests were recorded at least once during the first 24 h after admission to the ICU. Six
of these variables are general descriptors. Not all variables were available in all cases: some tests were performed once, some more than once, and some not at all. The 6 commonly used general descriptors are:
• Age
• Weight
• Gender
• ICU Ward Number
• Length of Stay (in hrs.)
• Report submission date
Table 1 lists all the variables together with their normal values and physical units.
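A check of lab results against such normal ranges can be sketched in Python as follows. The four ranges shown are copied from Table 1; the function name and the sample results are hypothetical, and the sketch is not the PHP implementation used by the portal.

```python
# A few normal ranges copied from Table 1: variable -> (low, high).
NORMAL_RANGES = {
    "Sodium": (135, 148),
    "Glucose": (70, 130),
    "pH": (7.35, 7.45),
    "Respiration rate": (12, 20),
}


def flag_abnormal(results):
    """Return {variable: 'low' or 'high'} for values outside the range."""
    flags = {}
    for name, value in results.items():
        if name not in NORMAL_RANGES:
            continue  # no reference range available for this variable
        low, high = NORMAL_RANGES[name]
        if value < low:
            flags[name] = "low"
        elif value > high:
            flags[name] = "high"
    return flags


print(flag_abnormal({"Sodium": 150, "Glucose": 95, "pH": 7.30}))
# {'Sodium': 'high', 'pH': 'low'}
```

Variables without a reference range are skipped rather than flagged, so missing table entries do not produce false alarms.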
1.4 Telemedicine
One of the benefits of telemedicine is that medical knowledge is transferred from one site to another over the Internet, supporting the delivery of care to patients and the training and development of hospitals and healthcare providers. Although many clinicians may regard telemedicine as a technology, all ICU clinicians have been active in telemedicine for many years through frequent telecommunication consultations. Intensivists rely heavily on the findings and evaluations of on-site nurses and house staff, who provide off-site doctors with the information they need [12]. The first report [15] of a trial of ICU telemedicine dates from 1982. University-based intensivists provided telemedical consultation for ICU patients over a period of months. Intensivists provided intermittent consultative recommendations and interacted with patients, nurses, and physicians through video and telephone. Teleconsultation models [4] also assist in the care of acutely ill patients in other settings, reducing the length of stay of very low birth weight infants in neonatal ICUs, influencing trauma patient management and transfer, and providing critical care inpatient consultations. This work [16] supports health care by helping patients learn their results quickly. Electronic communication and information technologies are used to provide and support healthcare when distance separates the participants. The term is most commonly applied to interactive video applications, usually for consultations with specialist or subspecialist physicians. It is often broadened to include educational, academic, administrative, and clinical applications involving nurses, psychologists, administrators, and others.
1.5 Machine Learning
The most common approach for predicting the future or classifying information is machine learning [17, 18]. Algorithms for machine learning are trained by
Table 1 Normal values with physical units
Variables                    Normal values               Physical units
Serum creatinine             F (1.2), M (1.4)
Glomerular filtration rate   90 or above
Blood urea nitrogen          7–20
Urine protein                1.0–1.8
Osmolarity                   280–295                     mmol/L
Uric acid                    0.1–0.4                     mmol/L
Urea                         2.5–6.5                     mmol/L
Albumin globulin             1.0–1.8                     mmol/L
Creatinine clearance         F (78–116), M (71–135)      ml/min
Calcium                      8.4–10.6                    mg/dl
Alkaline phosphatase         27–110                      U/L
Sodium                       135–148                     mmol/L
Renin                        0.15–3.95                   pg/ml/hr
Potassium                    3.4–4.5                     mEq/L
Creatinine urine             60–250                      mg/dl
Albumin                      35–50                       g/dl
ALP                          44–147                      IU/L
ALT                          M (29–33), F (6–34)         IU/L
AST                          M (8–40), F (6–34)          IU/L
Bilirubin                    0.3                         mg/dl
BUN                          M (0.9–1.3), F (0.6–1.1)    mg/dl
Cholesterol                  110–220                     mg/dl
Creatinine                   M (0.7–1.3), F (0.6–1.1)    mg/dl
Glasgow coma score           3–15
Glucose                      70–130                      mg/dl
HCT                          M (45–52), F (37–48)        %
Heart rate                   (60–100), (70–100)          bpm
Serum potassium              3.5–5.5                     mEq/l
Lactate                      2                           mmol/dl
Serum magnesium              0.7–1                       mmol/l
MAP                          60                          mmHg
Mechanical ventilation       0: false, 1: true           (yes/no)
Serum sodium                 135–145                     mEq/L
PaCO2                        38–42                       mmHg
PaO2                         75–100                      mmHg
pH                           7.35–7.45                   (0–14)
Platelets                    150,000–450,000             cells/nl
Respiration rate             12–20                       bpm
Fig. 1 How machine learning works
instances or examples, learning from past experience and analyzing historical data. As an algorithm trains on the examples, it identifies patterns that it then uses to predict the future [19]. It can access data and use it to make better decisions automatically, without human intervention. This allows large volumes of data to be analyzed with predictive accuracy. To perform classification and future prediction, the machine learning algorithm uses patterns in the training data [10]. For the prediction of in-hospital death, we used supervised learning in this chapter. Models built with different structured approaches can be compared on the basis of their final accuracy. Various types of metrics are used in machine learning; we used 4 of them, along with confusion matrices, for 2 different algorithms to assess the accuracy of the prediction (Fig. 1).
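The evaluation described above can be sketched in Python for a binary outcome (1 for in-hospital death, 0 for survival). The text does not name the four metrics, so accuracy, precision, recall, and F1-score are assumed here; the function names and sample labels are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Return (tp, fp, fn, tn), treating label 1 (death) as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def metrics(actual, predicted):
    """Return accuracy, precision, recall, and F1 (the assumed 4 metrics)."""
    tp, fp, fn, tn = confusion_matrix(actual, predicted)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_matrix(actual, predicted))
print(metrics(actual, predicted))
```

Reporting recall alongside accuracy matters for mortality prediction, because deaths are usually the minority class and a model can be accurate while missing most of them.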
1.6 Cloud Computing
Cloud computing, also known as Internet computing, is a kind of computing that provides the computers and devices of an organization with various resources, such as servers, storage, and applications, via the Internet, as an on-demand service hosted in corporate data centers. The cloud enables the data center to operate like the Internet, with computing resources accessed in a secure and scalable way [9]. In its simplest description, cloud computing takes services (“cloud services”) and moves them to shared systems outside the firewall. Applications and services are accessed via the web instead of from the hard drive. In cloud computing, the services are delivered and used over the Internet and are typically paid for by cloud customers on an “as-needed, pay-per-use” business model. The cloud infrastructure is maintained by the cloud provider, not the individual cloud customer (Fig. 2).
1.7 Domain Hosting
Web hosting gives organizations and individuals the ability to publish websites or web pages. A web hosting service provider is a company that offers
Fig. 2 Cloud computing
Fig. 3 Web hosting
the technology and services needed for a website or web page to be viewed on the Internet. Websites are hosted, or stored, on special computers called servers. To view a website, a user only needs to enter its address or domain in a browser. The user’s device then connects to the server, and the web pages are delivered from the server to the browser. Most hosting companies let you host your own domain. If you do not have a domain, the hosting company will help you purchase one (Fig. 3).
2 Literature Survey
Here, we discuss related work on mortality prediction for ICU patients using different models and approaches. Sharma et al. [20] surveyed the mortality rate after a patient is discharged from the ICU and the life prediction rate
for five years, analyzing how long a patient can survive over a certain period of time, which helps in calculating the expected survival rate of the patient. Silva et al. [8] describe the ICU patient data used for the challenge, which consist of records of admissions to cardiac, medical, surgical, and trauma ICUs. Information gathered in the first two days of an ICU stay is used to indicate which patients survive and which do not. This was done on the cloud network back in 2012; we additionally use telemedicine to help ensure that patient outcomes are stored on the cloud server in a more secure way. Sadeghi et al. [21] present an approach to forecasting mortality using 8 classifiers and a minimal set of features (12 numerical and signal-based features) derived from patients’ heart signals during the first hour of ICU admission. To predict risk, predictive characteristics are calculated from the cardiovascular signal of ICU patients. Because various laboratory tests are time-consuming, their method avoids them; it is evaluated by accuracy, recall, F1-score, and area under the receiver operating characteristic curve (AUC), and the decision tree classifier is satisfactory in both accuracy and interpretability. It therefore shows that heart rate signals can be used to predict mortality in care unit patients, especially in CCUs. In our work we likewise use the first 24 h after the patient’s admission to the ICU, and we use a cloud server for better performance than the approach of that paper. Johnson et al. [22] extract data at a random time during each patient’s ICU stay, which allows the model to be applied throughout an upcoming patient’s entire ICU stay. They report an AUROC of 0.920 for a Gradient Boosting model and compare this model, using data from the first 48 h of a patient’s stay, against published severity of illness scores.
They report that the Gradient Boosting model performs better than the other models and may provide an accurate prediction and a summary of the patient's health. Neloy et al. [23] aim to build an adequate system for critical patients with a real-time feedback method; we provide comparable real-time results through a website hosted on a cloud server. They propose a basic architecture, the associated terminology, and a classification model built with machine learning and IBM Cloud as Platform as a Service (PaaS), and they developed a mobile application for viewing real-time data and information. Caicedo-Torres and Gutierrez [24] use MIMIC-III to train a deep learning model that predicts mortality from raw nursing notes, together with visual explanations of word importance. Their model, ISeeU2, a convolutional neural network for mortality prediction, reaches a ROC of 0.8629, is compared against the SAPS-II score, and offers enhanced interpretability. We use a cloud platform instead of deep learning, taking the same MIMIC-III data to predict the mortality of ICU patients. Thorsen-Meyer et al. [25] train a machine learning model on quantitative data from individual hospitals, based on SAPS III variables. They use static data and physiological time-series data from electronic medical records to train a recurrent neural network with a 1-hour time resolution. The model was validated internally by the holdout method on 20% of the training data and externally on previously unseen data. Performance was evaluated with the Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic curve (AUROC), with 95% CIs obtained by bootstrapping with 1000 samples drawn with replacement. Nemati et al. [26] use data available in real time for AISE (Artificial Intelligence Sepsis Expert), a model for
ICU patients that can correctly predict the onset of sepsis in an ICU patient from four to twelve hours before clinical recognition.
2.1 ICU with Cloud

Only a few recent papers have worked with cloud computing; they are summarized below. The authors of [24] used deep learning, which is often run on cloud infrastructure, with MATLAB or other software. They used free-text medical notes to predict the mortality of ICU patients; we do the same by recording the nurses' text notes from time to time while monitoring ICU patients. They used Shapley values with a convolutional architecture in their experimental setting, whereas we use PHP and a MySQL server for mortality prediction of ICU patients. The authors of [27] studied AI-oriented medical workload allocation for hierarchical cloud/edge/device computing for ICU patients, to achieve the minimum response time in life-saving situations. Their applications include short-of-breath alerts, patient phenotype classification, and life-death threats, and their experimental results demonstrate high efficiency and effectiveness in real-life health-care and emergency applications. We, by contrast, do the whole work on a cloud platform, collecting data from the blood test report of the patient's first 24 h, because a cloud platform serves a small website like ours well. The authors of [28] present a procedure for mortality prediction in patients with advanced penile cancer using machine learning, as no early mortality prediction existed for penile cancer. Their predictive features are based on patient demographics, performance, metastasis, lymph node biopsy criteria, and others. Machine learning, deep learning and cloud computing are all commonly delivered over the internet, and a cloud platform combined with suitable programming languages can give a reliable result that is easy for the patient or the physician to understand. The authors of [29] present a monitoring framework called Intelligent Remote Patient Monitoring and propose an architecture for all major groups in any healthcare service. Cost-effective management is done entirely on a cloud-based server, enabling hospitals to achieve high-speed medical processes and to raise an alert when an undesired medical condition is predicted for a patient. We have also used the cloud platform so that the patient and the doctor can easily monitor and obtain lab test results for further observation.
3 Computing ICU Mortality

Knowing the mortality rate of ICU patients is necessary because it helps in predicting survival or in-hospital death. The main concept of this work is to predict the rate of in-hospital death from data-sets collected from various ICU patients, using Python with machine learning. A second part of the work is the demo of a website, “ICU INFORMATION PORTAL”, which mainly deals with ICU patients' data
maintenance using cloud computing. On this website patients and doctors can check results and, most importantly, the work predicts the mortality of the patient.
3.1 Model Development

We used machine learning to predict the mortality rate of patients admitted to the ICU. Specifically, we used two models, DecisionTreeClassifier and RandomForestClassifier, and evaluated them with four metrics: precision, specificity, sensitivity and accuracy, which are computed as the models update their predictions by integrating and learning from the given data-set. A machine learning model takes a given data sequence, then learns and retains the patterns that are useful for such big data [30]. The ML algorithms combine data from previous information to make predictions about in-hospital death, making them appropriate for time-series prediction. We used the holdout method for internal validation, splitting the data into a training data-set of 90/80/70/60% and a corresponding test data-set of 10/20/30/40%. All four metrics are derived from the test data-set and the external data-set. To deal with the same variable repeating for a patient-id in the table, we used a group of for-loops in Python, so that when values are inserted at different times no second variable of the same name is created in that table. To measure the ability to discriminate non-survivors from survivors, we used the Matthews correlation coefficient (MCC), positive and negative predictive values, and positive and negative ratios. The MCC is written as

$$\mathrm{MCC} = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{1}$$
where TP and TN denote true positives and true negatives, and FP and FN denote false positives and false negatives. The MCC (1) is a quality measure for classification models. The positive and negative ratios summarize the value of performing a test. They combine the sensitivity (recall) and the specificity of the model and are used to determine whether a positive or negative prediction usefully changes the probability that a patient will die. The positive outcome ratio is calculated as

$$\frac{\text{sensitivity}}{1 - \text{specificity}} = \frac{P(\text{positive test} \mid \text{disease})}{P(\text{positive test} \mid \text{no disease})} \tag{2}$$

and the negative outcome ratio as

$$\frac{1 - \text{sensitivity}}{\text{specificity}} = \frac{P(\text{negative test} \mid \text{disease})}{P(\text{negative test} \mid \text{no disease})} \tag{3}$$
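Equations (1) to (3) can be evaluated directly from the four confusion-matrix counts. The following is a minimal sketch rather than the chapter's own code: the function names `mcc` and `likelihood_ratios` and the example counts are assumptions made for illustration.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient, Eq. (1); 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return ((tp * tn) - (fp * fn)) / denom if denom else 0.0

def likelihood_ratios(tp, fp, tn, fn):
    """Positive and negative outcome ratios, Eqs. (2) and (3)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    positive = sensitivity / (1 - specificity) if specificity < 1 else math.inf
    negative = (1 - sensitivity) / specificity if specificity > 0 else math.inf
    return positive, negative

# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
tp, fp, tn, fn = 40, 10, 80, 20
print(round(mcc(tp, fp, tn, fn), 3))
print(likelihood_ratios(tp, fp, tn, fn))
```

A positive outcome ratio well above 1 means that a positive prediction raises the estimated probability of death, and a negative outcome ratio well below 1 means that a negative prediction lowers it.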
We report the confusion matrix in percentages so that it is easy to read for both machine learning models. The variables used in Python to hold the data are xtrain, xtest, ytrain and ytest, with ypred holding the predicted outcome for each test size.
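The holdout splits and the percentage confusion matrix can be sketched with the standard library alone. This is an illustration under stated assumptions: `holdout_split` and `confusion_percentages` are hypothetical helpers, the toy data are invented, and a fixed rule stands in for a fitted DecisionTreeClassifier or RandomForestClassifier.

```python
import random

def holdout_split(x, y, test_size, seed=0):
    """Shuffle the indices once and hold out the given fraction as the test set."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(x) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    xtrain = [x[i] for i in train_idx]
    xtest = [x[i] for i in test_idx]
    ytrain = [y[i] for i in train_idx]
    ytest = [y[i] for i in test_idx]
    return xtrain, xtest, ytrain, ytest

def confusion_percentages(ytest, ypred):
    """2x2 confusion matrix with each row (actual class) expressed in percent."""
    counts = [[0, 0], [0, 0]]
    for actual, predicted in zip(ytest, ypred):
        counts[actual][predicted] += 1
    return [[100.0 * c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]

# Toy data (hypothetical): one feature per patient, label 1 = in-hospital death.
x = [[v] for v in range(20)]
y = [1 if v >= 15 else 0 for v in range(20)]

for test_size in (0.10, 0.20, 0.30, 0.40):
    xtrain, xtest, ytrain, ytest = holdout_split(x, y, test_size)
    # Stand-in for a fitted classifier: predict death when the feature is at least 15.
    ypred = [1 if row[0] >= 15 else 0 for row in xtest]
    print(test_size, len(xtrain), len(xtest), confusion_percentages(ytest, ypred))
```

Normalizing each row by the number of patients in that actual class keeps the matrices comparable across the four test sizes, even though the absolute counts differ.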
3.2 Chapter Perspective and Scope

“ICU INFORMATION PORTAL” replaces the manual, hard-copy result process. Until now the data have been stored in hard files or on paper; the website uses a cloud server to store all of the data. The main goal of this chapter is to minimize manual work and to speed up result processing. The whole process runs on the cloud server.

User account: The system allows users to log into the accounts issued at the time of admission to the ICU ward, to access their test results for further observation, and to book a health check-up. The system can accommodate a large number of online users at a time, although the number is not precisely specified. It gives patients a platform where their whole case history is shown, so that they can view their lab test results before meeting the physician. On admission to a hospital a patient is given a patient ID number, with which the patient can get test results, a doctor's appointment, and a health check-up. The portal shows patients whether they still need to stay in the ICU, need a health check-up, or need to take a test. He/she remains registered as a patient of that particular hospital until the payment process is completed. Only hospital staff post results on the website, for each particular patient, after the test result comes out. Doctors can also check their patients' results on the website. The admin controls all the portals of the website: any change to the website and any maintenance work can be done only by the admin.

The portal also has a pharmacy section where patients of the hospital can buy all kinds of medicine; if a medicine is out of stock, the patient is informed via SMS, which makes the portal more user-friendly. Payments for the pharmacy and for the hospital stay are made separately, as it is not mandatory to buy medicines from the pharmacy provided by the website. It is assumed that the user (patient) knows how to operate the website and has access to it. As it is a web-based health record system, it requires an operating environment with a GUI for both user and server. “ICU INFORMATION PORTAL” has 4 types of users: Patients, Doctors, Official Staff, Pharmacists (Figs. 4 and 5).
Fig. 4 Types of users
Fig. 5 ER diagram
3.3 Product Design and Implementation

“ICU INFORMATION PORTAL” is a web application that depends heavily on the type and version of the browser installed on the system; a browser version with HTML5 support should be used. The application stores all the test results and data of the ICU patients. Staff members receive the test results before they are uploaded to the site, and their function is to upload the test results of every individual patient assigned to a particular doctor. Doctors can check their own patients to see whether they need further observation. The overall function is to help the medical unit of the ICU ward perform its work with better results by combining the cloud server with some programming languages. Other information can be included if necessary. The work is built on an HTML outline, which is highly flexible. Because the volume of data exchanged or stored is large, we use a cloud-based server so that the database management system performs efficiently. Every
patient's profile contains his/her personal information, results, and length of stay in the ICU. The patient's activities have 3 steps:
• Test Results: Patients who have logged in and are in the ICU ward can view their test results and case history, along with the doctor's name and the date on which each test was taken.
• Health Check-Up: A form in which a current or new patient can book a health check-up with the doctor of their choice.
• Hospital Payment: Those who have stayed in the ICU section pay according to their length of stay in the critical section and for other observations, testing and monitoring. Only after that can they be discharged (Fig. 6).

Every doctor's profile contains his/her personal information, i.e., full name, address, email, password, degree and specialization. It also includes the test results of the doctor's patients and the health check-ups for patients recovered from the ICU or for new patients. The doctor's activities have 2 steps:
• Test Results: After a doctor logs in to the website, it directly shows the patients that he/she is treating in the ICU section with their test results, which indicate the health of each ICU patient.
• Health Check-Up: Doctors receive an appointment notice from the website telling them that a patient has booked an appointment, with the date and time (Fig. 7).
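The patient profile and test-result data described above can be sketched as two related tables. The example is hedged: it uses Python's built-in sqlite3 module as a stand-in for the MySQL server, and the table names, column names and sample rows are hypothetical rather than taken from the chapter's ER diagram.

```python
import sqlite3

# In-memory database as a stand-in for the cloud-hosted MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id   INTEGER PRIMARY KEY,
    full_name    TEXT NOT NULL,
    doctor_name  TEXT NOT NULL,
    stay_days    INTEGER NOT NULL      -- length of stay in the ICU
);
CREATE TABLE test_result (
    result_id    INTEGER PRIMARY KEY,
    patient_id   INTEGER NOT NULL REFERENCES patient(patient_id),
    test_name    TEXT NOT NULL,
    value        REAL NOT NULL,
    test_date    TEXT NOT NULL
);
""")
conn.execute("INSERT INTO patient VALUES (1, 'Example Patient', 'Dr. Example', 3)")
conn.executemany(
    "INSERT INTO test_result (patient_id, test_name, value, test_date) VALUES (?, ?, ?, ?)",
    [(1, "glucose", 110.0, "2020-01-01"), (1, "glucose", 180.0, "2020-01-02")],
)

def case_history(connection, patient_id):
    """Rows a logged-in patient would see: test, value, date and treating doctor."""
    query = """
        SELECT t.test_name, t.value, t.test_date, p.doctor_name
        FROM test_result AS t
        JOIN patient AS p ON t.patient_id = p.patient_id
        WHERE p.patient_id = ?
        ORDER BY t.test_date
    """
    return connection.execute(query, (patient_id,)).fetchall()

print(case_history(conn, 1))
```

Keeping test results in their own table, keyed by patient ID, lets the same query serve both the patient view and the doctor view of the case history.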
Fig. 6 Patient’s activities
Fig. 7 Doctor’s activities
Fig. 8 DFD Level 0
3.4 Data Flow Diagram (DFD)

Figure 8 shows only the context (level 0) of the data flow, whereas Fig. 9 shows the work in more detail. Being level 1, it shows the main processes carried out in this work. One further level (level 2) would show the details of one of the main processes carried out in the work plan.
3.5 Flowchart

The work flowchart outlines the work carried out in the chapter and whether each stage is completed and approved or not. It leads to two separate streams: if the plan is not approved, the work is changed or cancelled; if the design phase is approved, the planning process continues (Fig. 10).
Fig. 9 DFD Level 1
Fig. 10 Flowchart
4 Requirement of Data Driven Computing

The work aims to provide mortality prediction for ICU patients and to let patients check their results easily and book health check-up appointments online. Physicians can get all the information on individual patients online, from home or from any other place. Lab staff can upload results easily without a hard copy. The work aims to provide a user-friendly website. The major functions let logged-in patients view their test results and see which test is abnormal simply by looking for results marked with (*). Logged-in patients can also check the case history of all the tests taken since they were admitted to the critical section. Staff members can generate the contents of all the results with their corresponding doctors, dates of test and individual results. Patients can order medicine and ask for it to be delivered, which is another useful facility of this application. Users should have different user-based access levels: the system should provide different access levels to the admin and to users of the ICU INFORMATION PORTAL. The user interface is designed as a website for the ICU patients' records. The application is built on the HTML framework, the PHP language, JavaScript, CSS and a MySQL database.
• Back-End: MySQL, JavaScript and CSS.
• Front-End: HTML, phpMyAdmin.
• Internet Access: Cloud-based server.
• Hardware Interfaces: Windows, a browser which supports HTML5 and JavaScript.
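The user-based access levels mentioned above can be illustrated with a small permission table. This is only a sketch: the chapter names the user types but not their exact rights, so the action names below are hypothetical.

```python
# Hypothetical permission table; the action names are assumptions for illustration.
PERMISSIONS = {
    "patient": {"view_own_results", "book_checkup", "order_medicine", "pay_bill"},
    "doctor": {"view_patient_results", "view_appointments"},
    "staff": {"upload_results"},
    "pharmacist": {"manage_stock"},
    "admin": {"manage_portal"},
}

def can(role, action):
    """Return True when the given role is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())

print(can("staff", "upload_results"))
print(can("patient", "upload_results"))
```

Checking every request against one table keeps each user type restricted to its own section and denies unknown roles by default.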
4.1 Non-Functional Requirements

• Performance Requirements: The application is interactive and its delays are small, so no action response is noticeably delayed. Opening windows forms, error messages and sessions takes well under 2 s. Opening repositories involves no delay, and a trial or test run completes in less than two seconds. The delay that depends on the distance between two devices connected to the server, and on their configuration for good communication, is less than 20 s.
• Security Requirements: ICU data and information should be transmitted to the server securely, without any change to the ICU patients' data. Only registered users (patient-id, dr-id, staff-id) can enter the website, and a user of a given section can perform only his/her own actions. Moreover, the payment section has a strong authentication mechanism.
• Reliability: The work provides the right tools for discussion and problem solving, ensuring that the system is reliable in its operation and in securing sensitive details, i.e., the data and test results of all the patients.
In the development phase, testing and consultation with users are ongoing, so that the quality of the software is maintained and all the requirements are fulfilled.
• Availability: Test results should be available on the specified date and be accurate, as many patients check their test results.
• Correctness: The results shown must state exactly whether each patient's value is abnormal or normal, as they are used to predict the mortality of critically ill patients.
• Usability: “ICU INFORMATION PORTAL” is easy to handle and navigates in the expected way with no delays. It satisfies most patients' needs.
• Other Requirements: “ICU INFORMATION PORTAL” needs maintenance, as it is a long-running application. It will need refactoring, and the requirements may change further because the field changes frequently. It must always keep the patient's result correct.
4.2 Hardware and Software Design

As the whole work is cloud based, the only hardware required is a laptop, desktop or mobile phone; hardware plays a smaller part than software in this work. We use a cloud server on which all the databases are stored, and the application is built with HTML, PHP, MySQL, JS and CSS. Software therefore plays the main role: it helps us predict the mortality of ICU patients easily. Data are entered in the MySQL server and then uploaded to the cloud server.
5 Results and Discussion

To capture the important features of the model over the desired dataset, four types of experiment were conducted; the details and experimental results are discussed in the following subsections. We used two machine learning algorithms, DecisionTreeClassifier and RandomForestClassifier, each evaluated with four metrics and a confusion matrix. Both algorithms are standard in machine learning, and many others are available. These models are among the best of their kind at present. A model performing with error in the order of tens is sure to have a fine set of feature extractors. The decision tree classifier is a single, simple form of algorithm used in Python, whereas the random forest classifier combines multiple decision trees and therefore takes longer to run and to extract results. Of the four metrics, accuracy is obtained by comparing the actual test-set values with the predicted values. Recall, also called sensitivity, is the ratio TP/(TP + FN), the ability to find all the positive outcomes. Sensitivity is the more difficult analysis, which focuses on the
true values of probabilities. These steps are applied to each model, which is first loaded with pre-trained weights; the fully connected layers are then replaced with random weights, and the model is trained on the given dataset.
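The accuracy and recall definitions used in this section, together with precision and specificity, can be sketched in a few lines of Python. This is an illustration only: the function name `four_metrics` and the toy labels are assumptions, with 1 standing for in-hospital death and 0 for survival.

```python
def four_metrics(ytest, ypred):
    """Precision, specificity, sensitivity (recall) and accuracy for 0/1 labels."""
    pairs = list(zip(ytest, ypred))
    tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)
    tn = sum(1 for actual, pred in pairs if actual == 0 and pred == 0)
    fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)
    fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent from the data.
        return numerator / denominator if denominator else 0.0

    return {
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),
        "sensitivity": ratio(tp, tp + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }

# Toy labels (hypothetical): 3 deaths and 5 survivors.
ytest = [1, 1, 1, 0, 0, 0, 0, 0]
ypred = [1, 1, 0, 1, 0, 0, 0, 0]
print(four_metrics(ytest, ypred))
```

With imbalanced classes, as in ICU mortality data, accuracy alone can look high while sensitivity stays low, which is why the four metrics are reported together.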
5.1 Result Analysis

The results of the experiments are organized into tables and analysed accordingly. Our model development included a total of 4000 patients' records in each of three sets. For the DecisionTreeClassifier, the confusion matrix contains 4093 (46.5%) patients with in-hospital death in the ICU, 4704 (53.5%) patients with a slightly higher percentage of in-hospital death, 4765 (9.5%) patients with a slightly higher percentage of surviving in-hospital death, and 45,463 (90.5%) patients who survived in-hospital death. The DecisionTreeClassifier reached an accuracy of 0.84 on both the training set and the testing set for predicting in-hospital death. The confusion matrix of each machine learning model is shown below. Model performance decreased slightly as prediction took more time, regardless of the in-hospital death point of interest (Figs. 11 and 12). For the RandomForestClassifier, the confusion matrix contains 2274 (26%) patients with in-hospital death in the ICU, 6466 (74%) patients with a slightly higher percentage of in-hospital death, 739 (3.3%) patients with a slightly higher percentage of surviving in-hospital death, and 49,546 (96.7%) patients who survived in-hospital death. Its accuracy was 0.88 on both the training set and the testing set for predicting in-hospital death. Table 1 corresponds to the experiment conducted on this two-class classification problem with the four metrics (Figs. 13 and 14).
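The reported accuracies can be cross-checked from the four counts listed for each model. The following is a minimal sketch, assuming that the first and last count of each model are the correctly classified patients (the text does not label the cells explicitly); the helper name `accuracy_from_cells` is hypothetical.

```python
# Counts as listed in the text for each model, in the order they appear there.
reported = {
    "DecisionTreeClassifier": [4093, 4704, 4765, 45463],
    "RandomForestClassifier": [2274, 6466, 739, 49546],
}

def accuracy_from_cells(cells):
    """Accuracy, assuming the first and last cells hold the correct predictions."""
    return (cells[0] + cells[3]) / sum(cells)

for name, cells in reported.items():
    print(name, sum(cells), round(accuracy_from_cells(cells), 2))
```

Under that assumption both models are evaluated on the same 59,025 records, and the computed values round to the reported 0.84 and 0.88.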
Fig. 11 Decision tree classifier
Fig. 12 Confusion matrix
Fig. 13 Random forest classifier
Fig. 14 Confusion matrix
Thus, these two machine learning models predict in-hospital death in the ICU ward; the predictions were computed in a Python Jupyter notebook.
5.2 System Design and Implications

This section describes how the work is realized as a cloud-based web application. System design ensures that the design meets the requirements specified in the requirements documentation. It describes the system architecture, software, hardware, database design, and security. The figures here show the interface that users see in ICU INFORMATION PORTAL (Fig. 15).

Main Page: After a patient logs in with a username and password, the main page gives access to several sections, such as test results and find a doctor under the patient's guide, and health check-up under specialties and services. A pharmacy option is available for buying medicines; any medicine that goes out of stock is restocked. Payment is made through the payment portal when the patient is discharged from the ICU ward (Fig. 16).

Test Report Page: To reach the test result page, the patient logs in with the username and password on the login page, which leads to the main page described above. From there the patient selects the patient's guide, where the test result option appears in the drop-down menu. Test results are provided by the lab technician and uploaded to the website by staff members so that the patient can see them. Any abnormal result is shown in red and marked with {*}. Also, doctors can
Fig. 15 Home page
Fig. 16 Main page
Fig. 17 Test Result
check the patient's result to predict the mortality of the ICU patient. The design is user-friendly (Fig. 17).
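The marking of abnormal results with an asterisk can be sketched as a simple range check. The reference ranges and test names below are hypothetical placeholders; real limits depend on the laboratory and the assay, and the portal's own PHP implementation is not shown in the chapter.

```python
# Hypothetical reference ranges; real limits depend on the laboratory and the assay.
REFERENCE_RANGES = {
    "glucose_mg_dl": (70.0, 140.0),
    "heart_rate_bpm": (60.0, 100.0),
}

def flag_results(results):
    """Append '*' to every value that falls outside its reference range."""
    flagged = []
    for test_name, value in results:
        low, high = REFERENCE_RANGES[test_name]
        marker = "" if low <= value <= high else "*"
        flagged.append(f"{test_name}: {value}{marker}")
    return flagged

print(flag_results([("glucose_mg_dl", 180.0), ("heart_rate_bpm", 72.0)]))
```

In the portal the same marker would be paired with red text, so that both the patient and the doctor can spot abnormal values at a glance.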
5.3 Implementation on Cloud

System implementation is the creation and installation of the system following engineering principles, removing the human element from the process. It covers how the work is set up so that patients can check and see their results. The components implemented for this work are:
• AWS server: One of the most widely used cloud server platforms. We use an AWS server to store the database on the cloud.
• Domain Hosting: GoDaddy provides domain hosting, and we use it to complete the cloud setup by pointing the domain to the IP address of the AWS EC2 server. There are many other domain hosting websites, but a domain name is not free; it carries some charges.
• phpMyAdmin: phpMyAdmin runs on the AWS EC2 server, and we integrate it into our system through some coding. Other cloud computing services may be used with other programming languages.
• MySQL server: It serves as the database managed through phpMyAdmin; we use this database and store everything on the cloud platform through some coding.
• Internet Access: High-speed internet is needed to access “ICU INFORMATION PORTAL” without delays. The web application needs the internet because everything is stored on, and accessed from, the cloud; without it the site cannot be reached.
• The following steps connect a website to the cloud platform:
1. Go to http://www.000webhost.com/ in a browser.
2. Select Free Registration.
3. In the FREE registration form, enter a name, a valid e-mail address and the domain name that you would like to use. 000webhost generates a domain similar to the one you ask for.
4. Click the GET FREE Hands button.
5. After verifying your e-mail address, click the Upload Files button.
6. Click the Download Files Now tab.
7. Choose the folder public_html.
8. Click the toolbar button to access files.
9. Click the Select Files button, choose the files on your machine and press the Upload Files button (Figs. 18, 19 and 20).
Fig. 18 File upload-1
Fig. 19 File upload-2
Fig. 20 File upload-3
6 Conclusion

In this chapter we have discussed how two high-performance machine learning algorithms predict the mortality of ICU patients, in order to help estimate the ICU mortality rate. Performance decreases as the prediction lead time increases. Patients wrongly classified as likely in-hospital deaths still showed significant mortality, which makes the model useful for clinical and disease processes. The website also reports the abnormal results of critically ill patients in the ICU. A health check-up section is available for registered patients, and a pharmacy lets them order medicines online. The work relies on the internet, machine learning, Python and cloud platforms, all of which are in regular use these days. In future work, we plan to make the mortality prediction more accurate by using other machine learning algorithms and other artificial intelligence techniques. We can also upgrade the demo into a fully working website that can check all the results of a particular hospital.
References

1. D.W. De Lange, S. Brinkman, H. Flaatten, A. Boumendil, A. Morandi, F.H. Andersen, A. Artigas, G. Bertolini, M. Cecconi, S. Christensen, L. Faraldi, Cumulative prognostic score predicting mortality in patients older than 80 years admitted to the ICU. J. Am. Geriatr. Soc. 67(6), 1263–1267 (2019)
2. A. Schoe, F. Bakhshi-Raiez, N. de Keizer, J.T. van Dissel, E. de Jonge, Mortality prediction by SOFA score in ICU-patients after cardiac surgery; comparison with traditional prognostic-models. BMC Anesthesiology 20(1), 1–8 (2020)
3. L. Guo, D. Wei, Y. Wu, M. Zhou, X. Zhang, Q. Li, J. Qu, Clinical features predicting mortality risk in patients with viral pneumonia: the MuLBSTA score. Front. Microbiol. 10, 2752 (2019)
4. C.A. Hu, C.M. Chen, Y.C. Fang, S.J. Liang, H.C. Wang, W.F. Fang, C.C. Sheu, W.C. Perng, K.Y. Yang, K.C. Kao, C.L. Wu, Using a machine learning approach to predict mortality in critically ill influenza patients: a cross-sectional retrospective multicentre study in Taiwan. BMJ Open 10(2), e033898 (2020)
5. F.S. Ahmad, L. Ali, H.A. Khattak, T. Hameed, I. Wajahat, S. Kadry, S.A.C. Bukhari, A hybrid machine learning framework to predict mortality in paralytic ileus patients using electronic health records (EHRs). J. Ambient Intell. Humanized Comput. 1–11 (2020)
6. W.P. Brouwer, S. Duran, M. Kuijper, C. Ince, Hemoadsorption with CytoSorb shows a decreased observed versus expected 28-day all-cause mortality in ICU patients with septic shock: a propensity-score-weighted retrospective study. Crit. Care 23(1), 317 (2019)
7. P. Reis, A.I. Lopes, D. Leite, J. Moreira, L. Mendes, S. Ferraz, T. Amaral, F. Abelha, Predicting mortality in patients admitted to the intensive care unit after open vascular surgery. Surg. Today 49(10), 836–842 (2019)
8. I. Silva et al., Predicting in-hospital mortality of ICU patients: the PhysioNet/Computing in Cardiology Challenge 2012, in 2012 Computing in Cardiology (IEEE, 2012), pp. 245–248
9. D.H. Li, R. Wald, D. Blum, E. McArthur, M.T. James, K.E. Burns, J.O. Friedrich, N.K. Adhikari, D.M. Nash, G. Lebovic, A.K. Harvey, Predicting mortality among critically ill patients with acute kidney injury treated with renal replacement therapy: development and validation of new prediction models. J. Crit. Care 56, 113–119 (2020)
10. Z. Zhang, B. Zheng, N. Liu, H. Ge, Y. Hong, Mechanical power normalized to predicted body weight as a predictor of mortality in patients with acute respiratory distress syndrome. Intensive Care Med. 45(6), 856–864 (2019)
11. S. Mandal, S. Biswas, V.E. Balas, R.N. Shaw, A. Ghosh, Motion prediction for autonomous vehicles from lyft dataset using deep learning, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India (2020), pp. 768–773. https://doi.org/10.1109/iccca49541.2020.9250790
12. Y. Belkhier, A. Achour, R.N. Shaw, Fuzzy passivity-based voltage controller strategy of grid-connected PMSG-based wind renewable energy system, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India (2020), pp. 210–214. https://doi.org/10.1109/iccca49541.2020.9250838
13. B.H. Chen, H.J. Tseng, W.T. Chen, P.C. Chen, Y.P. Ho, C.H. Huang, C.Y. Lin, Comparing eight prognostic scores in predicting mortality of patients with acute-on-chronic liver failure who were admitted to an ICU: a single-center experience. J. Clin. Med. 9(5), 1540 (2020)
14. V. Mandalapu et al., Understanding the relationship between healthcare processes and in-hospital weekend mortality using MIMIC III. Smart Health 14, 100084 (2019)
15. P.S. Marshall, Tele-ICU in precision medicine: it's not what you do, but how you do it, in Precision in Pulmonary, Critical Care, and Sleep Medicine (Springer, 2020), pp. 321–331
16. R.D. Kindle et al., Intensive care unit telemedicine in the era of the big data, artificial intelligence, and computer clinical decision support system. Critical Care Clinics 35(3), 483–495 (2019)
17. I. Das, R.N. Shaw, S. Das, Performance analysis of wireless sensor networks in presence of faulty nodes, in 2020 IEEE 5th International Conference on Computing Communication and Automation (ICCCA), Greater Noida, India (2020), pp. 748–751. https://doi.org/10.1109/iccca49541.2020.9250724
18. S. Mandal, V.E. Balas, R.N. Shaw, A. Ghosh, Prediction analysis of idiopathic pulmonary fibrosis progression from OSIC dataset, in 2020 IEEE International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India (2020), pp. 861–865. https://doi.org/10.1109/gucon48875.2020.9231239
19. R.N. Shaw, P. Walde, A. Ghosh, IOT based MPPT for performance improvement of solar PV arrays operating under partial shade dispersion, in 2020 IEEE 9th Power India International Conference (PIICON), Sonepat, India (2020), pp. 1–4. https://doi.org/10.1109/49524.2020.9112952
20. A. Sharma et al., Mortality prediction of ICU patients using machine learning: a survey, in Proceedings of the International Conference on Compute and Data Analysis (2017), pp. 245–248
21. R. Sadeghi, T. Banerjee, W. Romine, Early hospital mortality prediction using vital signals. Smart Health 9, 265–274 (2018)
22. A.E.W. Johnson, R.G. Mark, Real-time mortality prediction in the intensive care unit, in AMIA Annual Symposium Proceedings, vol. 2017 (American Medical Informatics Association, 2017), p. 994
23. A.A. Neloy et al., Machine learning based health prediction system using IBM Cloud as PaaS, in 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI) (2019), pp. 444–450
24. W. Caicedo-Torres, J. Gutierrez, ISeeU2: visually interpretable ICU mortality prediction using deep learning and free-text medical notes, arXiv preprint arXiv:2005.09284 (2020)
25. H.-C. Thorsen-Meyer et al., Dynamic and explainable machine learning prediction of mortality in patients in the intensive care unit: a retrospective study of high-frequency data in electronic patient records, in The Lancet Digital Health (2020)
26. S. Nemati et al., An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit. Care Med. 46(4), 547 (2018)
27. T. Hao et al., AI-oriented medical workload allocation for hierarchical cloud/edge/device computing, arXiv preprint arXiv:2002.03493 (2020)
28. R. Chen et al., Machine learning algorithm for mortality prediction in patients with advanced penile cancer, medRxiv (2020)
29. K. Alghatani, R. Abdelmounaam, A cloud-based intelligent remote patient monitoring architecture, in International Conference on Health Informatics & Medical Systems, HIMS, vol. 19 (2019)
30. M. Kumar, V.M. Shenbagaraman, R.N. Shaw, A. Ghosh, Predictive data analysis for energy management of a smart factory leading to sustainability, in Innovations in Electrical and Electronic Engineering, ed. by M. Favorskaya, S. Mekhilef, R. Pandey, N. Singh. Lecture Notes in Electrical Engineering, vol. 661 (Springer, Singapore). https://doi.org/10.1007/978-981-15-4692-1_58
Early Detection of Poisonous Gas Leakage in Pipelines in an Industrial Environment Using Gas Sensor, Automated with IoT Pushan Kumar Dutta, Akshay Vinayak, Simran Kumari, and Mahdi Hussain
Abstract Toxic fumes have severe environmental and life-threatening impacts. People suffer from several diseases because of them, and some have lost their lives. Proper detection of toxic fume leakage is important for industries located within our localities. In this respect, we propose a prototype for sensing toxic fume leakage in an industrial plant. Gas leakage can be easily detected and controlled by using the Internet of things. This project is proposed to avoid industrial mishaps, to monitor harmful fumes and chemicals, to switch off the mainline when leakage is found, and to send alarm messages to the director of the plant in real time using the Internet of things. The NODEMCU ESP8266 Wi-Fi module is used as the primary microcontroller; it is attached to the sensors, such as temperature and various fume sensors, which continuously monitor for leakage. If leakage is found in any pipeline of the system, a warning alarm is generated immediately and the main gas knob is turned off. Data collected by the sensors is saved in a database, where it can be used for further processing and analyzed for developing security management. A monitoring application (Web site or Android app) can be used as a safety aid for workers.
Keywords Gas detection · LPG · Internet of things · Android app · NodeMCU · Wi-Fi module
P. K. Dutta (B) · A. Vinayak · S. Kumari
Amity School of Engineering and Technology, Amity University Kolkata, Kolkata, India
e-mail: [email protected]
A. Vinayak e-mail: [email protected]
S. Kumari e-mail: [email protected]
M. Hussain
Computer Engineering Department, Faculty of Engineering, University of Diyala, Baqubah, Iraq
e-mail: [email protected]
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2021 J. C. Bansal et al. (eds.), Advances in Applications of Data-Driven Computing, Advances in Intelligent Systems and Computing 1319, https://doi.org/10.1007/978-981-33-6919-1_12
1 Introduction
Industries often concentrate on profits without focusing on the health of workers and the environment. In developing countries, industries are generally located outside urban areas to tap water resources, and they do not give much attention to safety and security. For industries situated in the middle of a city, the handling of raw materials needs to be stricter, because many people will be affected if an accident takes place. Accidents can occur in industries because of poor maintenance, individual fault, component malfunction, etc. [1–3]. This project is proposed to avoid technical accidents, to monitor harmful fumes and chemicals, to switch off the mainline when leakage is detected, and to send an alarm notification to the supervisor of the plant in real time using the Internet of things [2]. An LPG leak detection device based on IoT and Arduino can be mounted in houses, hotels, and LPG cylinder storage areas. It makes real-time information available on the Internet for faster access, using a gas sensor that can also recognize other dangerous fumes. This will ensure the safety of people working in the factory and of people living in its surroundings, as shown in Fig. 1.
Fig. 1 Block diagram of the prototype
1.1 Proposed Methodology
Industries that employ miners in deep mines often cannot maintain the environmental safety measures, including detection of hazardous gases, that are needed to provide better living conditions. As communities are made intelligent, safety requirements are placed first. Through an automated management platform, fires in residences and gas leaks are promptly reported. The objective of this work is to present the design of a cost-effective automatic alarm system capable of detecting leakage of liquefied petroleum gas [4–6]. When a leak occurs, a notification is sent to the administrator that there has been a gas leak at the installation site, and the device prompts the user to call the emergency number so that experts can deal with the situation. A gas leak detection and alarm system helps to prevent loss of life and property, as shown in Fig. 2. The main advantage of this project is that it can assess the leakage and send the data to the database, where it can be tracked and corrective measures can be taken. It is also necessary to install smoke and fire detectors to detect flames. The surveillance device therefore offers companies the benefit of being able to track the condition of the atmosphere or of the place where a leakage incident might have happened. The alarm takes the detector as its main body and collects combustible gas, smoke, and temperature readings. When the detected data reaches the preset value, it gives an alarm prompt. Gas leakage is a huge problem at private and industrial sites and in gas-powered transport vehicles, as shown in Fig. 3. One of the preventive measures to avoid the hazard of gas leakage is the installation of a gas leakage detector at vulnerable locations.
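The preset-value check described above can be sketched in portable C++ as follows. This is only an illustration under stated assumptions: the type names, the function `evaluate`, the units, and the rule that only a gas exceedance closes the main line are hypothetical and are not taken from the authors' firmware.

```cpp
#include <cassert>
#include <string>

// Hypothetical preset limits; real values depend on the sensor and its calibration.
struct Thresholds {
    double gasPpm;        // combustible gas limit (assumed unit: ppm)
    double smokePpm;      // smoke limit (assumed unit: ppm)
    double temperatureC;  // temperature limit in degrees Celsius
};

struct Reading {
    double gasPpm;
    double smokePpm;
    double temperatureC;
};

struct AlarmDecision {
    bool alarm;          // raise the alarm and notify the administrator
    bool closeMainLine;  // shut the main gas valve
    std::string reason;  // which quantity reached its preset value
};

// Compares one reading against the preset values. Any quantity at or above
// its limit raises the alarm; in this sketch only a gas exceedance also
// closes the main line (an assumption made for illustration).
AlarmDecision evaluate(const Reading& r, const Thresholds& t) {
    assert(t.gasPpm > 0 && t.smokePpm > 0);  // limits must be configured
    AlarmDecision d{false, false, ""};
    if (r.gasPpm >= t.gasPpm) {
        d = {true, true, "gas"};
    } else if (r.smokePpm >= t.smokePpm) {
        d = {true, false, "smoke"};
    } else if (r.temperatureC >= t.temperatureC) {
        d = {true, false, "temperature"};
    }
    return d;
}
```

On the device, a function of this kind would be called once per sampling cycle, and its result would drive the buzzer, the valve relay, and the notification request.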
Fig. 2 Data flow diagram
Fig. 3 Architecture of ESP8266 components and diagram
1.2 Material Properties and Design Specifications
1.2.1 Hardware Components
NODEMCU ESP8266 (Microcontroller): We have used the NODEMCU ESP8266 as our project microcontroller to integrate all the sensors. It is a well-known microcontroller for Arduino projects because of its embedded Wi-Fi [7] services, which establish a wireless connection between server and client, and because of its flash memory, RAM, and 32-bit processor, which make it a compact and capable board for our project.
1.2.2 Software Components
(a) Arduino Software: To get started with the NODEMCU microcontroller we need an integrated development environment (IDE) to program the board and observe the running state of the sensors. This software is open source and works on Mac, Windows, and Linux. The environment is written in Java, which makes it platform independent. The software can be used with any Arduino board. The programming language, often called the Arduino language, is embedded C or C++. An IDE normally consists of a source code editor, build automation tools, and a debugger. Most modern IDEs have intelligent code completion.
(b) MySQL Database: Sensor data is stored in a MySQL database and retrieved for the dashboard so that operations can be performed on it.
(c) Wireless Communication: A remote connection with the server is established with the help of the TCP/IP and HTTP/HTTPS protocols.
1.3 Work Process
1.3.1 System Hardware
Independent systems have different designs and are based on different functionalities. In this project, the design includes a single microcontroller, various types of sensors, Internet connectivity for the prototype, and a laptop. The Wi-Fi module is a particularly cost-effective board with an enormous and ever-growing community. The ESP8266 Wi-Fi module is a self-contained SoC with an integrated TCP/IP protocol stack that can give any microcontroller access to a Wi-Fi network. The ESP8266 is capable of either hosting an application or offloading all Wi-Fi networking functions from another application processor [2, 8, 9]. To observe the behavior of a specific zone of interest and to gather data, the sensor devices are placed at various locations. The principal objective of this paper is to design and implement an adequate monitoring system through which the required parameters are monitored and controlled remotely over the Web; the data gathered from the sensors is stored in the cloud and displayed in the Internet browser to show the estimated trend. Specific sensors were used to detect specific harmful gases. The MQ-5 is used as an alkane gas sensor, the MQ-2 is used as a propane/butane gas sensor, and the MQ-8 is used as a gas sensor. Variable capacitors are embedded on the circuit of each unit to monitor the response of the sensor. These sensors are placed close to the gas supply and observe it together. In the current working process, the microcontroller is configured with the MQ9, DHT22, and MQ135 sensors to read the gas levels and temperature of the surroundings. The system generates an output, which is then sent to the remote server over a wireless connection using the TCP/IP and HTTPS protocols [10–12]. The server is connected to the cloud (database), and the data is stored in the cloud itself.
This output is also reflected in the dashboard, which fetches the data from the database and helps the user keep an eye on any kind of leakage or rise in temperature, as shown in Figs. 5 and 6.
1.3.2 Connections and Circuitry
(a) NODEMCU ESP8266: To operate the microcontroller, a power source of 3.3–4.7 V is needed, which is supplied through the USB port of a laptop or by batteries.
(b) DHT 22 (Temperature and Humidity): The data pin of the sensor is connected to digital pin D5 of the microcontroller.
(c) MQ2: Sampling rate: uploading the code requires a rate of 9600–15,200 bps.
(d) MQ135: Sampling rate: uploading the code requires a rate of 9600–15,200 bps.
Fig. 4 Pinouts of NODEMCU ESP8266
Fig. 5 Connection
Fig. 6 Circuitry
2 Creation of Database and Web site
2.1 Hosting PHP Application and Creation of MySQL Database
The goal here is to create a database along with the Web site for the project so that sensor data can be stored and analyzed as required. For this, we need a domain name for the project and a hosting account, which allow the user to store sensor readings from the microcontroller, i.e., the NODEMCU ESP8266, and to visualize the readings from anywhere in the world by accessing the server name and domain address [13].
2.2 Creation of Application Programming Interfaces (API) Key
An API makes data programmatically accessible to other computers. Through it, a computer can view and edit data, just as a person can by loading pages with sensor data and submitting forms. When systems are linked through an API, there are two sides: the server that serves the API, and the client that consumes the API and can manipulate it. Here, the API key value is generated from the GoDaddy developers page and then used in both the PHP code and the ESP8266 code so that the data can be transferred from the NODEMCU to the Web site.
(a) $api_key_value = "3mM44UaC2DjFcV_63GZ14aWJcRDNmYBMsxceu";
Here, an API key value is also generated from Google's cloud platform for using maps services, such as
    src=https://maps.googleapis.com/maps/api/js?key=AIzaSyAk4BBCmeZuWv0oV_uTjF9wBeW9YocUbOc&callback=myMap
(b) Preparing MySQL Database
Creation of the database, username, password, and SQL table.
    $dbname = "Data_Final1";
    $username = "project";
    $password = "project@123";
(c) Structured Query Language (SQL)
Structured query language is a standard database language that is used in this project to create, maintain, and query the relational database.
2.2.1 Creating a SQL Table
After creating the database and user account, the database table for the project is created with the help of cPanel and “phpMyAdmin”. To create the table, the following code snippet is required.
    CREATE TABLE SensorData (
        id INT(10) UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        value1 VARCHAR(10),
        value2 VARCHAR(10),
        value3 VARCHAR(10),
        value4 VARCHAR(10),
        reading_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
    )
Value insertion into the database:
    $sql = "INSERT INTO SensorData (id, value1, value2, value3, value4) VALUES ('" . $id . "', '" . $value1 . "', '" . $value2 . "', '" . $value3 . "', '" . $value4 . "')";
Displaying the data on the Web site:
    $sql = "SELECT id, value1, value2, value3, reading_time FROM SensorData ORDER BY id DESC LIMIT 1";
(d) Use of HTTP (Hypertext Transfer Protocol) in posting request
The Hypertext Transfer Protocol (HTTP) is designed to enable communication between clients and servers. HTTP works as a request-response protocol: when the client submits an HTTP request to the server, the server returns a response to the client. This response contains the requested content, here the data displayed from the sensors. POST is used to send data to a server to create or update a resource.
Function in PHP for taking the input:
    $value1 = test_input($_POST["value1"]);
Preparing HTTP POST request data (the code snippet is from the Arduino IDE):
    String httpRequestData = "api_key=" + apiKeyValue
        + "&value1=" + String(dht.readTemperature())
        + "&value2=" + String(dht.readHumidity())
        + "&value3=" + String(gps.location.lat(), 8)
        + "&value4=" + String(gps.location.lng(), 8)
        + "&value5=" + String(analogRead(Vib));
    Serial.println(httpRequestData);
Sending an HTTP POST request:
    int httpResponseCode = http.POST(httpRequestData);
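Because the request body is assembled by hand, a missing `&` or an unescaped character in a value silently corrupts the fields the PHP script receives. The following is a minimal sketch, in portable C++ using `std::string` rather than the Arduino `String` class, of building such a form-encoded body from key/value pairs. The helper names `formEncode` and `buildFormBody` are hypothetical and are not part of the authors' sketch or of any ESP8266 library.

```cpp
#include <cassert>
#include <cctype>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Encodes one key or value for a form-encoded body: letters, digits,
// '-', '_' and '.' pass through, a space becomes '+', and every other
// byte becomes %XX.
std::string formEncode(const std::string& in) {
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.') {
            out += static_cast<char>(c);
        } else if (c == ' ') {
            out += '+';
        } else {
            char buf[4];
            std::snprintf(buf, sizeof buf, "%%%02X", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}

// Joins the pairs as key1=value1&key2=value2..., so the separator can
// never be forgotten for an individual field.
std::string buildFormBody(
    const std::vector<std::pair<std::string, std::string>>& fields) {
    std::string body;
    for (const auto& f : fields) {
        assert(!f.first.empty());  // a field needs a name
        if (!body.empty()) body += '&';
        body += formEncode(f.first) + '=' + formEncode(f.second);
    }
    return body;
}
```

For example, `buildFormBody({{"api_key", "k1"}, {"value1", "24.50"}})` yields `api_key=k1&value1=24.50`, which has the same shape as the `httpRequestData` string above.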
2.2.2 Adding Dynamic Graph to the Web site
The Web site is linked with another page that displays graphs of the sensor data for the whole day. Here, we display graphs of the temperature, humidity, vibration, and oxygen data values. Below is the code snippet for the creation of graphs.
    // function for creation of graphs.
    function createTemperatureGraph() {
      const temperature = document.getElementById('temperature').getContext('2d');
      temperaturemyChart = new Chart(temperature, {
        type: 'line',
        data: {
          labels: [],
          datasets: [{
            label: "Temperature (Celsius)",
            data: [],
            backgroundColor: "transparent",
            borderColor: "orange"
          }]
        },
        options: {
          scales: {
            yAxes: [{ ticks: { beginAtZero: false } }]
          }
        }
      });
    }
Adding the download option of the data set. Along with the display of graphs, there is an option of downloading the data set for the whole day. The data set is available in .txt and .json format. Below is the code snippet for the creation of the download option.
    // function for creation of download option.
    function generate() {
      link_div.style.display = 'none'
      button.innerHTML = 'Generating... please wait...'
      button.removeEventListener('click', generate)
      fetch("http://emsig.co.in/Data_Final111/data.php", {
        headers: { 'Content-Type': 'application/json' },
      })
        .then(data => data.json())
        .then(data => {
          const all_data = "text/json;charset=utf-8," + encodeURIComponent(JSON.stringify(data))
          setDataForDownload(all_data)
        })
    }
    button.addEventListener('click', generate)
Fig. 7 Readings in the Web site (real time, safe condition)
2.2.3 Libraries Used and Code Snippet
Libraries are collections of code that make it easy to connect to a sensor, actuator, display, module, etc. There are hundreds of additional libraries available on the Internet for different sensors. To use the additional libraries, we need to install them from the library manager, which is available in the sketch dialog box as shown in Figs. 7, 8, 9 and 10.
Libraries Used:
(a) #include : This library is used for obtaining the temperature and humidity readings. It helps in activating the sensor's circuitry so that the sensor can transmit data.
(b) #include : To run the module with Internet connectivity, we require either GSM or Wi-Fi connectivity; for this, a server and client connection needs to be established. This library helps in establishing the connection between the server and the client.
(c) #include : This library is similar to the above library; it identifies the client and helps in connection establishment.
Fig. 8 MQ9 readings
Fig. 9 MQ135 readings
Fig. 10 Comparison of reading using MQ135 and MQ9 sensors in a time interval
Code Snippet:
(a) Connecting to the Internet and server
    const char* ssid = "Akki";          // Wi-Fi credentials
    const char* password = "aezakmi1";  // (user ID, password)
    const char* serverName = "http://emsig.co.in/Data_Final111/post-esp-data.php";  // server name
    String apiKeyValue = "3mM44UaC2DjFcV_63GZ14aWJcRDNmYBMsxceu";  // API value to connect to the specific server and client
(b) Connecting to the client
    WiFi.begin(ssid, password);  // checking ssid and password for connectivity
    Serial.println("Connecting…");
    while (WiFi.status() != WL_CONNECTED) {
        delay(500);  // wait until the connection is established
    }
(c) Fetching data from sensors
    if (WiFi.status() == WL_CONNECTED) {
        HTTPClient http;         // client connected
        http.begin(serverName);  // server begin
        // Specify content-type header
        http.addHeader("Content-Type", "application/x-www-form-urlencoded");
        // Preparing HTTP POST request data
        String httpRequestData = "api_key=" + apiKeyValue
            + "&value1=" + String(dht.readTemperature())
            + "&value2=" + String(dht.readHumidity())
            + "&value3=" + String(gps.location.lat(), 8);
        Serial.print("httpRequestData: ");
        Serial.println(httpRequestData);  // printing/sending data to database
    }
3 Mode of Communication
Radio Frequency: It is the oscillation rate of an alternating electric current or voltage, or of a magnetic, electric, or electromagnetic field or mechanical system, in the frequency range from around twenty thousand times per second (20 kHz) to around three hundred billion times per second (300 GHz).
HTTP/HTTPS: Hypertext Transfer Protocol Secure is an extension of the Hypertext Transfer Protocol (HTTP). It is used for secure communication over a computer network and is widely used on the Internet. In HTTPS, the communication protocol is encrypted using transport layer security (TLS) or, formerly, its predecessor, secure sockets layer (SSL). The protocol is therefore also often referred to as HTTP over TLS or HTTP over SSL.
TCP/IP: The Transmission Control Protocol/Internet Protocol is a suite of communication protocols used to interconnect network devices on the Internet. TCP/IP can also be used as a communications protocol in a private network (an intranet or an extranet).
3.1 Output and Readings
These are the normal readings when the system is running in fine condition (Figs. 8 and 9).
4 Limitation
In this paper, an efficient, low-cost embedded system for smart environmental monitoring is presented with different models. The functions of the various modules in the proposed architecture were discussed. The collected data will be helpful for future analysis and can easily be shared with other end users [14]. The model can be further expanded to monitor developing cities and industrial zones in order to guard public health from pollution, and it provides an efficient and low-cost solution for continuous monitoring of the environment. The gas sensor has both an analogue and a digital output; here we use the analogue output, which is connected to the analogue pin of the microcontroller. The DHT11 sensor can monitor humidity and air temperature, but here it is used only to measure humidity. The sensor offers fully calibrated digital outputs. It sends 40 bits of data for temperature and humidity together, including the checksum byte (bit error check). The main project is about poisonous gas sensing, which detects different kinds of pollution such as gas and smoke. The readings help us analyze the continuity of the model for the design and development of a robust data acquisition system. This was a small-scale project. A larger-scale project can have many more sensors, such as particulate sensors and ozone, fire, and other poisonous gas sensors. The choice of gas sensors depends on the kind of factory and the kind of poisonous gas it can emit. For minutely accurate readings, industrial sensors are needed, as shown in Figs. 11 and 12.
Fig. 11 Database when sensors are connected in normal environment
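The 40-bit frame mentioned above consists of four data bytes followed by one checksum byte, and the checksum is the low 8 bits of the sum of the data bytes. The sketch below shows that check in portable C++. In practice the DHT library performs it internally, so the names `DhtFrame`, `checksumOk`, and `dht11HumidityPercent` are hypothetical and serve only to illustrate the rule.

```cpp
#include <array>
#include <cstdint>

// One 40-bit DHT frame: humidity (2 bytes), temperature (2 bytes), checksum.
using DhtFrame = std::array<std::uint8_t, 5>;

// The checksum byte must equal the low 8 bits of the sum of the four
// data bytes; any mismatch means the frame was corrupted in transit.
bool checksumOk(const DhtFrame& f) {
    const unsigned sum = f[0] + f[1] + f[2] + f[3];
    return static_cast<std::uint8_t>(sum & 0xFFu) == f[4];
}

// DHT11 layout: byte 0 holds the integral relative humidity in percent.
// Returns -1 when the frame fails the checksum, so a bad reading is
// never stored in the database.
int dht11HumidityPercent(const DhtFrame& f) {
    return checksumOk(f) ? f[0] : -1;
}
```

For instance, the frame `{0x2D, 0x00, 0x1A, 0x00, 0x47}` passes the check because 0x2D + 0x1A = 0x47, and it decodes to a relative humidity of 45 %.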
Fig. 12 Database when sensors are exposed to environment having smoke and LPG gas
5 Future-Scope
On a larger scale, this project can have 5 to 6 NODEMCUs connected to the same server via Wi-Fi, with data readings taken at regular intervals. If any large change is sensed, the system will send an e-mail or another notification. The data collected on a large scale can also be analyzed to track rising changes in the environment and so help protect it.
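One way the server could decide that a "large change" has occurred is to compare each new reading with the previous reading from the same node. The following is a minimal sketch under that assumption; the class `ChangeDetector`, the node identifiers, and the `maxDelta` parameter are hypothetical and do not describe an implemented part of the system.

```cpp
#include <cmath>
#include <map>
#include <string>

// Remembers the previous reading of each node and flags large jumps.
class ChangeDetector {
public:
    explicit ChangeDetector(double maxDelta) : maxDelta_(maxDelta) {}

    // Returns true when the reading differs from the node's previous
    // reading by more than maxDelta. The first reading of a node never
    // triggers, because there is nothing to compare it with.
    bool update(const std::string& nodeId, double value) {
        auto it = last_.find(nodeId);
        const bool large =
            it != last_.end() && std::fabs(value - it->second) > maxDelta_;
        last_[nodeId] = value;
        return large;
    }

private:
    double maxDelta_;
    std::map<std::string, double> last_;
};
```

A `true` result would be the point at which the e-mail or notification is sent. Keeping one history entry per node means that a jump on one NODEMCU is not masked by steady readings from the others.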
6 Conclusion
Power-saving features and a back-up power supply would be essential additions to the system for the event that the main power supply is down. The warning device may also be wired to a monitor and to a phone that dials the controller directly, and also the authorities if dangerous amounts of CO are continually detected for more than one or two hours. A wireless system can be built to collect the data. Because the designed system detects fume and emission leakage in time, the alarm signals people and workers to move to a safe place. Warning people before a disaster occurs lets them escape unsafe situations unharmed, which saves human lives and averts major disasters. Thus, we conclude that the system can play a vital role in ensuring the safety of the workplace.
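The escalation rule suggested above, notifying the authorities only when a dangerous CO level persists, can be sketched as a small timer that resets whenever a safe sample arrives. This is an illustrative sketch only; the class name `SustainedAlert`, the limit, and the hold time are hypothetical parameters that would have to be chosen for the actual installation.

```cpp
#include <cassert>

// Tracks how long the CO level has stayed at or above a limit.
class SustainedAlert {
public:
    SustainedAlert(double limitPpm, long holdSeconds)
        : limitPpm_(limitPpm), holdSeconds_(holdSeconds) {
        assert(holdSeconds > 0);  // a zero hold time would escalate at once
    }

    // Feed one sample with its timestamp in seconds. Returns true once
    // the level has been at or above the limit continuously for at least
    // holdSeconds.
    bool update(long nowSeconds, double coPpm) {
        if (coPpm < limitPpm_) {
            above_ = false;  // any safe sample resets the timer
            return false;
        }
        if (!above_) {
            above_ = true;
            since_ = nowSeconds;  // start of the current exceedance
        }
        return nowSeconds - since_ >= holdSeconds_;
    }

private:
    double limitPpm_;
    long holdSeconds_;
    bool above_ = false;
    long since_ = 0;
};
```

With a hold time of 3600 s, the local alarm can still sound on the first dangerous sample, while the call to the authorities is placed only after an hour of uninterrupted exceedance.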
Glossary
MQ9, MQ135 Gas sensors
IoT Internet of things
Wi-Fi Wireless fidelity
LCD Liquid crystal display
USB Universal Serial Bus
PWM Pulse width modulation
DC Direct current
DHT 22 Temperature sensor
References
1. N. Kaur, R. Mahajan, D. Bagai, P.G. Student, Air quality monitoring system based on Arduino microcontroller. Int. J. Innovative Res. Sci. Eng. Technol. 5(6), 9635–9646 (2016)
2. K. Cornelius, N.K. Kumar, S. Pradhan, P. Patel, N. Vinay, An efficient tracking system for air and sound pollution using IoT, in 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) (IEEE, 2020), pp. 22–25
3. A. Singh, D. Pathak, P. Pandit, S. Patil, P.C. Golar, IOT based air and sound pollution monitoring system. Int. J. Adv. Res. Electr. Electron. Instrum. Eng. 6(3) (2017)
4. S. Taneja, N. Sharma, K. Oberoi, Y. Navoria, Predicting trends in air pollution in Delhi using data mining, in 2016 1st India International Conference on Information Processing (IICIP) (IEEE, 2016), pp. 1–6
5. M. Ashwini, N. Rakesh, Enhancement and performance analysis of LEACH algorithm in IOT, in 2017 International Conference on Inventive Systems and Control (ICISC) (IEEE, 2017), pp. 1–5
6. B.C. Kavitha, R. Vallikannu, IoT based intelligent industry monitoring system, in 2019 6th International Conference on Signal Processing and Integrated Networks (SPIN) (IEEE, 2019), pp. 63–65
7. R. Sindhwani, P. Goyal, S. Kumar, A. Kumar, Assessment of gaseous and respirable suspended particulate matter (PM10) emission estimates over megacity Delhi: Past trends and future scenario (2000–2020), in Center for atmospheric sciences (2012), pp. 123–140
8. A. Guthi, Implementation of an efficient noise and air pollution monitoring system using Internet of Things (IoT). Int. J. Adv. Res. Comput. Commun. Eng. 5(7), 237–242 (2016)
9. V. Kameshwaran, M.R. Baskar, Realtime low-cost air and noise pollution monitoring system. Int. J. Pure Appl. Math. 119(18), 1589–1593 (2018)
10. M. Benammar, A. Abdaoui, S.H.M. Ahmad, F. Touati, A. Kadri, A modular IoT platform for real-time indoor air quality monitoring. Sensors 18(2), 581 (2018)
11. Y. Gao, W. Dong, K. Guo, X. Liu, Y. Chen, X. Liu, J. Bu, C. Chen, Mosaic: A low-cost mobile sensing system for urban air quality monitoring, in IEEE INFOCOM 2016-The 35th Annual IEEE International Conference on Computer Communications (IEEE, 2016), pp. 1–9
12. R. Nayak, M.R. Panigrahy, V.K. Rai, T.A. Rao, IOT based air pollution monitoring system 3 (2017)
13. R.N. Shaw, P. Walde, A. Ghosh, IOT based MPPT for performance improvement of solar PV arrays operating under partial shade dispersion, in 2020 IEEE 9th Power India International Conference (PIICON) (SONEPAT, India, 2020), pp. 1–4. https://doi.org/10.1109/piicon49524.2020.9112952
14. M. Kumar, V.M. Shenbagaraman, R.N. Shaw, A. Ghosh, Predictive data analysis for energy management of a smart factory leading to sustainability, in Innovations in Electrical and Electronic Engineering. Lecture Notes in Electrical Engineering, vol. 661, eds. by M. Favorskaya, S. Mekhilef, R. Pandey, N. Singh (Springer, Singapore, 2021). https://doi.org/10.1007/978-981-15-4692-1_58